
It's the Effect Size, Stupid¹

What effect size is and why it is important
Paper presented at the British Educational Research Association annual conference,
Exeter, 12-14 September, 2002

Robert Coe

School of Education, University of Durham, Leazes Road, Durham DH1 1TA
Tel 0191 334 4184 Fax 0191 334 4180 Email r.j.coe@dur.ac.uk

Abstract
Effect size is a simple way of quantifying the difference between two groups
that has many advantages over the use of tests of statistical significance alone. Effect
size emphasises the size of the difference rather than confounding this with sample
size. However, primary reports rarely mention effect sizes and few textbooks, research
methods courses or computer packages address the concept. This paper provides an
explication of what an effect size is, how it is calculated and how it can be interpreted.
The relationship between effect size and statistical significance is discussed and the
use of confidence intervals for the latter outlined. Some advantages and dangers of
using effect sizes in meta-analysis are discussed and other problems with the use of
effect sizes are raised. A number of alternative measures of effect size are described.
Finally, advice on the use of effect sizes is summarised.

Effect size is simply a way of quantifying the size of the difference between
two groups. It is easy to calculate, readily understood and can be applied to any
measured outcome in Education or Social Science. It is particularly valuable for
quantifying the effectiveness of a particular intervention, relative to some comparison.
It allows us to move beyond the simplistic, 'Does it work or not?' to the far more
sophisticated, 'How well does it work in a range of contexts?' Moreover, by placing
the emphasis on the most important aspect of an intervention (the size of the effect)
rather than its statistical significance (which conflates effect size and sample size), it
promotes a more scientific approach to the accumulation of knowledge. For these
reasons, effect size is an important tool in reporting and interpreting effectiveness.
The routine use of effect sizes, however, has generally been limited to
meta-analysis (for combining and comparing estimates from different studies) and is all
too rare in original reports of educational research (Keselman et al., 1998). This is
despite the fact that measures of effect size have been available for at least 60 years
(Huberty, 2002), and the American Psychological Association has been officially
encouraging authors to report effect sizes since 1994, but with limited success
(Wilkinson et al., 1999). Formulae for the calculation of effect sizes do not appear in
most statistics textbooks (other than those devoted to meta-analysis), are not featured
in many statistics computer packages and are seldom taught in standard research
methods courses. For these reasons, even the researcher who is convinced by the
wisdom of using measures of effect size, and is not afraid to confront the orthodoxy of
conventional practice, may find that it is quite hard to know exactly how to do so.

¹ During the 1992 US Presidential election campaign, Bill Clinton's fortunes were transformed
when his advisors helped him to focus on the main issue by writing 'It's the economy, stupid' on a
board they put in front of him every time he went out to speak.

The following guide is written for non-statisticians, though inevitably some
equations and technical language have been used. It describes what effect size is,
what it means, how it can be used and some potential problems associated with using
it.

1. Why do we need effect size?
Consider an experiment conducted by Dowson (2000) to investigate time of
day effects on learning: do children learn better in the morning or afternoon? A group
of 38 children were included in the experiment. Half were randomly allocated to
listen to a story and answer questions about it (on tape) at 9am, the other half to hear
exactly the same story and answer the same questions at 3pm. Their comprehension
was measured by the number of questions answered correctly out of 20.
The average score was 15.2 for the morning group, 17.9 for the afternoon
group: a difference of 2.7. But how big a difference is this? If the outcome were
measured on a familiar scale, such as GCSE grades, interpreting the difference would
not be a problem. If the average difference were, say, half a grade, most people
would have a fair idea of the educational significance of the effect of reading a story
at different times of day. However, in many experiments there is no familiar scale
available on which to record the outcomes. The experimenter often has to invent a
scale or to use (or adapt) an already existing one, but generally not one whose
interpretation will be familiar to most people.

Figure 1: panels (a) and (b)

One way to get over this problem is to use the amount of variation in scores to
contextualise the difference. If there were no overlap at all and every single person in
the afternoon group had done better on the test than everyone in the morning group,
then this would seem like a very substantial difference. On the other hand, if the
spread of scores were large and the overlap much bigger than the difference between
the groups, then the effect might seem less significant. Because we have an idea of
the amount of variation found within a group, we can use this as a yardstick against
which to compare the difference. This idea is quantified in the calculation of the
effect size. The concept is illustrated in Figure 1, which shows two possible ways the
difference might vary in relation to the overlap. If the difference were as in graph (a)
it would be very significant; in graph (b), on the other hand, the difference might
hardly be noticeable.

2. How is it calculated?
The effect size is just the standardised mean difference between the two
groups. In other words:

\[ \text{Effect Size} = \frac{[\text{Mean of experimental group}] - [\text{Mean of control group}]}{\text{Standard Deviation}} \]
Equation 1

If it is not obvious which of two groups is the 'experimental' (i.e. the one
which was given the 'new' treatment being tested) and which the 'control' (the one
given the 'standard' treatment, or no treatment, for comparison), the difference can
still be calculated. In this case, the 'effect size' simply measures the difference
between them, so it is important in quoting the effect size to say which way round the
calculation was done.
The 'standard deviation' is a measure of the spread of a set of values. Here it
refers to the standard deviation of the population from which the different treatment
groups were taken. In practice, however, this is almost never known, so it must be
estimated either from the standard deviation of the control group, or from a 'pooled'
value from both groups (see question 7, below, for more discussion of this).
In Dowson's time-of-day effects experiment, the standard deviation (SD) =
3.3, so the effect size was (17.9 - 15.2)/3.3 = 0.8.
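
The arithmetic of Equation 1 can be sketched in a few lines of Python. This is only an illustration: the function name `effect_size` is a label chosen here, not something taken from any statistics package, and the afternoon group is arbitrarily treated as the 'experimental' group.

```python
def effect_size(mean_experimental, mean_control, sd):
    """Standardised mean difference, as in Equation 1."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (mean_experimental - mean_control) / sd


# Afternoon group taken as 'experimental', morning group as 'control'.
# The labelling is arbitrary here, so the direction should be reported.
d = effect_size(17.9, 15.2, 3.3)
print(round(d, 1))  # 0.8
```

Swapping the two means changes only the sign of the result, which is why the direction of the calculation needs to be stated.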

3. How can effect sizes be interpreted?
One feature of an effect size is that it can be directly converted into statements
about the overlap between the two samples in terms of a comparison of percentiles.
An effect size is exactly equivalent to a 'Z-score' of a standard Normal
distribution. For example, an effect size of 0.8 means that the score of the average
person in the experimental group is 0.8 standard deviations above the average person
in the control group, and hence exceeds the scores of 79% of the control group. With
the two groups of 19 in the time-of-day effects experiment, the average person in the
'afternoon' group (i.e. the one who would have been ranked 10th in the group) would
have scored about the same as the 4th highest person in the 'morning' group.
Visualising these two individuals can give quite a graphic interpretation of the
difference between the two effects.
Table I shows conversions of effect sizes (column 1) to percentiles (column 2)
and the equivalent change in rank order for a group of 25 (column 3). For example,
for an effect size of 0.6, the value of 73% indicates that the average person in the
experimental group would score higher than 73% of a control group that was initially
equivalent. If the group consisted of 25 people, this is the same as saying that the
average person (i.e. ranked 13th in the group) would now be on a par with the person
ranked 7th in the control group. Notice that an effect size of 1.6 would raise the
average person to be level with the top ranked individual in the control group, so
effect sizes larger than this are illustrated in terms of the top person in a larger group.
For example, an effect size of 3.0 would bring the average person in a group of 740
level with the previously top person in the group.
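
The conversion from effect size to percentile can be checked with the standard Normal cumulative distribution. The sketch below assumes Normally distributed scores with equal spread in both groups; `percent_of_control_below` is a name chosen here for illustration.

```python
from statistics import NormalDist


def percent_of_control_below(d):
    """Percentage of the control group scoring below the average person
    in the experimental group, assuming Normal distributions."""
    return 100 * NormalDist().cdf(d)


for d in (0.2, 0.6, 0.8, 1.0):
    print(d, round(percent_of_control_below(d)))
# prints 58, 73, 79 and 84 for the four effect sizes
```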

Table I: Interpretations of effect sizes

Columns:
  (1) Effect size
  (2) Percentage of control group who would be below average person in
      experimental group
  (3) Rank of person in a control group of 25 who would be equivalent to the
      average person in experimental group
  (4) Probability that you could guess which group a person was in from
      knowledge of their 'score'
  (5) Equivalent correlation, r (= Difference in percentage 'successful' in
      each of the two groups, BESD)
  (6) Probability that person from experimental group will be higher than
      person from control, if both chosen at random (= CLES)

(1)     (2)       (3)                          (4)     (5)     (6)
0.0     50%       13th                         0.50    0.00    0.50
0.1     54%       12th                         0.52    0.05    0.53
0.2     58%       11th                         0.54    0.10    0.56
0.3     62%       10th                         0.56    0.15    0.58
0.4     66%       9th                          0.58    0.20    0.61
0.5     69%       8th                          0.60    0.24    0.64
0.6     73%       7th                          0.62    0.29    0.66
0.7     76%       6th                          0.64    0.33    0.69
0.8     79%       6th                          0.66    0.37    0.71
0.9     82%       5th                          0.67    0.41    0.74
1.0     84%       4th                          0.69    0.45    0.76
1.2     88%       3rd                          0.73    0.51    0.80
1.4     92%       2nd                          0.76    0.57    0.84
1.6     95%       1st                          0.79    0.62    0.87
1.8     96%       1st                          0.82    0.67    0.90
2.0     98%       1st (or 1st out of 44)       0.84    0.71    0.92
2.5     99%       1st (or 1st out of 160)      0.89    0.78    0.96
3.0     99.9%     1st (or 1st out of 740)      0.93    0.83    0.98

Another way to conceptualise the overlap is in terms of the probability that
one could guess which group a person came from, based only on their test score (or
whatever value was being compared). If the effect size were 0 (i.e. the two groups
were the same) then the probability of a correct guess would be exactly a half, or
0.50. With a difference between the two groups equivalent to an effect size of 0.3,
there is still plenty of overlap, and the probability of correctly identifying the groups
rises only slightly to 0.56. With an effect size of 1, the probability is now 0.69, just
over a two-thirds chance. These probabilities are shown in the fourth column of Table
I. It is clear that the overlap between experimental and control groups is substantial
(and therefore the probability is still close to 0.5), even when the effect size is quite
large.
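
The probabilities quoted in this paragraph can be reproduced under one plausible set of assumptions, which the text itself does not spell out: two equal-sized Normal groups with the same standard deviation, and a guesser who assigns anyone above the midpoint of the two means to the higher group. Under those assumptions the chance of a correct guess is the Normal cumulative probability at d/2. The function name is hypothetical.

```python
from statistics import NormalDist


def prob_correct_guess(d):
    """Chance of guessing a person's group from their score, assuming two
    equal-sized Normal groups with equal SD and a midpoint cut-off."""
    return NormalDist().cdf(abs(d) / 2)


for d in (0.0, 0.3, 1.0):
    print(d, round(prob_correct_guess(d), 2))
# prints 0.5, 0.56 and 0.69 for the three effect sizes
```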

A slightly different way to interpret effect sizes makes use of an equivalence
between the standardised mean difference (d) and the correlation coefficient, r. If
group membership is coded with a dummy variable (e.g. denoting the control group
by 0 and the experimental group by 1) and the correlation between this variable and
the outcome measure calculated, a value of r can be derived. By making some
additional assumptions, one can readily convert d into r in general, using the equation
r² = d² / (4 + d²) (see Cohen, 1969, pp 20-22 for other formulae and conversion table).
Rosenthal and Rubin (1982) take advantage of an interesting property of r to suggest a
further interpretation, which they call the binomial effect size display (BESD). If the
outcome measure is reduced to a simple dichotomy (for example, whether a score is
above or below a particular value such as the median, which could be thought of as
'success' or 'failure'), r can be interpreted as the difference in the proportions in each
category. For example, an effect size of 0.2 indicates a difference of 0.10 in these
proportions, as would be the case if 45% of the control group and 55% of the
treatment group had reached some threshold of 'success'. Note, however, that if the
overall proportion 'successful' is not close to 50%, this interpretation can be
somewhat misleading (Strahan, 1991; McGraw, 1991). The values for the BESD are
shown in column 5.
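
As a minimal sketch of this conversion, the following Python applies r² = d² / (4 + d²) and then splits r symmetrically around 50%, which is the BESD reading described above. The function names are chosen here for illustration, and the symmetric split assumes an overall 'success' rate of 50%.

```python
from math import sqrt


def d_to_r(d):
    """Convert a standardised mean difference d into a correlation r,
    using r^2 = d^2 / (4 + d^2)."""
    return d / sqrt(4 + d * d)


def besd(d):
    """Binomial effect size display: 'success' proportions in the control
    and treatment groups, assuming an overall success rate of 50%."""
    r = d_to_r(d)
    return 0.5 - r / 2, 0.5 + r / 2


control, treatment = besd(0.2)
print(round(d_to_r(0.2), 2), round(control, 2), round(treatment, 2))
# 0.1 0.45 0.55
```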
Finally, McGraw and Wong (1992) have suggested a 'Common Language
Effect Size' (CLES) statistic, which they argue is readily understood by
non-statisticians (shown in column 6 of Table I). This is the probability that a score
sampled at random from one distribution will be greater than a score sampled from
another. They give the example of the heights of young adult males and females,
which differ by an effect size of about 2, and translate this difference to a CLES of
0.92. In other words 'in 92 out of 100 blind dates among young adults, the male will
be taller than the female' (p361).
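
For two Normal distributions with equal standard deviations, the difference between one score drawn from each has a standard deviation of √2 times the common SD, so the CLES reduces to the Normal cumulative probability at d/√2. The sketch below rests on that equal-variance Normal assumption; the function name `cles` is an illustrative label.

```python
from math import sqrt
from statistics import NormalDist


def cles(d):
    """Probability that a random score from the higher group exceeds a
    random score from the lower group (equal-variance Normal case)."""
    return NormalDist().cdf(d / sqrt(2))


print(round(cles(2.0), 2))  # 0.92, the heights example
```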
It should be noted that the values in Table I depend on the assumption of a
Normal distribution. The interpretation of effect sizes in terms of percentiles is very
sensitive to violations of this assumption (see question 7, below).
Another way to interpret effect sizes is to compare them to the effect sizes of
differences that are familiar. For example, Cohen (1969, p23) describes an effect size
of 0.2 as 'small' and gives to illustrate it the example that the difference between the
heights of 15-year-old and 16-year-old girls in the US corresponds to an effect of this
size. An effect size of 0.5 is described as 'medium' and is 'large enough to be visible
to the naked eye'. A 0.5 effect size corresponds to the difference between the heights
of 14-year-old and 18-year-old girls. Cohen describes an effect size of 0.8 as 'grossly
perceptible and therefore large' and equates it to the difference between the heights of
13-year-old and 18-year-old girls. As a further example he states that the difference in
IQ between holders of the Ph.D. degree and 'typical college freshmen' is comparable
to an effect size of 0.8.
Cohen does acknowledge the danger of using terms like 'small', 'medium' and
'large' out of context. Glass et al. (1981, p104) are particularly critical of this
approach, arguing that the effectiveness of a particular intervention can only be
interpreted in relation to other interventions that seek to produce the same effect.
They also point out that the practical importance of an effect depends entirely on its
relative costs and benefits. In education, if it could be shown that making a small and
inexpensive change would raise academic achievement by an effect size of even as
little as 0.1, then this could be a very significant improvement, particularly if the
improvement applied uniformly to all students, and even more so if the effect were
cumulative over time.

Table II: Examples of average effect sizes from research

Intervention                            Outcome                                   Effect Size  Source
Reducing class size from 23 to 15       Students' test performance in reading     0.30         Finn and Achilles (1990)
                                        Students' test performance in maths       0.32
Small (<30) vs large class size         Attitudes of students                     0.47         Smith and Glass (1980)
                                        Attitudes of teachers                     1.03
Setting students vs mixed ability       Student achievement (overall)             0.00         Mosteller, Light and Sachs (1996)
grouping                                Student achievement (for high achievers)  0.08
                                        Student achievement (for low achievers)   0.06
Open (child-centred) vs traditional     Student achievement                       0.06         Giaconia and Hedges (1982)
classroom organisation                  Student attitudes to school               0.17
Mainstreaming vs special education      Achievement                               0.44         Wang and Baker (1986)
(for primary age, disabled students)
Practice test taking                    Test scores                               0.32         Kulik, Bangert and Kulik (1984)
Inquiry-based vs traditional science    Achievement                               0.30         Shymansky, Hedges and Woodworth (1990)
curriculum
Therapy for test anxiety (for anxious   Test performance                          0.42         Hembree (1988)
students)
Feedback to teachers about student      Student achievement                       0.70         Fuchs and Fuchs (1986)
performance (students with IEPs)
Peer tutoring                           Achievement of tutees                     0.40         Cohen, Kulik and Kulik (1982)
                                        Achievement of tutors                     0.33
Individualised instruction              Achievement                               0.10         Bangert, Kulik and Kulik (1983)
Computer assisted instruction (CAI)     Achievement (all studies)                 0.24         Fletcher-Flinn and Gravatt (1995)
                                        Achievement (in well-controlled studies)  0.02
Additive-free diet                      Children's hyperactivity                  0.02         Kavale and Forness (1983)
Relaxation training                     Medical symptoms                          0.52         Hyman et al. (1989)
Targeted interventions for at-risk      Achievement                               0.63         Slavin and Madden (1989)
students
School-based substance abuse            Substance use                             0.12         Bangert-Drowns (1988)
education
Treatment programmes for juvenile       Delinquency                               0.17         Lipsey (1992)
delinquents

Glass et al. (1981, p102) give the example that an effect size of 1 corresponds
to the difference of about a year of schooling on the performance in achievement tests
of pupils in elementary (i.e. primary) schools. However, an analysis of a standard
spelling test used in Britain (Vincent and Crumpler, 1997) suggests that the increase
in a spelling age from 11 to 12 corresponds to an effect size of about 0.3, but seems to
vary according to the particular test used.
In England, the distributions of GCSE grades in compulsory subjects (i.e.
Maths and English) have standard deviations of between 1.5 and 1.8 grades, so an
improvement of one GCSE grade represents an effect size of 0.5 to 0.7. In the context
of secondary schools therefore, introducing a change in practice whose effect size was
known to be 0.6 would result in an improvement of about a GCSE grade for each
pupil in each subject. For a school in which 50% of pupils were previously gaining
five or more A*-C grades, this percentage (other things being equal, and assuming
that the effect applied equally across the whole curriculum) would rise to 73%. Even
Cohen's 'small' effect of 0.2 would produce an increase from 50% to 58%, a
difference that most schools would probably categorise as quite substantial. Olejnik
and Algina (2000) give a similar example based on the Iowa Test of Basic Skills.
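
The rise from 50% to 73% (or to 58%) follows if attainment is treated as a Normally distributed variable that the intervention shifts uniformly by d standard deviations, with the pass threshold held fixed. The sketch below makes that assumption explicit; the function name `shifted_pass_rate` is hypothetical, and the same function can be tried with starting rates other than 50%.

```python
from statistics import NormalDist


def shifted_pass_rate(current_rate, d):
    """New proportion above a fixed threshold after every score rises by
    d standard deviations, assuming Normally distributed attainment."""
    std_normal = NormalDist()
    # Position of the current pass rate on the standard Normal scale.
    z = std_normal.inv_cdf(current_rate)
    return std_normal.cdf(z + d)


print(round(100 * shifted_pass_rate(0.50, 0.6)))  # 73
print(round(100 * shifted_pass_rate(0.50, 0.2)))  # 58
```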
Finally, the interpretation of effect sizes can be greatly helped by a few
examples from existing research. Table II lists a selection of these, many of which are
taken from Lipsey and Wilson (1993). The examples cited are given for illustration of
the use of effect size measures; they are not intended to be the definitive judgement on
the relative efficacy of different interventions. In interpreting them, therefore, one
should bear in mind that most of the meta-analyses from which they are derived can
be (and often have been) criticised for a variety of weaknesses, that the range of
circumstances in which the effects have been found may be limited, and that the effect
size quoted is an average which is often based on quite widely differing values.
It seems to be a feature of educational interventions that very few of them
have effects that would be described in Cohen's classification as anything other than
'small'. This appears particularly so for effects on student achievement. No doubt this
is partly a result of the wide variation found in the population as a whole, against
which the measure of effect size is calculated. One might also speculate that
achievement is harder to influence than other outcomes, perhaps because most schools
are already using optimal strategies, or because different strategies are likely to be
effective in different situations, a complexity that is not well captured by a single
average effect size.

4. What is the relationship between 'effect size' and 'significance'?
Effect size quantifies the size of the difference between two groups, and may
therefore be said to be a true measure of the significance of the difference. If, for
example, the results of Dowson's 'time of day effects' experiment were found to
apply generally, we might ask the question: 'How much difference would it make to
children's learning if they were taught a particular topic in the afternoon instead of the
morning?' The best answer we could give to this would be in terms of the effect size.
However, in statistics the word 'significance' is often used to mean 'statistical
significance', which is the likelihood that the difference between the two groups could
just be an accident of sampling. If you take two samples from the same population
there will always be a difference between them. The statistical significance is usually
calculated as a 'p-value', the probability that a difference of at least the same size
would have arisen by chance, even if there really were no difference between the two
populations. For differences between the means of two groups, this p-value would
normally be calculated from a 't-test'. By convention, if p < 0.05 (i.e. below 5%), the
difference is taken to be large enough to be 'significant'; if not, then it is 'not
significant'.
There are a number of problems with using 'significance tests' in this way
(see, for example Cohen, 1994; Harlow et al., 1997; Thompson, 1999). The main one
is that the p-value depends essentially on two things: the size of the effect and the size
of the sample. One would get a 'significant' result either if the effect were very big
(despite having only a small sample) or if the sample were very big (even if the actual
effect size were tiny). It is important to know the statistical significance of a result,
since without it there is a danger of drawing firm conclusions from studies where the
sample is too small to justify such confidence. However, statistical significance does
not tell you the most important thing: the size of the effect. One way to overcome this
confusion is to report the effect size, together with an estimate of its likely 'margin for
error' or 'confidence interval'.

5. What is the margin for error in estimating effect sizes?
Clearly, if an effect size is calculated from a very large sample it is likely to be
more accurate than one calculated from a small sample. This 'margin for error' can
be quantified using the idea of a 'confidence interval', which provides the same
information as is usually contained in a significance test: using a '95% confidence
interval' is equivalent to taking a '5% significance level'. To calculate a 95%
confidence interval, you assume that the value you got (e.g. the effect size estimate of
0.8) is the 'true' value, but calculate the amount of variation in this estimate you
would get if you repeatedly took new samples of the same size (i.e. different samples
of 38 children). For every 100 of these hypothetical new samples, by definition, 95
would give estimates of the effect size within the '95% confidence interval'. If this
confidence interval includes zero, then that is the same as saying that the result is not
statistically significant. If, on the other hand, zero is outside the range, then it is
'statistically significant at the 5% level'. Using a confidence interval is a better way
of conveying this information since it keeps the emphasis on the effect size (which is
the important information) rather than the p-value.
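A formula for calculating the confidence interval for an effect size is given by
Hedges and Olkin (1985, p86). If the effect size estimate from the sample is d, then it
is Normally distributed, with standard deviation:

\[ \sigma[d] = \sqrt{\frac{N_E + N_C}{N_E \times N_C} + \frac{d^2}{2(N_E + N_C)}} \]
Equation 2

(Where N_E and N_C are the numbers in the experimental and control groups,
respectively.)

Hence a 95% confidence interval for d would be from

\[ d - 1.96 \times \sigma[d] \quad \text{to} \quad d + 1.96 \times \sigma[d] \]
Equation 3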

To use the figures from the time-of-day experiment again, N_E = N_C = 19 and
d = 0.8, so σ[d] = √(0.105 + 0.008) = 0.34. Hence the 95% confidence interval is
[0.14, 1.46]. This would normally be interpreted (despite the fact that such an
interpretation is not strictly justified; see Oakes, 1986 for an enlightening discussion
of this) as meaning that the 'true' effect of time-of-day is very likely to be between
0.14 and 1.46. In other words, it is almost certainly positive (i.e. afternoon is better
than morning) and the difference may well be quite large.
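
The worked figures above can be reproduced with a short Python sketch. The standard-deviation formula coded here is the usual Hedges and Olkin approximation, and it matches the two terms 0.105 and 0.008 quoted in the text. The function names are illustrative only.

```python
from math import sqrt


def effect_size_sd(d, n_e, n_c):
    """Approximate standard deviation of an effect size estimate d from
    groups of size n_e (experimental) and n_c (control)."""
    return sqrt((n_e + n_c) / (n_e * n_c) + d ** 2 / (2 * (n_e + n_c)))


def confidence_interval(d, n_e, n_c, z=1.96):
    """Approximate 95% confidence interval for d (z = 1.96)."""
    margin = z * effect_size_sd(d, n_e, n_c)
    return d - margin, d + margin


low, high = confidence_interval(0.8, 19, 19)
print(round(effect_size_sd(0.8, 19, 19), 2))  # 0.34
print(round(low, 2), round(high, 2))          # 0.14 1.46
```

Because the interval excludes zero, the result would also count as statistically significant at the 5% level.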

6. How can knowledge about effect sizes be combined?
One of the main advantages of using effect size is that when a particular
experiment has been replicated, the different effect size estimates from each study can
easily be combined to give an overall best estimate of the size of the effect. This
process of synthesising experimental results into a single effect size estimate is known
as 'meta-analysis'. It was developed in its current form by an educational statistician,
Gene Glass (see Glass et al., 1981), though the roots of meta-analysis can be traced a
good deal further back (see Lepper et al., 1999), and is now widely used, not only in
education, but in medicine and throughout the social sciences. A brief and accessible
introduction to the idea of meta-analysis can be found in Fitz-Gibbon (1984).
Meta-analysis, however, can do much more than simply produce an overall
'average' effect size, important though this often is. If, for a particular intervention,
some studies produced large effects, and some small effects, it would be of limited
value simply to combine them together and say that the average effect was 'medium'.
Much more useful would be to examine the original studies for any differences
between those with large and small effects and to try to understand what factors might
account for the difference. The best meta-analysis, therefore, involves seeking
relationships between effect sizes and characteristics of the intervention, the context
and study design in which they were found (Rubin, 1992; see also Lepper et al. (1999)
for a discussion of the problems that can be created by failing to do this, and some
other limitations of the applicability of meta-analysis).
The importance of replication in gaining evidence about what works cannot be
overstressed. In Dowson's time-of-day experiment the effect was found to be large
enough to be statistically and educationally significant. Because we know that the
pupils were allocated randomly to each group, we can be confident that chance initial
differences between the two groups are very unlikely to account for the difference in
the outcomes. Furthermore, the use of a pre-test of both groups before the
intervention makes this even less likely. However, we cannot rule out the possibility
that the difference arose from some characteristic peculiar to the children in this
particular experiment. For example, if none of them had had any breakfast that day,
this might account for the poor performance of the morning group. However, the
result would then presumably not generalise to the wider population of school
students, most of whom would have had some breakfast. Alternatively, the effect
might depend on the age of the students. Dowson's students were aged 7 or 8; it is
quite possible that the effect could be diminished or reversed with older (or younger)
students. This illustrates the danger of implementing policy on the basis of a single
experiment. Confidence in the generality of a result can only follow widespread
replication.
An important consequence of the capacity of meta-analysis to combine results
is that even small studies can make a significant contribution to knowledge. The kind
of experiment that can be done by a single teacher in a school might involve a total of
fewer than 30 students. Unless the effect is huge, a study of this size is most unlikely
to get a statistically significant result. According to conventional statistical wisdom,
therefore, the experiment is not worth doing. However, if the results of several such
experiments are combined using meta-analysis, the overall result is likely to be highly
statistically significant. Moreover, it will have the important strengths of being
derived from a range of contexts (thus increasing confidence in its generality) and
from real-life working practice (thereby making it more likely that the policy is
feasible and can be implemented authentically).
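
The point about small studies can be illustrated with a rough sketch. The text does not say how the estimates should be combined; the code below assumes one common approach, an inverse-variance weighted ('fixed-effect') average, and uses the usual Hedges and Olkin approximation for the variance of each estimate. The five studies are invented numbers for illustration only, and all function names are hypothetical.

```python
from math import sqrt


def d_variance(d, n_e, n_c):
    """Approximate sampling variance of an effect size estimate."""
    return (n_e + n_c) / (n_e * n_c) + d ** 2 / (2 * (n_e + n_c))


def ci95(d, se):
    """Approximate 95% confidence interval from an estimate and its SE."""
    return d - 1.96 * se, d + 1.96 * se


def combine_fixed_effect(studies):
    """Inverse-variance weighted mean of (d, n_e, n_c) tuples.
    Returns the pooled estimate and its standard error."""
    weights = [1 / d_variance(d, n_e, n_c) for d, n_e, n_c in studies]
    pooled = sum(w * s[0] for w, s in zip(weights, studies)) / sum(weights)
    return pooled, sqrt(1 / sum(weights))


# Five hypothetical classroom-sized studies, 15 pupils per group.
studies = [(0.3, 15, 15), (0.4, 15, 15), (0.5, 15, 15),
           (0.6, 15, 15), (0.7, 15, 15)]

for d, n_e, n_c in studies:
    print(d, ci95(d, sqrt(d_variance(d, n_e, n_c))))  # each includes zero

pooled_d, pooled_se = combine_fixed_effect(studies)
print(pooled_d, ci95(pooled_d, pooled_se))  # interval lies above zero
```

In this invented example none of the five studies is statistically significant on its own, yet the combined estimate is. Whether such pooling is sensible still depends on the studies being commensurable.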

One final caveat should be made here about the danger of combining
incommensurable results. Given two (or more) numbers, one can always calculate an
average. However, if they are effect sizes from experiments that differ significantly in
terms of the outcome measures used, then the result may be totally meaningless. It
can be very tempting, once effect sizes have been calculated, to treat them as all the
same and lose sight of their origins. Certainly, there are plenty of examples of
meta-analyses in which the juxtaposition of effect sizes is somewhat questionable.
In comparing (or combining) effect sizes, one should therefore consider
carefully whether they relate to the same outcomes. This advice applies not only to
meta-analysis, but to any other comparison of effect sizes. Moreover, because of the
sensitivity of effect size estimates to reliability and range restriction (see below), one
should also consider whether those outcome measures are derived from the same (or
sufficiently similar) instruments and the same (or sufficiently similar) populations.
It is also important to compare only like with like in terms of the treatments
used to create the differences being measured. In the education literature, the same
name is often given to interventions that are actually very different, for example, if
they are operationalised differently, or if they are simply not well enough defined for
it to be clear whether they are the same or not. It could also be that different studies
have used the same well-defined and operationalised treatments, but the actual
implementation differed, or that the same treatment may have had different levels of
intensity in different studies. In any of these cases, it makes no sense to average out
their effects.

7. What other factors can influence effect size?
Although effect size is a simple and readily interpreted measure of
effectiveness, it can also be sensitive to a number of spurious influences, so some care
needs to be taken in its use. Some of these issues are outlined here.

Which 'standard deviation'?
The first problem is the issue of which 'standard deviation' to use. Ideally, the
control group will provide the best estimate of standard deviation, since it consists of
a representative group of the population who have not been affected by the
experimental intervention. However, unless the control group is very large, the
estimate of the 'true' population standard deviation derived from only the control
group is likely to be appreciably less accurate than an estimate derived from both the
control and experimental groups. Moreover, in studies where there is not a true
'control' group (for example the time-of-day effects experiment) then it may be an
arbitrary decision which group's standard deviation to use, and it will often make an
appreciable difference to the estimate of effect size.
For these reasons, it is often better to use a 'pooled' estimate of standard
deviation. The pooled estimate is essentially an average of the standard deviations of
the experimental and control groups (Equation 4). Note that this is not the same as the
standard deviation of all the values in both groups 'pooled' together. If, for example
each group had a low standard deviation but the two means were substantially
different, the true pooled estimate (as calculated by Equation 4) would be much lower
than the value obtained by pooling all the values together and calculating the standard
deviation. The implications of choices about which standard deviation to use are
discussed by Olejnik and Algina (2000).

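\[ SD_{pooled} = \sqrt{\frac{(N_E - 1)\,SD_E^2 + (N_C - 1)\,SD_C^2}{N_E + N_C - 2}} \]
Equation 4
(Where N_E and N_C are the numbers in the experimental and control groups,
respectively, and SD_E and SD_C are their standard deviations.)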

The use of a pooled estimate of standard deviation depends on the assumption
that the two calculated standard deviations are estimates of the same population value.
In other words, that the experimental and control group standard deviations differ only
as a result of sampling variation. Where this assumption cannot be made (either
because there is some reason to believe that the two standard deviations are likely to
be systematically different, or if the actual measured values are very different), then a
pooled estimate should not be used.
In the example of Dowson's time-of-day experiment, the standard deviations
for the morning and afternoon groups were 4.12 and 2.10 respectively. With N_E = N_C
= 19, Equation 4 therefore gives SD_pooled as 3.3, which was the value used in Equation
1 to give an effect size of 0.8. However, the difference between the two standard
deviations seems quite large in this case. Given that the afternoon group mean was
17.9 out of 20, it seems likely that its standard deviation may have been reduced by a
'ceiling effect', i.e. the spread of scores was limited by the maximum available mark
of 20. In this case therefore, it might be more appropriate to use the morning group's
standard deviation as the best estimate. Doing this will reduce the effect size to 0.7,
and it then becomes a somewhat arbitrary decision which value of the effect size to
use. A general rule of thumb in statistics when two valid methods give different
answers is: 'If in doubt, cite both.'
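
A short sketch may help to make the choice of standard deviation concrete. The pooled formula coded below is the standard one (a weighted average of the two group variances), and it reproduces the value of 3.3 quoted above. The function name and the toy scores in the second half are invented for illustration.

```python
from math import sqrt
from statistics import stdev


def pooled_sd(n_e, sd_e, n_c, sd_c):
    """Pooled standard deviation of two groups."""
    return sqrt(((n_e - 1) * sd_e ** 2 + (n_c - 1) * sd_c ** 2)
                / (n_e + n_c - 2))


# Dowson's figures: afternoon SD 2.10, morning SD 4.12, 19 in each group.
sd_pooled = pooled_sd(19, 2.10, 19, 4.12)
print(round(sd_pooled, 1))                  # 3.3
print(round((17.9 - 15.2) / sd_pooled, 1))  # 0.8 using the pooled SD
print(round((17.9 - 15.2) / 4.12, 1))       # 0.7 using the morning SD

# Pooled SD is not the SD of all the values thrown together:
group_a = [1, 2, 3]      # hypothetical scores, SD = 1
group_b = [11, 12, 13]   # hypothetical scores, SD = 1
print(pooled_sd(3, stdev(group_a), 3, stdev(group_b)))  # 1.0
print(stdev(group_a + group_b))                         # much larger
```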

Corrections for bias
Although using the pooled standard deviation to calculate the effect size
generally gives a better estimate than the control group SD, it is still unfortunately
slightly biased and in general gives a value slightly larger than the true population
value (Hedges and Olkin, 1985). Hedges and Olkin (1985, p80) give a formula which
provides an approximate correction to this bias.
In Dowson's experiment with 38 values, the correction factor will be 0.98, so
it makes very little difference, reducing the effect size estimate from 0.82 to 0.80.
Given the likely accuracy of the figures on which this is based, it is probably only
worth quoting one decimal place, so the figure of 0.8 stands. In fact, the correction
only becomes significant for small samples, in which the accuracy is anyway much
less. It is therefore hardly worth worrying about it in primary reports of empirical
results. However, in meta-analysis, where results from primary studies are combined,
the correction is important, since without it this bias would be accumulated.
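
The correction itself is not printed in the text. The sketch below assumes the approximation usually attributed to Hedges and Olkin, which multiplies the effect size by 1 - 3/(4(N_E + N_C) - 9). It reproduces the factor of 0.98 quoted for 38 values. The function name is an illustrative label.

```python
def bias_correction_factor(n_e, n_c):
    """Approximate small-sample correction factor for an effect size
    based on a pooled standard deviation (assumed formula)."""
    return 1 - 3 / (4 * (n_e + n_c) - 9)


factor = bias_correction_factor(19, 19)
print(round(factor, 2))         # 0.98
print(round(0.82 * factor, 2))  # 0.8

# The factor matters more as samples shrink, e.g. five per group:
print(round(bias_correction_factor(5, 5), 2))  # 0.9
```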

Restricted range
Suppose the time-of-day effects experiment were to be repeated, once with the
top set in a highly selective school and again with a mixed-ability group in a
comprehensive. If students were allocated to morning and afternoon groups at
random, the respective differences between them might be the same in each case; both
means in the selective school might be higher, but the difference between the two
groups could be the same as the difference in the comprehensive. However, it is
unlikely that the standard deviations would be the same. The spread of scores found
within the highly selected group would be much less than that in a true cross-section
of the population, as for example in the mixed-ability comprehensive class. This, of
course, would have a substantial impact on the calculation of the effect size. With the
highly restricted range found in the selective school, the effect size would be much
larger than that found in the comprehensive.
Ideally, in calculating effect size one should use the standard deviation of the
full population, in order to make comparisons fair. However, there will be many
cases in which unrestricted values are not available, either in practice or in principle.
For example, in considering the effect of an intervention with university students, or
with pupils with reading difficulties, one must remember that these are restricted
populations. In reporting the effect size, one should draw attention to this fact; if the
amount of restriction can be quantified it may be possible to make allowance for it.
Any comparison with effect sizes calculated from a full-range population must be
made with great caution, if at all.

Non-Normal distributions
The interpretations of effect sizes given in Table I depend on the assumption
that both control and experimental groups have a 'Normal' distribution, i.e. the
familiar 'bell-shaped' curve, shown, for example, in Figure 1. Needless to say, if this
assumption is not true then the interpretation may be altered, and in particular, it may
be difficult to make a fair comparison between an effect size based on Normal
distributions and one based on non-Normal distributions.

Figure 2: Comparison of Normal and non-Normal distributions
(curves shown: Standard Normal Distribution, S.D. = 1; similar-looking distribution
with fatter extremes, S.D. = 3.3)

An illustration of this is given in Figure 2, which shows the frequency curves for two distributions, one of them Normal, the other a 'contaminated normal' distribution (Wilcox, 1998), which is similar in shape, but with somewhat fatter extremes. In fact, the latter does look just a little more spread out than the Normal distribution, but its standard deviation is actually over three times as big. The consequence of this in terms of effect size differences is shown in Figure 3. Both graphs show distributions that differ by an effect size equal to 1, but the appearance of the effect size difference from the graphs is rather dissimilar. In graph (b), the separation between experimental and control groups seems much larger, yet the effect size is actually the same as for the Normal distributions plotted in graph (a). In terms of the amount of overlap, in graph (b) 97% of the 'experimental' group are above the control group mean, compared with the value of 84% for the Normal distribution of graph (a) (as given in Table I). This is quite a substantial difference and illustrates the danger of using the values in Table I when the distribution is not known to be Normal.
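The comparison can be checked numerically. The sketch below is illustrative only: it assumes the contaminated normal is a mixture of 90% N(0, 1) and 10% N(0, 10), a common textbook choice that happens to give a standard deviation of about 3.3; the text does not state the mixture actually used, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist(0.0, 1.0)

# Assumed mixture (not given in the text): 90% N(0, 1) plus 10% N(0, 10).
WEIGHTS = (0.9, 0.1)
COMPONENT_SDS = (1.0, 10.0)


def mixture_sd():
    """SD of a zero-mean mixture: square root of the weighted component variances."""
    return sqrt(sum(w * s ** 2 for w, s in zip(WEIGHTS, COMPONENT_SDS)))


def proportion_above_control_mean(effect_size):
    """Share of the shifted (experimental) mixture lying above the control mean of 0."""
    shift = effect_size * mixture_sd()  # effect size is expressed in SD units
    return sum(w * STD_NORMAL.cdf(shift / s) for w, s in zip(WEIGHTS, COMPONENT_SDS))


sd = mixture_sd()
normal_share = STD_NORMAL.cdf(1.0)                 # Normal case, effect size 1
mixture_share = proportion_above_control_mean(1.0)  # contaminated case, effect size 1

print(f"mixture SD: {sd:.2f}")
print(f"Normal, share above control mean: {normal_share:.1%}")
print(f"contaminated, share above control mean: {mixture_share:.1%}")
```

Under these assumed parameters the Normal case gives about 84%, while the contaminated case comes out in the mid-to-high nineties, in line with the contrast described above.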

[Figure 3: Normal and non-Normal distributions with effect size = 1. Panel (a): Normal distributions; panel (b): non-Normal distributions.]

Measurement reliability
A third factor that can spuriously affect an effect size is the reliability of the measurement on which it is based. According to classical measurement theory, any measure of a particular outcome may be considered to consist of the 'true' underlying value, together with a component of 'error'. The problem is that the amount of variation in measured scores for a particular sample (i.e. its standard deviation) will depend on both the variation in underlying scores and the amount of error in their measurement.
To give an example, imagine the time-of-day experiment were conducted twice with two (hypothetically) identical samples of students. In the first version the test used to assess their comprehension consisted of just 10 items and their scores were converted into a percentage. In the second version a test with 50 items was used, and again converted to a percentage. The two tests were of equal difficulty and the actual effect of the difference in time of day was the same in each case, so the respective mean percentages of the morning and afternoon groups were the same for both versions. However, it is almost always the case that a longer test will be more reliable, and hence the standard deviation of the percentages on the 50-item test will be lower than the standard deviation for the 10-item test. Thus, although the true effect was the same, the calculated effect sizes will be different.
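A rough numerical sketch of this example follows. It assumes a simple binomial error model (each item is answered correctly with the student's true proportion, so error variance shrinks as items are added), and the group means and true standard deviation are invented for illustration; none of these figures come from the study.

```python
from math import sqrt


def observed_effect_size(mean_a, mean_b, true_sd, n_items):
    """Effect size computed from observed proportion-correct scores.

    Observed variance = true-score variance + error variance, where the
    error variance of an n-item proportion score is modelled as
    E[p(1 - p)] / n (binomial), evaluated at the pooled group mean.
    """
    pooled_mean = (mean_a + mean_b) / 2
    # E[p(1 - p)] = m(1 - m) - Var(p) for true proportions with mean m
    expected_pq = pooled_mean * (1 - pooled_mean) - true_sd ** 2
    error_variance = expected_pq / n_items
    observed_sd = sqrt(true_sd ** 2 + error_variance)
    return (mean_b - mean_a) / observed_sd


# Hypothetical values: true means 60% and 65%, true SD 10 percentage points.
true_d = (0.65 - 0.60) / 0.10
d_10_items = observed_effect_size(0.60, 0.65, 0.10, n_items=10)
d_50_items = observed_effect_size(0.60, 0.65, 0.10, n_items=50)

print(f"true effect size:      {true_d:.2f}")
print(f"10-item test estimate: {d_10_items:.2f}")
print(f"50-item test estimate: {d_50_items:.2f}")
```

With identical true means, the shorter test yields the smaller calculated effect size because its extra measurement error inflates the standard deviation.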
In interpreting an effect size, it is therefore important to know the reliability of the measurement from which it was calculated. This is one reason why the reliability of any outcome measure used should be reported. It is theoretically possible to make a correction for unreliability (sometimes called 'attenuation'), which gives an estimate of what the effect size would have been, had the reliability of the test been perfect. However, in practice the effect of this is rather alarming, since the worse the test was, the more you increase the estimate of the effect size. Moreover, estimates of reliability are dependent on the particular population in which the test was used, and are themselves anyway subject to sampling error. For further discussion of the impact of reliability on effect sizes, see Baugh (2002).
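The usual form of this correction divides the observed effect size by the square root of the reliability coefficient. The sketch below assumes that standard form; the function name and the example values are hypothetical.

```python
from math import sqrt


def correct_for_unreliability(observed_d, reliability):
    """Estimate the effect size that a perfectly reliable test would have given.

    Uses the standard disattenuation formula d / sqrt(reliability).
    """
    if not 0 < reliability <= 1:
        raise ValueError("reliability must be in the interval (0, 1]")
    return observed_d / sqrt(reliability)


# The same observed effect size is inflated more as the test gets worse.
for reliability in (0.9, 0.7, 0.5):
    corrected = correct_for_unreliability(0.4, reliability)
    print(f"reliability {reliability:.1f}: corrected effect size {corrected:.2f}")
```

The loop shows the behaviour described above: the lower the reliability, the larger the corrected estimate.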

8. Are there alternative measures of effect size?
A number of statistics are sometimes proposed as alternative measures of effect size, other than the 'standardised mean difference'. Some of these will be considered here.

Proportion of variance accounted for
If the correlation between two variables is r, the square of this value (often denoted with a capital letter: R²) represents the proportion of the variance in each that is 'accounted for' by the other. In other words, this is the proportion by which the variance of the outcome measure is reduced when it is replaced by the variance of the residuals from a regression equation. This idea can be extended to multiple regression (where it represents the proportion of the variance accounted for by all the independent variables together) and has close analogies in ANOVA (where it is usually called 'eta-squared', η²). The calculation of r (and hence R²) for the kind of experimental situation we have been considering has already been referred to above.
Because R² has this ready convertibility, it (or alternative measures of variance accounted for) is sometimes advocated as a universal measure of effect size (e.g. Thompson, 1999). One disadvantage of such an approach is that effect size measures based on variance accounted for suffer from a number of technical limitations, such as sensitivity to violation of assumptions (heterogeneity of variance, balanced designs), and their standard errors can be large (Olejnik and Algina, 2000). They are also generally more statistically complex and hence perhaps less easily understood. Further, they are non-directional: two studies with precisely opposite results would report exactly the same variance accounted for. However, there is a more fundamental objection to the use of what is essentially a measure of association to indicate the strength of an 'effect'.
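The non-directional point is easy to see numerically. The sketch below uses the common conversion r = d / sqrt(d² + 4), which assumes two groups of equal size; the function names are hypothetical.

```python
from math import sqrt


def d_to_r(d):
    """Convert a standardised mean difference to a correlation (equal group sizes assumed)."""
    return d / sqrt(d ** 2 + 4)


def variance_accounted_for(d):
    """R squared corresponding to a standardised mean difference d."""
    return d_to_r(d) ** 2


# Two studies with precisely opposite results...
for d in (0.5, -0.5):
    print(f"d = {d:+.1f}: r = {d_to_r(d):+.3f}, R^2 = {variance_accounted_for(d):.3f}")
# ...differ in the sign of r but report the same R^2.
```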
Expressing different measures in terms of the same statistic can hide important differences between them; in fact, these different 'effect sizes' are fundamentally different, and should not be confused. The crucial difference between an effect size calculated from an experiment and one calculated from a correlation is in the causal nature of the claim that is being made for it. Moreover, the word 'effect' has an inherent implication of causality: talking about 'the effect of A on B' does suggest a causal relationship rather than just an association. Unfortunately, however, the word 'effect' is often used when no explicit causal claim is being made, but its implication is sometimes allowed to float in and out of the meaning, taking advantage of the ambiguity to suggest a subliminal causal link where none is really justified.
This kind of confusion is so widespread in education that it is recommended here that the word 'effect' (and therefore 'effect size') should not be used unless a deliberate and explicit causal claim is being made. When no such claim is being made, we may talk about the 'variance accounted for' (R²) or the 'strength of association' (r), or simply, and perhaps most informatively, just cite the regression coefficient (Tukey, 1969). If a causal claim is being made it should be explicit and justification provided. Fitz-Gibbon (2002) has recommended an alternative approach to this problem. She has suggested a system of nomenclature for different kinds of effect sizes that clearly distinguishes between effect sizes derived from, for example, randomised controlled, quasi-experimental and correlational studies.

Other measures of effect size
It has been shown that the interpretation of the 'standardised mean difference' measure of effect size is very sensitive to violations of the assumption of normality. For this reason, a number of more robust (non-parametric) alternatives have been suggested. An example of these is given by Cliff (1993). There are also effect size measures for multivariate outcomes. A detailed explanation can be found in Olejnik and Algina (2000). A method for calculating effect sizes within multilevel models has also been proposed by Tymms et al. (1997). Good summaries of many of the different kinds of effect size measures that can be used and the relationships among them can be found in Snyder and Lawson (1993), Rosenthal (1994) and Kirk (1996).
Finally, a common effect size measure widely used in medicine is the 'odds ratio'. This is appropriate where an outcome is dichotomous: success or failure, a patient survives or does not. Explanations of the odds ratio can be found in a number of medical statistics texts, including Altman (1991), and in Fleiss (1994).
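As a minimal illustration of the odds ratio for a dichotomous outcome, the sketch below computes it from a 2 x 2 table of counts. The function name and the counts are invented for the example.

```python
def odds_ratio(success_treated, failure_treated, success_control, failure_control):
    """Odds of success in the treated group divided by the odds in the control group."""
    counts = (success_treated, failure_treated, success_control, failure_control)
    if min(counts) <= 0:
        raise ValueError("all four cell counts must be positive")
    odds_treated = success_treated / failure_treated
    odds_control = success_control / failure_control
    return odds_treated / odds_control


# Hypothetical trial: 30 of 50 treated patients succeed, 20 of 50 controls succeed.
example = odds_ratio(30, 20, 20, 30)
print(f"odds ratio: {example:.2f}")
```

An odds ratio of 1 indicates no difference between the groups; values above 1 favour the treated group.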

Conclusions
Advice on the use of effect sizes can be summarised as follows:
• Effect size is a standardised, scale-free measure of the relative size of the effect of an intervention. It is particularly useful for quantifying effects measured on unfamiliar or arbitrary scales and for comparing the relative sizes of effects from different studies.
• Interpretation of effect size generally depends on the assumptions that 'control' and 'experimental' group values are Normally distributed and have the same standard deviations. Effect sizes can be interpreted in terms of the percentiles or ranks at which two distributions overlap, in terms of the likelihood of identifying the source of a value, or with reference to known effects or outcomes.
• Use of an effect size with a confidence interval conveys the same information as a test of statistical significance, but with the emphasis on the significance of the effect, rather than the sample size.
• Effect sizes (with confidence intervals) should be calculated and reported in primary studies as well as in meta-analyses.
• Interpretation of standardised effect sizes can be problematic when a sample has restricted range or does not come from a Normal distribution, or if the measurement from which it was derived has unknown reliability.
• The use of an 'unstandardised' mean difference (i.e. the raw difference between the two groups, together with a confidence interval) may be preferable when:
    • the outcome is measured on a familiar scale
    • the sample has a restricted range
    • the parent population is significantly non-Normal
    • control and experimental groups have appreciably different standard deviations
    • the outcome measure has very low or unknown reliability
• Care must be taken in comparing or aggregating effect sizes based on different outcomes, different operationalisations of the same outcome, different treatments, or levels of the same treatment, or measures derived from different populations.
• The word 'effect' conveys an implication of causality, and the expression 'effect size' should therefore not be used unless this implication is intended and can be justified.
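The recommendation to report an effect size with a confidence interval can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: it uses the pooled standard deviation and the widely used large-sample approximation for the standard error of a standardised mean difference, and the summary statistics are invented.

```python
from math import sqrt


def effect_size_with_ci(mean_exp, mean_ctrl, sd_exp, sd_ctrl, n_exp, n_ctrl, z=1.96):
    """Standardised mean difference with an approximate 95% confidence interval."""
    # Pooled SD assumes the two groups share a common population SD.
    pooled_var = ((n_exp - 1) * sd_exp ** 2 + (n_ctrl - 1) * sd_ctrl ** 2) / (n_exp + n_ctrl - 2)
    d = (mean_exp - mean_ctrl) / sqrt(pooled_var)
    # Large-sample approximation to the standard error of d.
    se = sqrt((n_exp + n_ctrl) / (n_exp * n_ctrl) + d ** 2 / (2 * (n_exp + n_ctrl)))
    return d, d - z * se, d + z * se


# Hypothetical summary statistics for two groups of 30.
d, lower, upper = effect_size_with_ci(55.0, 50.0, 10.0, 10.0, 30, 30)
print(f"effect size {d:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

In this invented example the interval just includes zero, so the confidence interval carries the same information as a non-significant test at the 5% level while keeping the size of the effect in view.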

1 This calculation is derived from a probit transformation (Glass et al., 1981, p136), based on the assumption of an underlying normally distributed variable measuring academic attainment, some threshold of which is equivalent to a student achieving 5+ A*-Cs. Percentages for the change from a starting value of 50% for other effect size values can be read directly from Table I. Alternatively, if $\Phi(z)$ is the standard normal cumulative distribution function, $p_1$ is the proportion achieving a given threshold and $p_2$ the proportion to be expected after a change with effect size $d$, then
$p_2 = \Phi\{\Phi^{-1}(p_1) + d\}$
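A minimal sketch of this probit calculation, using the standard normal distribution from Python's standard library; the function name is hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def shifted_proportion(p1, d):
    """Proportion expected above a threshold after a shift of effect size d.

    p1 is the starting proportion above the threshold (strictly between 0 and 1).
    """
    return _STD_NORMAL.cdf(_STD_NORMAL.inv_cdf(p1) + d)


# Starting from 50%, an effect size of 1 moves the proportion to about 84%.
print(f"{shifted_proportion(0.5, 1.0):.1%}")
```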

References
ALTMAN, D.G. (1991) Practical Statistics for Medical Research. London: Chapman and Hall.
BANGERT, R.L., KULIK, J.A. and KULIK, C.C. (1983) Individualised systems of instruction in secondary schools. Review of Educational Research, 53, 143-158.
BANGERT-DROWNS, R.L. (1988) The effects of school-based substance abuse education: a meta-analysis. Journal of Drug Education, 18, 3, 243-65.
BAUGH, F. (2002) Correcting effect sizes for score reliability: A reminder that measurement and substantive issues are linked inextricably. Educational and Psychological Measurement, 62, 2, 254-263.
CLIFF, N. (1993) Dominance Statistics: ordinal analyses to answer ordinal questions. Psychological Bulletin, 114, 3, 494-509.
COHEN, J. (1969) Statistical Power Analysis for the Behavioral Sciences. NY: Academic Press.
COHEN, J. (1994) The Earth is Round (p<.05). American Psychologist, 49, 997-1003.
COHEN, P.A., KULIK, J.A. and KULIK, C.C. (1982) Educational outcomes of tutoring: a meta-analysis of findings. American Educational Research Journal, 19, 237-248.
DOWSON, V. (2000) Time of day effects in schoolchildren's immediate and delayed recall of meaningful material. TERSE Report http://www.cem.dur.ac.uk/ebeuk/research/terse/library.htm
FINN, J.D. and ACHILLES, C.M. (1990) Answers and questions about class size: A statewide experiment. American Educational Research Journal, 27, 557-577.
FITZ-GIBBON, C.T. (1984) Meta-analysis: an explication. British Educational Research Journal, 10, 2, 135-144.
FITZ-GIBBON, C.T. (2002) A Typology of Indicators for an Evaluation-Feedback Approach in A.J. Visscher and R. Coe (Eds.) School Improvement Through Performance Feedback. Lisse: Swets and Zeitlinger.
FLEISS, J.L. (1994) Measures of Effect Size for Categorical Data in H. Cooper and L.V. Hedges (Eds.), The Handbook of Research Synthesis. New York: Russell Sage Foundation.
FLETCHER-FLINN, C.M. and GRAVATT, B. (1995) The efficacy of Computer Assisted Instruction (CAI): a meta-analysis. Journal of Educational Computing Research, 12(3), 219-242.
FUCHS, L.S. and FUCHS, D. (1986) Effects of systematic formative evaluation: a meta-analysis. Exceptional Children, 53, 199-208.
GIACONIA, R.M. and HEDGES, L.V. (1982) Identifying features of effective open education. Review of Educational Research, 52, 579-602.
GLASS, G.V., MCGAW, B. and SMITH, M.L. (1981) Meta-Analysis in Social Research. London: Sage.
HARLOW, L.L., MULAIK, S.S. and STEIGER, J.H. (Eds) (1997) What if there were no significance tests? Mahwah NJ: Erlbaum.

HEDGES, L. and OLKIN, I. (1985) Statistical Methods for Meta-Analysis. New York: Academic Press.
HEMBREE, R. (1988) Correlates, causes, effects and treatment of test anxiety. Review of Educational Research, 58(1), 47-77.
HUBERTY, C.J. (2002) A history of effect size indices. Educational and Psychological Measurement, 62, 2, 227-240.
HYMAN, R.B., FELDMAN, H.R., HARRIS, R.B., LEVIN, R.F. and MALLOY, G.B. (1989) The effects of relaxation training on medical symptoms: a meta-analysis. Nursing Research, 38, 216-220.
KAVALE, K.A. and FORNESS, S.R. (1983) Hyperactivity and diet treatment: a meta-analysis of the Feingold hypothesis. Journal of Learning Disabilities, 16, 324-330.
KESELMAN, H.J., HUBERTY, C.J., LIX, L.M., OLEJNIK, S., CRIBBIE, R.A., DONAHUE, B., KOWALCHUK, R.K., LOWMAN, L.L., PETOSKEY, M.D., KESELMAN, J.C. and LEVIN, J.R. (1998) Statistical practices of educational researchers: An analysis of their ANOVA, MANOVA, and ANCOVA analyses. Review of Educational Research, 68, 3, 350-386.
KIRK, R.E. (1996) Practical Significance: A concept whose time has come. Educational and Psychological Measurement, 56, 5, 746-759.
KULIK, J.A., KULIK, C.C. and BANGERT, R.L. (1984) Effects of practice on aptitude and achievement test scores. American Educational Research Journal, 21, 435-447.
LEPPER, M.R., HENDERLONG, J. and GINGRAS, I. (1999) Understanding the effects of extrinsic rewards on intrinsic motivation - Uses and abuses of meta-analysis: Comment on Deci, Koestner, and Ryan. Psychological Bulletin, 125, 6, 669-676.
LIPSEY, M.W. (1992) Juvenile delinquency treatment: a meta-analytic inquiry into the variability of effects. In T.D. Cook, H. Cooper, D.S. Cordray, H. Hartmann, L.V. Hedges, R.J. Light, T.A. Louis and F. Mosteller (Eds) Meta-analysis for explanation. New York: Russell Sage Foundation.
LIPSEY, M.W. and WILSON, D.B. (1993) The Efficacy of Psychological, Educational, and Behavioral Treatment: Confirmation from meta-analysis. American Psychologist, 48, 12, 1181-1209.
MCGRAW, K.O. (1991) Problems with the BESD: a comment on Rosenthal's 'How Are We Doing in Soft Psychology'. American Psychologist, 46, 1084-6.
MCGRAW, K.O. and WONG, S.P. (1992) A Common Language Effect-Size Statistic. Psychological Bulletin, 111, 361-365.
MOSTELLER, F., LIGHT, R.J. and SACHS, J.A. (1996) Sustained inquiry in education: lessons from skill grouping and class size. Harvard Educational Review, 66, 797-842.
OAKES, M. (1986) Statistical Inference: A Commentary for the Social and Behavioral Sciences. New York: Wiley.
OLEJNIK, S. and ALGINA, J. (2000) Measures of Effect Size for Comparative Studies: Applications, Interpretations and Limitations. Contemporary Educational Psychology, 25, 241-286.

ROSENTHAL, R. (1994) Parametric Measures of Effect Size in H. Cooper and L.V. Hedges (Eds.), The Handbook of Research Synthesis. New York: Russell Sage Foundation.
ROSENTHAL, R. and RUBIN, D.B. (1982) A simple, general purpose display of magnitude of experimental effect. Journal of Educational Psychology, 74, 166-169.
RUBIN, D.B. (1992) Meta-analysis: literature synthesis or effect-size surface estimation. Journal of Educational Statistics, 17, 4, 363-374.
SHYMANSKY, J.A., HEDGES, L.V. and WOODWORTH, G. (1990) A reassessment of the effects of inquiry-based science curricula of the 60s on student performance. Journal of Research in Science Teaching, 27, 127-144.
SLAVIN, R.E. and MADDEN, N.A. (1989) What works for students at risk? A research synthesis. Educational Leadership, 46(4), 4-13.
SMITH, M.L. and GLASS, G.V. (1980) Meta-analysis of research on class size and its relationship to attitudes and instruction. American Educational Research Journal, 17, 419-433.
SNYDER, P. and LAWSON, S. (1993) Evaluating Results Using Corrected and Uncorrected Effect Size Estimates. Journal of Experimental Education, 61, 4, 334-349.
STRAHAN, R.F. (1991) Remarks on the Binomial Effect Size Display. American Psychologist, 46, 1083-4.
THOMPSON, B. (1999) Common methodology mistakes in educational research, revisited, along with a primer on both effect sizes and the bootstrap. Invited address presented at the annual meeting of the American Educational Research Association, Montreal. [Accessed from <http://acs.tamu.edu/~bbt6147/aeraad99.htm>, January 2000]
TYMMS, P., MERRELL, C. and HENDERSON, B. (1997) The First Year at School: A Quantitative Investigation of the Attainment and Progress of Pupils. Educational Research and Evaluation, 3, 2, 101-118.
VINCENT, D. and CRUMPLER, M. (1997) British Spelling Test Series Manual 3X/Y. Windsor: NFER-Nelson.
WANG, M.C. and BAKER, E.T. (1986) Mainstreaming programs: Design features and effects. Journal of Special Education, 19, 503-523.
WILCOX, R.R. (1998) How many discoveries have been lost by ignoring modern statistical methods? American Psychologist, 53, 3, 300-314.
WILKINSON, L. and TASK FORCE ON STATISTICAL INFERENCE, APA BOARD OF SCIENTIFIC AFFAIRS (1999) Statistical Methods in Psychology Journals: Guidelines and Explanations. American Psychologist, 54, 8, 594-604.
