Data Exploration
Investigation into the average age at someone's first marriage, separated by country,
and how scaling the data differently can affect measures of center and spread. Visual
representations include histograms, boxplots, and stem-and-leaf plots.
Emily Heubaum
Data Exploration Mini Project 1
Kiker Stats Pd 3
Average Age at the Time of Someone's First Marriage
The values I chose to work with represent the average age of a person when they enter
their first marriage, separated by country. I chose this data because I knew there would be
data for it, and because it can characterize a country's people. I wanted to look at global
data, but I didn't want to use a morbid variable like child mortality rate.
The population in question is all married people. Unmarried people are not considered
part of the married population. The units used are whole years. Whole years were used so
that every value is reported to the same precision, in this case with no decimal places. My
data was collected from the Wikipedia entry on average age at first marriage by country (Age).
The sample size was 30 countries. All calculations were done using the corresponding R
functions, and the code is presented at the end. The mean age was 29.13 years old. The
minimum age was 23, which belonged to China. The median age was 29.5, while the maximum
age was 34, which belonged to both Denmark and Iceland. The first quartile was 27 years old,
and the third quartile was 32 years old. The interquartile range for this distribution was
5 years. The overall range of ages was 11 years. The standard deviation of these ages was
3.15 years, so the variance is 9.91 years squared.
There are no outliers on either end of the distribution when using the 1.5 x IQR method.
The lower cutoff for outliers was 19.5, but the minimum value was 23. Similarly, the upper
cutoff was 39.5, while the maximum value was 34.
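The cutoff arithmetic above can be sketched as a small function. This is only an illustration: `outlier_fences` is a hypothetical helper name (not part of base R), and the quartiles passed in are the ones reported in this section.

```r
# 1.5 x IQR outlier fences from a first and third quartile.
# `outlier_fences` is a hypothetical helper, not a base R function.
outlier_fences <- function(q1, q3, k = 1.5) {
  iqr <- q3 - q1
  c(lower = q1 - k * iqr, upper = q3 + k * iqr)
}

# Quartiles reported for the original data: Q1 = 27, Q3 = 32.
fences <- outlier_fences(q1 = 27, q3 = 32)
print(fences)

# The reported minimum (23) and maximum (34) both sit inside the fences,
# so neither end of the distribution has outliers.
no_outliers <- 23 >= fences[["lower"]] && 34 <= fences[["upper"]]
print(no_outliers)
```

Passing the quartiles in as arguments means the same function can be reused for the shifted and scaled data sets later in the report.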
A histogram, a boxplot, and a stemplot are shown below.
The next calculations work with the same data set, but with 100 added to each value.
Again, all calculations were done in R, and the code is shown at the end. The sample size is
still 30 countries. The new minimum is 123 years old, while the new maximum is 134 years
old. The new first quartile is 127 years old, while the new third quartile is 132 years old.
The range is still 11 years, and the variance is still 9.91 years squared.
The new mean is 129.133, which is the same as the previous mean with 100 added to it.
The new median is 129.5, which, again, is the same as the original median with 100 added to
it. The standard deviation for this set, however, is exactly the same as the original
standard deviation.
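The effect of adding a constant can be checked on any numeric vector. The sketch below uses a short made-up vector `x` purely for illustration; it is not the 30-country data set used in this report.

```r
# Illustrative values only; NOT the 30-country data set from this report.
x <- c(23, 26, 27, 29, 30, 32, 34)
shifted <- x + 100

# Measures of center move by exactly the added constant.
center_shift <- c(mean   = mean(shifted) - mean(x),
                  median = median(shifted) - median(x))
print(center_shift)

# Measures of spread are unchanged by the shift.
spread_same <- c(
  sd    = isTRUE(all.equal(sd(shifted), sd(x))),
  var   = isTRUE(all.equal(var(shifted), var(x))),
  iqr   = isTRUE(all.equal(IQR(shifted), IQR(x))),
  range = isTRUE(all.equal(diff(range(shifted)), diff(range(x))))
)
print(spread_same)
```

`all.equal()` is used instead of `==` so that tiny floating-point differences do not register as a change.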
The IQR is still 5. The lower cutoff for outliers is 119.5, which is below our minimum
value. The upper cutoff for outliers is 139.5, which is above the maximum value, so there
are no outliers. The overall effect is that the whole data set was shifted over by 100, so
the distances between the cases stayed exactly the same. Graphs for this new data set are
below.
These next calculations were done with the original data set, with each value increased
by 50%. All calculations were completed using R functions. The sample size is still 30
countries. The new mean is 43.7, which is the original mean increased by 50%. Similarly,
the median is 44.25, which is the original median increased by 50%. Unlike in the previous
calculations, the standard deviation is not the same as the original. The standard deviation
for this data set is 4.72, which is the original standard deviation increased by 50%.
The new minimum is 34.5 years old, and the new maximum is 51 years old. The first
quartile is 40.5 years old, while the third quartile is 48 years old, making the
interquartile range 7.5, which is a 50% increase from the original IQR. The new range is
16.5, which is a 50% increase from the original range. The new variance is 22.3, which is a
125% increase from the original. This is logical because the variance is the standard
deviation squared, so the factor by which the standard deviation increases is squared for
the variance: 1.5 squared is 2.25, and 9.91 x 2.25 is 22.3.
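The scaling rule can be sketched the same way. The vector `x` below is again made up for illustration (it is not the report's data), and `k` is just a name chosen here for the multiplying factor.

```r
# Illustrative values only; NOT the 30-country data set from this report.
x <- c(23, 26, 27, 29, 30, 32, 34)
k <- 1.5                 # "increase by 50%" means multiply by 1.5
scaled <- x * k

# Center and spread both scale by k; variance scales by k squared.
scales_as_expected <- c(
  mean   = isTRUE(all.equal(mean(scaled),   k * mean(x))),
  median = isTRUE(all.equal(median(scaled), k * median(x))),
  sd     = isTRUE(all.equal(sd(scaled),     k * sd(x))),
  var    = isTRUE(all.equal(var(scaled),    k^2 * var(x)))
)
print(scales_as_expected)

# Percent increase in variance implied by k = 1.5.
variance_pct_increase <- (k^2 - 1) * 100
print(variance_pct_increase)

# Ratio of the two variances reported in this section.
reported_ratio <- round(22.30345 / 9.912644, 2)
print(reported_ratio)
```

The last line uses the variances printed in the R output at the end of the report, so it checks the 125% claim against the reported figures rather than the made-up vector.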
The IQR is 7.5, so the lower cutoff for outliers is 29.25, which is far below the
minimum for this set. The upper cutoff for outliers is 59.25, well above our maximum value
for this set, so there are no outliers. The histogram, boxplot, and stemplot are shown
below.
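For these last calculations, we will assume the original data set was normally
distributed, with all calculations done in R. The percentage of the observed data that is
more than 5 units above the mean is actually 0%. This is because 5 units above the mean is
34.13, which is above the maximum value of 34. (Under the normal model itself, the area
above 34.13 is about 5.6%, since z = 5/3.15 is about 1.59, so the 0% figure describes the
sample rather than the normal curve.) The percentage of data between 3 units below and
2 units above the mean, that is, between 26.13 and 31.13, is 56.8%. The top 10% of the data
corresponds to values equal to or greater than 33.2 years old. I calculated this by finding
the z-score for the 90th percentile (the lowest of the top 10%) using a Z table. I used the
z-score equation to find that value's deviation from the mean, and added the deviation to
the mean to get the lowest top-10% value.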
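As a cross-check on the Z-table steps, the same three quantities can be computed directly with `pnorm()` and `qnorm()`. This sketch assumes a normal model with the mean and standard deviation reported earlier. `mu` and `sigma` are names chosen here for illustration. Because it uses unrounded z-scores, the results may differ slightly in the last digit from the table-based values.

```r
mu    <- 29.13333   # mean reported for the original data
sigma <- 3.148435   # standard deviation reported for the original data

# P(X > mu + 5) under Normal(mu, sigma).
# mean= and sd= must be given; pnorm() defaults to mean 0 and sd 1.
p_above <- 1 - pnorm(mu + 5, mean = mu, sd = sigma)

# P(mu - 3 < X < mu + 2): area below the upper bound minus area below the lower.
p_between <- pnorm(mu + 2, mean = mu, sd = sigma) -
             pnorm(mu - 3, mean = mu, sd = sigma)

# Value that cuts off the top 10% (the 90th percentile).
cutoff_top10 <- qnorm(0.90, mean = mu, sd = sigma)

print(round(c(p_above = p_above,
              p_between = p_between,
              cutoff_top10 = cutoff_top10), 3))
```

Working in the original units with `mean =` and `sd =` avoids converting to z-scores by hand, which is where rounding from the Z table enters.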
In conclusion, adding a constant to data doesn't change the relationships between the
values. It just shifts them up, adding that constant to the measures of center, like the
median and mean. Measures of spread, like the range and standard deviation, don't change.
Multiplying by a constant, like increasing by 50%, changes the measures of center by the
same multiplying factor. The difference is that the measures of spread change too, also by
the multiplying factor (the variance changes by the factor squared). Adding a constant
shifts the whole data set up, while multiplying by a constant changes the distances between
the values, but by a predictable factor.
Source:
"Age at First Marriage." Wikipedia. Wikimedia Foundation, n.d. Web. 22 Sept. 2015.
<https://en.wikipedia.org/wiki/Age_at_first_marriage>.
Original Data:
> ages <- c(Marriage$Age.First.Marriage)
> # SUMMARY
> fivenum(ages)
[1] 23.0 27.0 29.5 32.0 34.0
> # MEAN
> mean(ages)
[1] 29.13333
> # MEDIAN
> median(ages)
[1] 29.5
> # RANGE
> range(ages)
[1] 23 34
> 34 - 23
[1] 11
> # STANDARD DEVIATION
> sd(ages)
[1] 3.148435
> # VARIANCE
> (sd(ages))^2
[1] 9.912644
> # IQR
> 32.0 - 27.0
[1] 5
> # OUTLIERS
> # lower: Q1 - 1.5 IQR
> 27 - (1.5*5)
[1] 19.5
> # upper: Q3 + 1.5 IQR
> 32 + (1.5*5)
[1] 39.5
> # HISTOGRAM
> hist(ages, main="Average Age at First Marriage by Country", xlab="Average Age")
> # BOXPLOT
> boxplot(ages, main="Average Age at First Marriage by Country", xlab="Average Age", horizontal=TRUE)
> # STEMPLOT
> stem(ages, scale=2)

+100 Data:
> # ADD 100
> plus <- ages + 100
> # SUMMARY
> fivenum(plus)
[1] 123.0 127.0 129.5 132.0 134.0
> # MEAN
> mean(plus)
[1] 129.1333
> # MEDIAN
> median(plus)
[1] 129.5
> # RANGE
> range(plus)
[1] 123 134
> # STANDARD DEVIATION
> sd(plus)
[1] 3.148435
> # VARIANCE
> (sd(plus))^2
[1] 9.912644
> # IQR
> 132 - 127
[1] 5
> # OUTLIERS
> # lower: Q1 - 1.5 IQR
> 127 - (1.5*5)
[1] 119.5
> # upper: Q3 + 1.5 IQR
> 132.0 + (1.5*5)
[1] 139.5
> # HISTOGRAM
> hist(plus, main="Average Age at First Marriage by Country + 100", xlab="Average Age + 100")
> # BOXPLOT
> boxplot(plus, main="Average Age at First Marriage by Country + 100", xlab="Average Age + 100", horizontal=TRUE)
> # STEMPLOT
> stem(plus, scale=2)
50% Increase Data:
> # INCREASE
> times <- ages*1.5
> # SUMMARY
> fivenum(times)
[1] 34.50 40.50 44.25 48.00 51.00
> # MEAN
> mean(times)
[1] 43.7
> # MEDIAN
> median(times)
[1] 44.25
> # RANGE
> range(times)
[1] 34.5 51.0
> 51 - 34.5
[1] 16.5
> # STANDARD DEVIATION
> sd(times)
[1] 4.722653
> # VARIANCE
> (sd(times))^2
[1] 22.30345
> # IQR
> 48 - 40.5
[1] 7.5
> # OUTLIERS
> # lower: Q1 - 1.5 IQR
> 40.5 - (1.5*7.5)
[1] 29.25
> # upper: Q3 + 1.5 IQR
> 48 + (1.5*7.5)
[1] 59.25
> # HISTOGRAM
> hist(times, main="Increased by 50% Average Age at First Marriage", xlab="Average Age increased by 50%")
> # BOXPLOT
> boxplot(times, main="Increased by 50% Average Age at First Marriage", xlab="Average Age increased by 50%", horizontal=TRUE)
> # STEMPLOT
> stem(times, scale=1)

Normal Distribution:
> # FIVE UNITS ABOVE
> # Note: pnorm() is called here without mean= and sd=, so it uses the
> # standard normal (mean 0, sd 1); that is why the result prints as 0.
> 1 - pnorm(mean(ages)+5)
[1] 0
> # BETWEEN 3 BELOW AND 2 ABOVE
> (3/sd(ages))
[1] 0.9528543
> # Lower Z SCORE is -.95, corresponding proportion is (1 - .829) below that value
> (2/sd(ages))
[1] 0.6352362
> # Upper z SCORE is .64, corresponding proportion is .739 below that value
> # Proportion between -3 and 2 from mean
> .739 - (1 - .829)
[1] 0.568
> # UNITS FOR TOP 10%
> # 1.28 = z SCORE
> # (Z Score * SD) = Deviation
> # Add mean = unit for top 10%
> (1.28*sd(ages)) + mean(ages)
[1] 33.16333
> which(Marriage$Age.First.Marriage > 33.2)
[1] 17 25