Descriptive Statistics with R

1. Initial steps
   1. Read in the data
   2. What variables are in the file?
2. Measures of central tendency
   1. What to use when?
   2. Sum and mean
   3. Median
   4. Mode
   5. Trimmed mean to remove influence of outliers
3. Measures of variability
   1. Range
   2. Quartiles and IQR
   3. Variance and sd
   4. Mean absolute deviation, median absolute deviation
4. Measures of shape
   1. Skewness
   2. Kurtosis
5. Summary of a variable
6. Describing a data frame
   1. Descriptive statistics separately for each group
   2. Summarizing an entire dataframe
7. Standard scores (z)

Initial Steps
Read in the data
Use the load() function for .Rdata files.
Use read.table() or read.csv() for plain-text files.
> setwd("~/Documents/statistics/probability_and_statistics_with_R/navarro_datasets")
> load("aflsmall.Rdata")
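For data stored as plain text rather than in an .Rdata file, read.csv() plays the role of load(). A minimal sketch, assuming a hypothetical two-column file (the file and its contents are invented for illustration and are not one of the datasets used in these notes):

```r
# Write a small, made-up CSV to a temporary file so the example is self-contained.
tmp <- tempfile(fileext = ".csv")
writeLines(c("team,margin", "A,12", "B,30", "C,7"), tmp)

# read.csv() returns a data frame; header = TRUE is the default.
toy <- read.csv(tmp)

str(toy)          # each column's name and type
names(toy)        # column names only
head(toy, n = 2)  # first two rows
```

Unlike load(), which restores objects under their saved names, read.csv() returns a value that has to be assigned to a variable.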

What variables are in the file?


Two ways:
use head() to preview the first few elements of an object
load the lsr package and use its who() function to list the variables in the workspace
> library(lsr)
> who()
   -- Name --      -- Class --   -- Size --
   afl.finalists   factor        400
   afl.margins     numeric       176
   x               integer       1

Measures of central tendency


What to use when
Measure   Data type
Mean      Ratio, Interval
Median    Ordinal (usually), also Ratio, Interval
Mode      Nominal (usually), also Ordinal, Ratio, Interval

Sum and mean


> sum(afl.margins)
[1] 6213
> sum(afl.margins[1:5])          # sum of a subset of data
[1] 183
> sum(afl.margins[1:5]) / 5      # mean of a subset of data
[1] 36.6
> mean(x = afl.margins)          # x is the argument passed to mean()
[1] 35.30114

Median
Usage:
ordinal data
ratio data
interval data
To find the median by hand, first sort the values; median() does the sorting itself:
> sort(x = afl.margins)
[1] 0 0 1 1 1 1 2 2 3 3 3 3 3 3 3 3 4 4 5
[20] 6 7 7 8 8 8 8 8 9 9 9 9 9 9 10 10 10 10 10
...
> median(x = afl.margins)
[1] 30.5
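The sort-then-pick-the-middle procedure can be written out explicitly. The sketch below uses a small hypothetical vector (not the afl.margins values) and compares the manual result with median(); with an even number of values, such as the 176 margins, the median is the average of the two middle values.

```r
# Hypothetical data, not the afl.margins values.
v <- c(12, 3, 45, 7, 30, 31)

s <- sort(v)   # 3 7 12 30 31 45
n <- length(s)

# Even n: average the two middle values; odd n: take the middle value.
manual_median <- if (n %% 2 == 0) {
  mean(s[c(n / 2, n / 2 + 1)])
} else {
  s[(n + 1) / 2]
}

manual_median  # 21
median(v)      # 21, same result
```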

Mode
Who has played the most finals?
> print(afl.finalists)
[1] Hawthorn  Melbourne Carlton   Melbourne
[5] Hawthorn  Carlton   Melbourne Carlton
...

Get a frequency table:

> table(afl.finalists)
afl.finalists
        Adelaide         Brisbane          Carlton      Collingwood
              26               25               26               28
        Essendon          Fitzroy        Fremantle          Geelong
              32                0                6               39
        Hawthorn        Melbourne  North Melbourne    Port Adelaide
              27               28               28               17
        Richmond         St Kilda           Sydney       West Coast
               6               24               26               38
Western Bulldogs
              24

Find the mode with modeOf() and its frequency with maxFreq() (both from the lsr package).


> modeOf(x = afl.finalists)
[1] "Geelong"
> maxFreq(x=afl.finalists)
[1] 39
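If the lsr package is not available, the same two answers can be read off a frequency table in base R. A minimal sketch on a hypothetical factor (the team list below is invented, not the afl.finalists data):

```r
# Hypothetical factor, not the afl.finalists data.
teams <- factor(c("Geelong", "Carlton", "Geelong", "Sydney", "Geelong", "Carlton"))

counts <- table(teams)

names(counts)[which.max(counts)]  # "Geelong": the most frequent level
max(counts)                       # 3: how often it occurs

# which.max() reports only the first maximum, so check for ties explicitly.
names(counts)[counts == max(counts)]
```

The last line matters when two or more levels share the highest count, in which case the data have more than one mode.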

Trimmed mean to remove influence of outliers


> dataset <- c(-15,2,3,4,5,6,7,8,9,12)
> mean(x=dataset)
[1] 4.1
> median(x=dataset)
[1] 5.5
> mean(x=dataset, trim=0.1)      # trim by 10%: drops one value from each end
[1] 5.5                          # here the trimmed mean happens to equal the median

For afl.margins dataset:


> mean(x=afl.margins, trim=0.05)
[1] 33.75
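To see what trim actually does, the trimmed mean can be reproduced by sorting, dropping values from both ends, and averaging what is left. A minimal sketch using the 10-value dataset from above (the variable names k and manual_trimmed are introduced here for illustration):

```r
dataset <- c(-15, 2, 3, 4, 5, 6, 7, 8, 9, 12)
trim <- 0.1

s <- sort(dataset)
n <- length(s)
k <- floor(n * trim)         # number of values dropped from each end (here 1)

manual_trimmed <- mean(s[(k + 1):(n - k)])
manual_trimmed               # 5.5
mean(dataset, trim = trim)   # 5.5
```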

Measures of variability
> range(afl.margins)
[1]   0 116
> quantile(x = afl.margins, probs = c(0.25, 0.75))  # 25th and 75th percentiles
  25%   75%
12.75 50.50
> IQR(x = afl.margins)                              # width of the range spanned by the middle half of the data
[1] 37.75
> var(afl.margins)
[1] 679.8345
> sd(afl.margins)
[1] 26.07364
> mean(abs(afl.margins - mean(afl.margins)))        # mean absolute deviation
[1] 21.10124
> mad(afl.margins)                                  # median absolute deviation, scaled by 1.4826 by default
[1] 28.9107
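The relationships among these measures are easier to see when they are computed by hand. The sketch below uses a small hypothetical vector (not afl.margins) and checks each manual result against the built-in function; note that mad() multiplies the raw median absolute deviation by a constant, 1.4826 by default.

```r
v <- c(2, 4, 4, 4, 5, 5, 7, 9)  # hypothetical data
n <- length(v)

# Sample variance: squared deviations summed and divided by n - 1.
manual_var <- sum((v - mean(v))^2) / (n - 1)
all.equal(manual_var, var(v))             # TRUE
all.equal(sqrt(manual_var), sd(v))        # TRUE

# Raw median absolute deviation, and the scaled version mad() reports by default.
raw_mad <- median(abs(v - median(v)))
all.equal(raw_mad, mad(v, constant = 1))  # TRUE
all.equal(1.4826 * raw_mad, mad(v))       # TRUE
```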

Measures of shape
Skewness (measure of asymmetry) and kurtosis:
> library(psych)
> skew(x=afl.margins)            # the data are quite skewed
[1] 0.7671555
> kurtosi(x=afl.margins)         # note the spelling!
[1] 0.02962633
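Both measures are built from central moments of the data. The sketch below shows one common moment-based definition on a hypothetical vector; it is an illustration only, and packaged functions such as those in psych offer several variants of the formula, so their results can differ slightly from this one.

```r
v <- c(1, 2, 3, 4, 10)  # hypothetical data with a long right tail
d <- v - mean(v)

m2 <- mean(d^2)         # second central moment (divides by n, not n - 1)
m3 <- mean(d^3)         # third central moment
m4 <- mean(d^4)         # fourth central moment

skewness_g1 <- m3 / m2^1.5        # positive here: the tail is on the right
excess_kurtosis <- m4 / m2^2 - 3  # 0 for a normal distribution

skewness_g1
excess_kurtosis
```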

Summary of a variable
> summary(object = afl.margins)      # argument is numeric
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   0.00   12.75   30.50   35.30   50.50  116.00
> summary(object = afl.finalists)    # argument is a factor
        Adelaide         Brisbane          Carlton      Collingwood
              26               25               26               28
        Essendon          Fitzroy        Fremantle          Geelong
              32                0                6               39
        Hawthorn        Melbourne  North Melbourne    Port Adelaide
              27               28               28               17
        Richmond         St Kilda           Sydney       West Coast
               6               24               26               38
Western Bulldogs
              24
> f2 <- as.character(afl.finalists)  # factor to character vector
> summary(object = f2)
   Length     Class      Mode
      400 character character

> describe(x = afl.margins)


  var   n mean    sd median trimmed   mad min max range skew kurtosis   se
1   1 176 35.3 26.07   30.5   32.82 28.91   0 116   116 0.77     0.03 1.97

For a logical vector:


e.g. how many blowouts were there?
Blowout = a game in which the winning margin exceeds 50 points.
> blowouts <- afl.margins > 50
> blowouts
[1] TRUE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE
[14] FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE

> summary(object = blowouts)


   Mode   FALSE    TRUE    NA's
logical     132      44       0
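Because R treats TRUE as 1 and FALSE as 0 in arithmetic, a logical vector can also be counted directly with sum() and mean(). A minimal sketch on hypothetical margins (six invented values, not the full afl.margins vector):

```r
margins <- c(56, 31, 56, 8, 32, 14)  # hypothetical margins
blow <- margins > 50

sum(blow)    # 2: number of TRUE values
mean(blow)   # proportion of TRUE values (2 out of 6)
table(blow)  # counts of FALSE and TRUE
```

Applied to the summary above, 44 TRUE values out of 176 games corresponds to a proportion of 0.25.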

Describing a dataframe
> load("clinicaltrial.Rdata")
> who(TRUE)
   -- Name --    -- Class --   -- Size --
   clin.trial    data.frame    18 x 3
    $drug        factor        18
    $therapy     factor        18
    $mood.gain   numeric       18

Descriptive statistics separately for each group


Three functions:
by()
describeBy()
aggregate()
describeBy() (from the psych package) takes an argument, group, that specifies the grouping variable.
The following gives statistics broken down by therapy type.

> describeBy(x=clin.trial, group=clin.trial$therapy)


group: no.therapy
          var n mean   sd median trimmed  mad min max range skew kurtosis   se
drug*       1 9 2.00 0.87    2.0    2.00 1.48 1.0 3.0   2.0 0.00    -1.81 0.29
therapy*    2 9 1.00 0.00    1.0    1.00 0.00 1.0 1.0   0.0  NaN      NaN 0.00
mood.gain   3 9 0.72 0.59    0.5    0.72 0.44 0.1 1.7   1.6 0.51    -1.59 0.20
--------------------------------------------------------------
group: CBT
          var n mean   sd median trimmed  mad min max range  skew kurtosis   se
drug*       1 9 2.00 0.87    2.0    2.00 1.48 1.0 3.0   2.0  0.00    -1.81 0.29
therapy*    2 9 2.00 0.00    2.0    2.00 0.00 2.0 2.0   0.0   NaN      NaN 0.00
mood.gain   3 9 1.04 0.45    1.1    1.04 0.44 0.3 1.8   1.5 -0.03    -1.12 0.15

The by() function has argument FUN, which specifies the name of the function you want to apply
separately to each group.

> by(data = clin.trial, INDICES = clin.trial$therapy, FUN = describe) # same as describeBy()


clin.trial$therapy: no.therapy
          var n mean   sd median trimmed  mad min max range skew kurtosis   se
drug*       1 9 2.00 0.87    2.0    2.00 1.48 1.0 3.0   2.0 0.00    -1.81 0.29
therapy*    2 9 1.00 0.00    1.0    1.00 0.00 1.0 1.0   0.0  NaN      NaN 0.00
mood.gain   3 9 0.72 0.59    0.5    0.72 0.44 0.1 1.7   1.6 0.51    -1.59 0.20
--------------------------------------------------------------
clin.trial$therapy: CBT
          var n mean   sd median trimmed  mad min max range  skew kurtosis   se
drug*       1 9 2.00 0.87    2.0    2.00 1.48 1.0 3.0   2.0  0.00    -1.81 0.29
therapy*    2 9 2.00 0.00    2.0    2.00 0.00 2.0 2.0   0.0   NaN      NaN 0.00
mood.gain   3 9 1.04 0.45    1.1    1.04 0.44 0.3 1.8   1.5 -0.03    -1.12 0.15
> by(data = clin.trial, INDICES = clin.trial$therapy, FUN = summary)
clin.trial$therapy: no.therapy
       drug         therapy    mood.gain
 placebo :3   no.therapy:9   Min.   :0.1000
 anxifree:3   CBT       :0   1st Qu.:0.3000
 joyzepam:3                  Median :0.5000
                             Mean   :0.7222
                             3rd Qu.:1.3000
                             Max.   :1.7000
--------------------------------------------------------------
clin.trial$therapy: CBT
       drug         therapy    mood.gain
 placebo :3   no.therapy:0   Min.   :0.300
 anxifree:3   CBT       :9   1st Qu.:0.800
 joyzepam:3                  Median :1.100
                             Mean   :1.044
                             3rd Qu.:1.300
                             Max.   :1.800

Use aggregate() to group by several variables at once.


e.g. Look at average mood gain for all possible combinations of drug and therapy.
> aggregate(mood.gain ~ drug + therapy, data = clin.trial, FUN = mean)  # formula left unnamed: its argument name differs across R versions
      drug    therapy mood.gain
1  placebo no.therapy  0.300000
2 anxifree no.therapy  0.400000
3 joyzepam no.therapy  1.466667
4  placebo        CBT  0.600000
5 anxifree        CBT  1.033333
6 joyzepam        CBT  1.500000
> aggregate(mood.gain ~ drug + therapy, data = clin.trial, FUN = sd)  # formula left unnamed: its argument name differs across R versions
      drug    therapy mood.gain
1  placebo no.therapy 0.2000000
2 anxifree no.therapy 0.2000000
3 joyzepam no.therapy 0.2081666
4  placebo        CBT 0.3000000
5 anxifree        CBT 0.2081666
6 joyzepam        CBT 0.2645751
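The formula given to aggregate() reads as "outcome ~ grouping variables", so dropping a term from the right-hand side averages over it. A minimal sketch on a hypothetical four-row data frame (toy and its column names are invented here, not clin.trial), with tapply() shown as a base-R alternative for a single grouping variable:

```r
# Hypothetical data frame, much smaller than clin.trial.
toy <- data.frame(
  drug    = c("placebo", "placebo", "joyzepam", "joyzepam"),
  therapy = c("none", "CBT", "none", "CBT"),
  gain    = c(0.3, 0.6, 1.4, 1.5)
)

# One row per drug, averaging over therapy.
by_drug <- aggregate(gain ~ drug, data = toy, FUN = mean)
by_drug

# tapply() gives the same means as a named vector.
tapply(toy$gain, toy$drug, mean)
```

aggregate() returns a data frame, which is convenient for further processing; tapply() returns a named vector (or an array when given several grouping variables).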

Summarizing an entire dataframe


> summary(clin.trial)
       drug         therapy    mood.gain
 placebo :6   no.therapy:9   Min.   :0.1000
 anxifree:6   CBT       :9   1st Qu.:0.4250
 joyzepam:6                  Median :0.8500
                             Mean   :0.8833
                             3rd Qu.:1.3000
                             Max.   :1.8000
> describe(x=clin.trial)             # load psych package first
          var  n mean   sd median trimmed  mad min max range skew kurtosis   se
drug*       1 18 2.00 0.84   2.00    2.00 1.48 1.0 3.0   2.0 0.00    -1.66 0.20
therapy*    2 18 1.50 0.51   1.50    1.50 0.74 1.0 2.0   1.0 0.00    -2.11 0.12
mood.gain   3 18 0.88 0.53   0.85    0.88 0.67 0.1 1.8   1.7 0.13    -1.44 0.13

Standard scores (z)


> x <- c(3,10,8,4,9,11,6)
> mean(x)
[1] 7.285714
> sd(x)
[1] 3.039424
> z <- ((10 - mean(x)) / sd(x))      # calculating z score for 10
> z
[1] 0.8930265

To calculate the percentile rank of the z-score, use pnorm():


> pnorm(0.8930625)
[1] 0.8140881

Interpretation:
z = 0.8930. The individual score is 0.89 sd above the mean.
pnorm value: If 10 had been a score for laziness, then that individual is lazier than 81.4% of the
people sampled.
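To standardize every value at once rather than one score at a time, base R provides scale(). A minimal sketch using the same x as above (the name z_all is introduced here for illustration):

```r
x <- c(3, 10, 8, 4, 9, 11, 6)

# scale() centers on the mean and divides by the sd; as.vector() drops the matrix wrapper.
z_all <- as.vector(scale(x))
round(z_all, 2)

# The second element is the z-score for 10 computed by hand above.
all.equal(z_all[2], (10 - mean(x)) / sd(x))  # TRUE
```

A full set of z-scores has mean 0 and standard deviation 1 by construction, which is a quick sanity check on the result.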

Handling missing values


> partial.data <- c(10, 20, NA, 30)
> mean(x = partial.data)
[1] NA
> mean(x = partial.data, na.rm = TRUE)
[1] 20
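The na.rm argument is not specific to mean(); other base summary functions such as sum(), median() and sd() accept it too. A short sketch on the same vector, also counting how many values are missing:

```r
partial.data <- c(10, 20, NA, 30)

sum(is.na(partial.data))            # 1 missing value
sum(partial.data, na.rm = TRUE)     # 60
median(partial.data, na.rm = TRUE)  # 20
sd(partial.data, na.rm = TRUE)      # 10
```

Counting the missing values first is worthwhile, since na.rm = TRUE silently shrinks the sample the statistic is based on.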