
R: An Introduction

Jyotirmoy Bhattacharya

Wednesday 02 November 2016


Setting things up
library(tidyverse)

## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr

## Conflicts with tidy packages ----------------------------------------------

## filter(): dplyr, stats
## lag(): dplyr, stats

library(stringr)
library(haven)
library(readxl)
library(AER)

## Loading required package: car

##
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':


##
## recode

## The following object is masked from 'package:purrr':


##
## some

## Loading required package: lmtest

## Loading required package: zoo

##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric

## Loading required package: sandwich

## Loading required package: survival

Estimation and Inference


Data import
Before we can carry out any statistical analysis we must be able to get
our data into R. The readr, readxl and haven libraries provide support for
importing data into R from CSV, Excel and SAS/SPSS/Stata files
respectively.

For our first example, we use a data set on wages distributed with
Wooldridge's Introductory Econometrics (3e). Download the archive of
the data sets in Stata format and extract it to a folder called wooldridge-
stata in your current project directory. The data set lives in the file
WAGE2.DTA. We use the read_dta function from the haven package to read
the file. read_dta returns a data frame containing the data read in. We
assign this value to a variable we call wage2.

wage2 <- read_dta('wooldridge-stata/WAGE2.DTA')

Data summaries
The function summary shows useful summary statistics about variables

summary(wage2$wage)

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 115.0 669.0 905.0 957.9 1160.0 3078.0

summary(wage2$educ)

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 12.00 12.00 13.47 16.00 18.00

One can also use functions that calculate specific summary statistics:
mean, median, sd (standard deviation), quantile etc.
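For example, to compute a few of these directly for the variables we just summarised:

mean(wage2$wage)
sd(wage2$wage)
quantile(wage2$educ, probs=c(0.25,0.5,0.75))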
Fitting a linear regression model
Now suppose we want to regress the log of wage on education
and experience. The function that does this is lm, which stands for
"linear model". lm does not itself produce any output. Rather, it returns
an object containing information about the fitted model. For now, we
assign this returned object to the variable wagemod1.

wagemod1 <- lm(log(wage)~educ+exper,data=wage2)

In the call to lm the expression log(wage) ~ educ + exper is a formula,
with the dependent variable on the left-hand side and the independent
variables on the right-hand side. As this example shows, the dependent
variable for lm can be a transformation of the variables in the original
data frame. See the help for formula for the detailed formula syntax.

To view the results of the estimation we can apply the print or
summary function to the object returned by lm. The former produces
more concise output.

print(wagemod1)

##
## Call:
## lm(formula = log(wage) ~ educ + exper, data = wage2)
##
## Coefficients:
## (Intercept) educ exper
## 5.50271 0.07778 0.01978

summary(wagemod1)

##
## Call:
## lm(formula = log(wage) ~ educ + exper, data = wage2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.86915 -0.24001 0.03564 0.26132 1.30062
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.502710 0.112037 49.115 < 2e-16 ***
## educ 0.077782 0.006577 11.827 < 2e-16 ***
## exper 0.019777 0.003303 5.988 3.02e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.393 on 932 degrees of freedom
## Multiple R-squared: 0.1309, Adjusted R-squared: 0.129
## F-statistic: 70.16 on 2 and 932 DF, p-value: < 2.2e-16

The output should be familiar to anyone who has used a regression
package before.

There are many other useful functions that can be applied to the object
returned by lm. coef can be used to access individual coefficients, fitted
gives the fitted values, residuals gives the residuals, and confint gives
confidence intervals for the coefficients.
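For example, using the model we just estimated:

coef(wagemod1)["educ"]   # the coefficient on educ
confint(wagemod1)        # confidence intervals for all coefficients
head(fitted(wagemod1))   # the first few fitted values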

So, for example, to produce a fitted-versus-residuals plot we can do the
following

plot(fitted(wagemod1),residuals(wagemod1))

Here plot is a function which by default produces a scatterplot between
two vectors (it can actually do a lot more; do read its documentation).
Hypothesis Tests
Nested models
Continuing with the example above, suppose we now want to add
quadratic terms for education and experience in our model. So we
estimate

wagemod2 <-
lm(log(wage)~educ+I(educ^2)+exper+I(exper^2),data=wage2)
summary(wagemod2)

##
## Call:
## lm(formula = log(wage) ~ educ + I(educ^2) + exper + I(exper^2),
## data = wage2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.86611 -0.24298 0.02562 0.27020 1.28525
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.4707517 0.5534539 8.078 2.03e-15 ***
## educ 0.2300741 0.0786325 2.926 0.00352 **
## I(educ^2) -0.0053666 0.0027648 -1.941 0.05256 .
## exper 0.0144510 0.0135518 1.066 0.28654
## I(exper^2) 0.0002733 0.0005696 0.480 0.63146
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3927 on 930 degrees of freedom
## Multiple R-squared: 0.1344, Adjusted R-squared: 0.1307
## F-statistic: 36.11 on 4 and 930 DF, p-value: < 2.2e-16

First, a word about the formula above. The ^ operator, which otherwise
stands for exponentiation, has a special meaning in the context of
formulas. Enclosing terms like educ^2 in the identity function I
'protects' them and causes them to be interpreted in the usual
arithmetic sense.

If you compare wagemod1 and wagemod2 you will notice that the latter
has a higher R². However, this is something that we would expect
as a simple consequence of the algebra of least squares: adding
regressors can never decrease R². We need to run an F-test to see
whether the two added terms are actually statistically significant. In
situations like this, where one model is a subset of another, such
F-tests can be performed using the anova function.

anova(wagemod1,wagemod2)

## Analysis of Variance Table


##
## Model 1: log(wage) ~ educ + exper
## Model 2: log(wage) ~ educ + I(educ^2) + exper + I(exper^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 932 143.98
## 2 930 143.39 2 0.59201 1.9199 0.1472

The last column gives the p-value for the null hypothesis that the
coefficients of all the additional regressors in wagemod2 are 0. We see
that in the above example we would fail to reject this null hypothesis at
the 10% level of significance.

Other hypothesis tests


Sometimes we want to test a general system of linear hypotheses. The
function linearHypothesis in the library car allows us to do this:

library(car)
linearHypothesis(wagemod2,c("I(educ^2)=0","I(exper^2)=0"))

## Linear hypothesis test


##
## Hypothesis:
## I(educ^2) = 0
## I(exper^2) = 0
##
## Model 1: restricted model
## Model 2: log(wage) ~ educ + I(educ^2) + exper + I(exper^2)
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 932 143.98
## 2 930 143.39 2 0.59201 1.9199 0.1472

In the above example we provided the set of hypotheses to be tested
as a character vector of equations. It is also possible to provide the
hypotheses as a matrix of coefficients and a right-hand side,
representing the hypotheses as a set of linear equations in
matrix-vector form.
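For instance, the same test as above can be written in matrix form. This is only a sketch; the columns of the matrix follow the coefficient order reported by summary(wagemod2):

hyp <- rbind(c(0,0,1,0,0),   # picks out the coefficient on I(educ^2)
             c(0,0,0,0,1))   # picks out the coefficient on I(exper^2)
linearHypothesis(wagemod2, hyp, rhs=c(0,0))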

The library lmtest provides a number of functions for tests of
heteroscedasticity and autocorrelation. The library sandwich provides
functions for computing heteroscedasticity-consistent and
heteroscedasticity-and-autocorrelation-consistent variance-covariance
matrices. These can be used for hypothesis testing through the vcov.
argument to the linearHypothesis function discussed above or through
the coeftest function from the library lmtest.
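A minimal sketch of both approaches, using the heteroscedasticity-consistent estimator vcovHC from sandwich (the choice of type "HC1" is ours):

coeftest(wagemod2, vcov.=vcovHC(wagemod2, type="HC1"))
linearHypothesis(wagemod2, c("I(educ^2)=0","I(exper^2)=0"),
                 vcov.=vcovHC(wagemod2, type="HC1"))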

Categorical variables
In the data set wage2 which we we have been using, the variable black
is a dummy variable which is 1 if the worker is black and 0 otherwise. R
does not require its users to manually create such dummy variables for
categorical data. Rather, that is stored in a user-friendly categorical
format and then appropriate dummy variables are created behind the
scenes by R when estimating models.

To see how this works let us create a variable race which takes on the
values "black" and "nonblack"

wage2$race <-
factor(wage2$black,levels=c(0,1),labels=c("nonblack","black"))

R calls categorical variables factors. The function factor takes a variable
with a discrete set of values and returns a factor. The optional
argument levels allows us to provide an explicit list of values, and labels
allows us to provide user-friendly names for the different values.
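We can quickly check the result:

levels(wage2$race)
table(wage2$race)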

Now let us use our newly minted categorical variable in a regression

lm(log(wage)~educ+race,data=wage2)

##
## Call:
## lm(formula = log(wage) ~ educ + race, data = wage2)
##
## Coefficients:
## (Intercept) educ raceblack
## 6.08674 0.05358 -0.22894

When given a factor as an independent variable in a regression, R creates
dummy variables in the form of contrasts. The default is to take the
first level of the factor as the base level and to create a dummy for
each of the other levels. This is why in the above regression we get
the coefficient raceblack, which measures the difference in the
expected log(wage) between blacks and non-blacks, other things being
the same.
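To see the dummy variable that R constructs behind the scenes, we can inspect a few rows of the model matrix:

head(model.matrix(~educ+race, data=wage2))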
R formulas have convenient syntax for factors. For example, to
introduce a slope rather than an intercept dummy for race we would
use

lm(log(wage)~educ+educ:race,data=wage2)

##
## Call:
## lm(formula = log(wage) ~ educ + educ:race, data = wage2)
##
## Coefficients:
## (Intercept) educ educ:raceblack
## 6.06793 0.05500 -0.01859

For both slope and intercept dummies we have

lm(log(wage)~educ*race,data=wage2)

##
## Call:
## lm(formula = log(wage) ~ educ * race, data = wage2)
##
## Coefficients:
## (Intercept) educ raceblack educ:raceblack
## 6.04793 0.05643 0.20451 -0.03457

The * and : operators can also be used between two factors to include
interaction effects.
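For instance, interacting race with an urban/nonurban factor built from the urban dummy in wage2 gives an intercept shift for each factor plus an interaction term (the variable name location is ours, chosen for illustration):

wage2$location <- factor(wage2$urban, levels=c(0,1),
                         labels=c("nonurban","urban"))
lm(log(wage)~race*location, data=wage2)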

Finally, just to check our results, let us compare them to a model using
the traditional numerical dummy variable provided by Wooldridge:

lm(log(wage)~educ+black+I(educ*black),data=wage2)

##
## Call:
## lm(formula = log(wage) ~ educ + black + I(educ * black), data = wage2)
##
## Coefficients:
## (Intercept) educ black I(educ * black)
## 6.04793 0.05643 0.20451 -0.03457

Note the use of I. In this last model we want the * to be interpreted as
arithmetic multiplication and not as an interaction term.
Some other models
Instrumental variables
The function ivreg from the AER package fits instrumental-variable regressions by two-stage least squares. In the formula, the part after | lists the instruments; here sibs (number of siblings) instruments for educ.

ivmod <- ivreg(log(wage)~educ|sibs,data=wage2)
summary(ivmod)

##
## Call:
## ivreg(formula = log(wage) ~ educ | sibs, data = wage2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.85429 -0.26950 0.04223 0.29276 1.31039
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.13003 0.35517 14.444 < 2e-16 ***
## educ 0.12243 0.02635 4.646 3.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4233 on 933 degrees of freedom
## Multiple R-Squared: -0.009174, Adjusted R-squared: -0.01026
## Wald test: 21.59 on 1 and 933 DF, p-value: 3.865e-06

Panel data
The plm package provides estimators for panel data models. Below we convert the data to a pdata.frame, identifying the individual and time indexes, fit a fixed-effects ("within") model and a random-effects model, and compare the two with a Hausman test.

library(plm)

## Loading required package: Formula

##
## Attaching package: 'plm'

## The following objects are masked from 'package:dplyr':


##
## between, lag, lead

wagepan <- read_dta("wooldridge-stata/WAGEPAN.DTA")


wagepan <- pdata.frame(wagepan,index=c("nr","year"),
drop.index=FALSE,row.names=FALSE)
wagepanfe <-
plm(lwage~union+married+year*educ,model="within",data=wagepan)
summary(wagepanfe)

## Oneway (individual) effect Within Model


##
## Call:
## plm(formula = lwage ~ union + married + year * educ, data = wagepan,
## model = "within")
##
## Balanced Panel: n=545, T=8, N=4360
##
## Residuals :
## Min. 1st Qu. Median 3rd Qu. Max.
## -4.1500 -0.1260 0.0109 0.1610 1.4800
##
## Coefficients :
## Estimate Std. Error t-value Pr(>|t|)
## union 0.0829785 0.0194461 4.2671 2.029e-05 ***
## married 0.0548205 0.0184126 2.9773 0.002926 **
## year1981 -0.0224158 0.1458885 -0.1537 0.877893
## year1982 -0.0057611 0.1458558 -0.0395 0.968495
## year1983 0.0104297 0.1458579 0.0715 0.942999
## year1984 0.0843743 0.1458518 0.5785 0.562965
## year1985 0.0497253 0.1458602 0.3409 0.733190
## year1986 0.0656064 0.1458917 0.4497 0.652958
## year1987 0.0904448 0.1458505 0.6201 0.535216
## year1981:educ 0.0115854 0.0122625 0.9448 0.344827
## year1982:educ 0.0147905 0.0122635 1.2061 0.227872
## year1983:educ 0.0171182 0.0122633 1.3959 0.162830
## year1984:educ 0.0165839 0.0122657 1.3521 0.176437
## year1985:educ 0.0237085 0.0122738 1.9316 0.053479 .
## year1986:educ 0.0274123 0.0122740 2.2334 0.025583 *
## year1987:educ 0.0304332 0.0122723 2.4798 0.013188 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 572.05
## Residual Sum of Squares: 474.35
## R-Squared: 0.1708
## Adj. R-Squared: 0.048567
## F-statistic: 48.9069 on 16 and 3799 DF, p-value: < 2.22e-16

wagepanre <-
plm(lwage~union+married+year*educ,model="random",data=wagepan)
summary(wagepanre)

## Oneway (individual) effect Random Effect Model


## (Swamy-Arora's transformation)
##
## Call:
## plm(formula = lwage ~ union + married + year * educ, data = wagepan,
## model = "random")
##
## Balanced Panel: n=545, T=8, N=4360
##
## Effects:
## var std.dev share
## idiosyncratic 0.1249 0.3534 0.536
## individual 0.1083 0.3291 0.464
## theta: 0.645
##
## Residuals :
## Min. 1st Qu. Median 3rd Qu. Max.
## -4.560 -0.144 0.024 0.192 1.530
##
## Coefficients :
## Estimate Std. Error t-value Pr(>|t|)
## (Intercept) 0.6520146 0.1413998 4.6111 4.120e-06 ***
## union 0.1079858 0.0179371 6.0203 1.885e-09 ***
## married 0.0776855 0.0167665 4.6334 3.703e-06 ***
## year1981 -0.0213571 0.1461523 -0.1461 0.88383
## year1982 -0.0036917 0.1461247 -0.0253 0.97985
## year1983 0.0099771 0.1461263 0.0683 0.94557
## year1984 0.0816298 0.1461211 0.5586 0.57643
## year1985 0.0527819 0.1461284 0.3612 0.71797
## year1986 0.0692527 0.1461552 0.4738 0.63565
## year1987 0.0891520 0.1461201 0.6101 0.54181
## educ 0.0594818 0.0118699 5.0112 5.625e-07 ***
## year1981:educ 0.0112997 0.0122849 0.9198 0.35773
## year1982:educ 0.0142678 0.0122857 1.1613 0.24557
## year1983:educ 0.0166585 0.0122855 1.3559 0.17519
## year1984:educ 0.0162039 0.0122875 1.3187 0.18733
## year1985:educ 0.0228156 0.0122942 1.8558 0.06355 .
## year1986:educ 0.0264288 0.0122943 2.1497 0.03164 *
## year1987:educ 0.0296853 0.0122930 2.4148 0.01578 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 655.78
## Residual Sum of Squares: 544.16
## R-Squared: 0.1702
## Adj. R-Squared: 0.16695
## F-statistic: 52.3874 on 17 and 4342 DF, p-value: < 2.22e-16

phtest(wagepanfe,wagepanre)

##
## Hausman Test
##
## data: lwage ~ union + married + year * educ
## chisq = 20.142, df = 16, p-value = 0.2139
## alternative hypothesis: one model is inconsistent

Logit
Binary-response models can be fit with the glm function: family="binomial" with the default logit link gives a logit model. We illustrate with the SwissLabor data on labour-force participation distributed with the AER package.

library("AER")
data("SwissLabor")
swissmod <-
glm(participation~age*youngkids+age*oldkids,data=SwissLabor,
family="binomial")
summary(swissmod)
##
## Call:
## glm(formula = participation ~ age * youngkids + age * oldkids,
## family = "binomial", data = SwissLabor)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8376 -1.0555 -0.6511 1.1447 2.5144
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.02475 0.50658 5.971 2.36e-09 ***
## age -0.66983 0.10735 -6.239 4.39e-10 ***
## youngkids -3.53553 0.81189 -4.355 1.33e-05 ***
## oldkids -0.42969 0.38070 -1.129 0.25903
## age:youngkids 0.75470 0.25107 3.006 0.00265 **
## age:oldkids 0.08028 0.09254 0.867 0.38570
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1203.2 on 871 degrees of freedom
## Residual deviance: 1123.5 on 866 degrees of freedom
## AIC: 1135.5
##
## Number of Fisher Scoring iterations: 4

Graphics with ggplot2


R has many different graphics systems, but a particularly powerful and
coherent set of facilities is provided by the ggplot2 library. The library
was written by Hadley Wickham and is based on the ideas in Leland
Wilkinson's book The Grammar of Graphics. More information about the
library itself can be found on its website.

The functions in ggplot2 are organised around a number of concepts.
We quote from the original documentation:

Geom. "Geoms, short for geometric objects, describe the type of
plot you will produce."
Statistics. "It's often useful to transform your data before plotting,
and that's what statistical transformations do."
Scales. Scales control the mapping between data and aesthetics.
Coordinate systems. Coordinate systems adjust the mapping
from coordinates to the 2d plane of the computer screen.
Faceting. Facets display subsets of the dataset in different panels.
Position adjustments. Position adjustments can be used to fine
tune positioning of objects to achieve effects like dodging, jittering
and stacking.

One can arrange these ideas in a pipeline. The raw data undergoes
statistical transformations, is then mapped to aesthetic attributes
such as position, size or color through a choice of scales, positions are
mapped to the physical device through a coordinate system, we can
specify what happens when two objects want to be in the same
position, and finally we can create multipanel plots through faceting.
Now to illustrate.

The Basics
Let's begin with the simplest scatterplot. Using Wooldridge's wage
data, let us plot a scatterplot of wage against age. The ggplot2 library
has already been loaded as part of the tidyverse; we just choose our
default theme.

theme_set(theme_bw())

We begin to set up our plot by calling the function ggplot, with the first
argument the data frame from which the data is to come and the
second argument a set of mappings from variables to aesthetics. The
mapping is carried out by the aes function. The first two arguments to
this function are taken by default to be the x and y positions. But we
also choose to map the color of graphical elements to the variable race.
The returned value is an object representing the plot, to which more
elements can be added before we generate the plot.

plot <- ggplot(wage2,aes(x=age,y=wage,color=race))

To actually plot something we must have some geoms. Because we
want a scatterplot of points, we choose geom_point. The + operator
performs updates on existing plot objects.

plot <- plot+geom_point(alpha=0.5,position="jitter")

Here alpha is an aesthetic which determines the transparency of the
graphical elements, with 0 being fully transparent and 1 being fully
opaque. Rather than mapping this aesthetic to some data, we set it to
a fixed value of 0.5. Semi-transparent points are one way of managing
the problem of overplotting, i.e., points covering up others. With
semi-transparent points, places where there are many observations
appear darker in colour. We also choose position="jitter" to indicate that if
multiple points end up in the same place they should be shifted by a
small random amount. This also helps with the problem of overplotting,
at some cost to accuracy.

Finally we print the plot object to actually display it.

plot

Note that ggplot2 has automatically chosen scales for the x and y
variables as well as a color scale for race. The library usually makes
intelligent choices. The most common problem occurs when what is
really a categorical variable (such as race) is encoded in a numeric
variable (such as Wooldridge's black). In this case the library may
choose a scale that is more appropriate for continuous data.
Transforming the variable to a factor before plotting takes care of this
problem.
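For example, mapping colour to Wooldridge's numeric black dummy directly would get a continuous colour scale; wrapping it in factor() restores a discrete one:

ggplot(wage2, aes(x=age, y=wage, color=factor(black))) +
  geom_point(alpha=0.5, position="jitter")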

While the above graph is fine as far as it goes, sometimes we are more
interested in percentage differences in wages rather than in absolute
wages. We can depict the data in a way which shows this clearly by
explicitly specifying a log scale for the y axis. We also explicitly specify
the points at which major tick marks (the breaks) must be drawn.

plot <- plot+scale_y_log10(breaks=(1:10)*1000)


plot

Despite our attempts to deal with overplotting, the scatter is still
overwhelming. We can try to guide the viewer by also plotting some
summary statistics. We add one more layer to our plot which
transforms the data by calculating a summary statistic (median) and
then using a line geom to display it as a line graph.

plot+stat_summary(fun.y=median,geom="line")

Note that because we had mapped race to colour in the initial call to
ggplot, this layer also acquired this mapping.

One way to summarize distributions is through a box-and-whisker plot.

wage2$myurban <- if_else(wage2$urban!=0,"urban","nonurban")


ggplot(wage2,aes(race,wage)) +
geom_boxplot()+
facet_grid(~myurban)+
scale_y_log10(breaks=(1:10)*1000)

The above plot also demonstrates the use of 'facets' to draw
multipanel plots. The axes of the plots are aligned, which makes
comparison easy. In fact, multipanel plots are often much more
effective than plots that overload a single space with many varying
aesthetic parameters.

The Barley data


In his book The Elements of Graphing Data, William Cleveland
discusses the example of data from a field trial on the yield of barley. The
following plot, which attempts to reproduce Figure 1.1 in Cleveland's
book, uses multiple panels to effectively show the three-dimensional
(site, variety, yield) data.

data(barley,package="lattice")
ggplot(barley,aes(yield,variety))+
geom_point(aes(shape=factor(year)),size=2)+
scale_shape(name="Year",solid=FALSE)+
facet_grid(site~.)+
labs(title="Minnesota Barley Data",
x = "Yield (bushels/acre)",y="")
The diagram immediately shows that while on all other sites the yield
was higher in 1931, the reverse is true in Morris. Not only that the
magnitude of the difference in yield in Morris is about the same as that
at other sites. This strongly suggests the possiblity of a data entry
error: the data for Morris was simply reversed.

We can more directly compare the change in magnitudes by drawing a
boxplot.

barley %>%
spread(key=year,value=yield,sep="_") %>%
mutate(diff=year_1932-year_1931) %>%
ggplot(aes(site,diff)) +
geom_boxplot() +
theme(axis.text.x=element_text(angle = 30,hjust=1)) +
labs(title="Minnesota Barley Data",
x="Site",
y="Difference in yields between 1932 and 1931 (bushels/acre)")

This not only brings out the problem with the Morris data but also
shows that there is one outlier in Grand Rapids which is very likely a
transcription error, since its value is roughly the negative of the median
difference.

barley %>%
spread(key=year,value=yield,sep="_") %>%
filter(site=="Grand Rapids")

## variety site year_1932 year_1931


## 1 Svansota Grand Rapids 16.63333 29.66667
## 2 No. 462 Grand Rapids 19.90000 24.93334
## 3 Manchuria Grand Rapids 22.13333 32.96667
## 4 No. 475 Grand Rapids 15.23333 19.70000
## 5 Velvet Grand Rapids 32.23333 23.03333
## 6 Peatland Grand Rapids 26.76667 34.70000
## 7 Glabron Grand Rapids 14.43333 29.13333
## 8 No. 457 Grand Rapids 19.46667 32.16667
## 9 Wisconsin No. 38 Grand Rapids 20.66667 34.46667
## 10 Trebi Grand Rapids 20.63333 29.76667

Data Wrangling
This is from the introduction vignette for dplyr.

library(nycflights13)
dim(flights)

## [1] 336776 19

head(flights)

## # A tibble: 6 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>

filter(flights, month == 1, day == 1)

## # A tibble: 842 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 832 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour
<dbl>,
## # minute <dbl>, time_hour <dttm>

slice(flights, 1:10)

## # A tibble: 10 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>

arrange(flights, year, month, day)

## # A tibble: 336,776 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time
<int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour
<dbl>,
## # minute <dbl>, time_hour <dttm>

arrange(flights, desc(arr_delay))

## # A tibble: 336,776 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 9 641 900 1301 1242
## 2 2013 6 15 1432 1935 1137 1607
## 3 2013 1 10 1121 1635 1126 1239
## 4 2013 9 20 1139 1845 1014 1457
## 5 2013 7 22 845 1600 1005 1044
## 6 2013 4 10 1100 1900 960 1342
## 7 2013 3 17 2321 810 911 135
## 8 2013 7 22 2257 759 898 121
## 9 2013 12 5 756 1700 896 1058
## 10 2013 5 3 1133 2055 878 1250
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time
<int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour
<dbl>,
## # minute <dbl>, time_hour <dttm>

# Select columns by name
select(flights, year, month, day)

## # A tibble: 336,776 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
## 7 2013 1 1
## 8 2013 1 1
## 9 2013 1 1
## 10 2013 1 1
## # ... with 336,766 more rows

# Select all columns between year and day (inclusive)
select(flights, year:day)

## # A tibble: 336,776 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
## 7 2013 1 1
## 8 2013 1 1
## 9 2013 1 1
## 10 2013 1 1
## # ... with 336,766 more rows

# Select all columns except those from year to day (inclusive)
select(flights, -(year:day))

## # A tibble: 336,776 16
## dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
## <int> <int> <dbl> <int> <int> <dbl>
## 1 517 515 2 830 819 11
## 2 533 529 4 850 830 20
## 3 542 540 2 923 850 33
## 4 544 545 -1 1004 1022 -18
## 5 554 600 -6 812 837 -25
## 6 554 558 -4 740 728 12
## 7 555 600 -5 913 854 19
## 8 557 600 -3 709 723 -14
## 9 557 600 -3 838 846 -8
## 10 558 600 -2 753 745 8
## # ... with 336,766 more rows, and 10 more variables: carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time
<dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

rename(flights, tail_num = tailnum)

## # A tibble: 336,776 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time
<int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tail_num <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour
<dbl>,
## # minute <dbl>, time_hour <dttm>

distinct(flights, tailnum)

## # A tibble: 4,044 1
## tailnum
## <chr>
## 1 N14228
## 2 N24211
## 3 N619AA
## 4 N804JB
## 5 N668DN
## 6 N39463
## 7 N516JB
## 8 N829AS
## 9 N593JB
## 10 N3ALAA
## # ... with 4,034 more rows

distinct(flights, origin, dest)

## # A tibble: 224 2
## origin dest
## <chr> <chr>
## 1 EWR IAH
## 2 LGA IAH
## 3 JFK MIA
## 4 JFK BQN
## 5 LGA ATL
## 6 EWR ORD
## 7 EWR FLL
## 8 LGA IAD
## 9 JFK MCO
## 10 LGA ORD
## # ... with 214 more rows

mutate(flights,
gain = arr_delay - dep_delay,
speed = distance / air_time * 60)

## # A tibble: 336,776 21
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 336,766 more rows, and 14 more variables: sched_arr_time
<int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour
<dbl>,
## # minute <dbl>, time_hour <dttm>, gain <dbl>, speed <dbl>

summarise(flights,
delay = mean(dep_delay, na.rm = TRUE))

## # A tibble: 1 1
## delay
## <dbl>
## 1 12.63907

sample_n(flights, 10)

## # A tibble: 10 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 3 20 1125 1130 -5 1438
## 2 2013 11 13 1428 1430 -2 1639
## 3 2013 6 28 750 740 10 1115
## 4 2013 1 24 952 935 17 1241
## 5 2013 7 22 1614 1559 15 1918
## 6 2013 2 5 1341 1350 -9 1512
## 7 2013 1 22 1739 1730 9 1917
## 8 2013 12 7 617 625 -8 711
## 9 2013 11 1 1040 1029 11 1341
## 10 2013 8 6 1257 1300 -3 1456
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>

sample_frac(flights, 0.01)

## # A tibble: 3,368 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 11 16 1652 1700 -8 2009
## 2 2013 8 27 903 909 -6 1108
## 3 2013 6 10 1012 1007 5 1127
## 4 2013 3 29 1629 1630 -1 1949
## 5 2013 1 3 1257 1255 2 1513
## 6 2013 6 4 1234 1245 -11 1341
## 7 2013 6 5 600 600 0 845
## 8 2013 10 16 1824 1805 19 1952
## 9 2013 10 30 2027 2029 -2 2342
## 10 2013 9 25 1812 1715 57 2025
## # ... with 3,358 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour
<dbl>,
## # minute <dbl>, time_hour <dttm>

by_tailnum <- group_by(flights, tailnum)


delay <- summarise(by_tailnum,
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE))
delay <- filter(delay, count > 20, dist < 2000)

Interestingly, the average delay is only slightly related to the average
distance flown by a plane.

ggplot(delay, aes(dist, delay)) +
  geom_point(aes(size = count), alpha = 1/2) +
  geom_smooth() +
  scale_size_area()

## Warning: Removed 1 rows containing non-finite values (stat_smooth).

## Warning: Removed 1 rows containing missing values (geom_point).


destinations <- group_by(flights, dest)
summarise(destinations,
planes = n_distinct(tailnum),
flights = n()
)

## # A tibble: 105 3
## dest planes flights
## <chr> <int> <int>
## 1 ABQ 108 254
## 2 ACK 58 265
## 3 ALB 172 439
## 4 ANC 6 8
## 5 ATL 1180 17215
## 6 AUS 993 2439
## 7 AVL 159 275
## 8 BDL 186 443
## 9 BGR 46 375
## 10 BHM 45 297
## # ... with 95 more rows

daily <- group_by(flights, year, month, day)


(per_day <- summarise(daily, flights = n()))

## Source: local data frame [365 x 4]


## Groups: year, month [?]
##
## year month day flights
## <int> <int> <int> <int>
## 1 2013 1 1 842
## 2 2013 1 2 943
## 3 2013 1 3 914
## 4 2013 1 4 915
## 5 2013 1 5 720
## 6 2013 1 6 832
## 7 2013 1 7 933
## 8 2013 1 8 899
## 9 2013 1 9 902
## 10 2013 1 10 932
## # ... with 355 more rows

(per_month <- summarise(per_day, flights = sum(flights)))

## Source: local data frame [12 x 3]


## Groups: year [?]
##
## year month flights
## <int> <int> <int>
## 1 2013 1 27004
## 2 2013 2 24951
## 3 2013 3 28834
## 4 2013 4 28330
## 5 2013 5 28796
## 6 2013 6 28243
## 7 2013 7 29425
## 8 2013 8 29327
## 9 2013 9 27574
## 10 2013 10 28889
## 11 2013 11 27268
## 12 2013 12 28135

(per_year <- summarise(per_month, flights = sum(flights)))

## # A tibble: 1 2
## year flights
## <int> <int>
## 1 2013 336776

flights %>%
group_by(year, month, day) %>%
select(arr_delay, dep_delay) %>%
summarise(
arr = mean(arr_delay, na.rm = TRUE),
dep = mean(dep_delay, na.rm = TRUE)
) %>%
filter(arr > 30 | dep > 30)

## Adding missing grouping variables: `year`, `month`, `day`

## Source: local data frame [49 x 5]


## Groups: year, month [11]
##
## year month day arr dep
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 16 34.24736 24.61287
## 2 2013 1 31 32.60285 28.65836
## 3 2013 2 11 36.29009 39.07360
## 4 2013 2 27 31.25249 37.76327
## 5 2013 3 8 85.86216 83.53692
## 6 2013 3 18 41.29189 30.11796
## 7 2013 4 10 38.41231 33.02368
## 8 2013 4 12 36.04814 34.83843
## 9 2013 4 18 36.02848 34.91536
## 10 2013 4 19 47.91170 46.12783
## # ... with 39 more rows
More Data Wrangling
Importing
For this example we look at the commodity-wise monthly Wholesale
Price Index data published by the Ministry of Commerce and Industry of
the Government of India (you can get the data here). This is an Excel
file, month2.xls. We can read it into an R data frame using the read_excel
function from the readxl package.

wpi <- read_excel('month2.xls', na = "9999.9")

## DEFINEDNAME: 00 00 00 16 0b 00 00 00 00 00 00 00 00 00 00 45 78 63
65 6c 5f 42 75 69 6c 74 49 6e 5f 44 61 74 61 62 61 73 65 3b 00 00 00 00 1d
03 00 00 33 00
## DEFINEDNAME: 00 00 00 16 0b 00 00 00 00 00 00 00 00 00 00 45 78 63
65 6c 5f 42 75 69 6c 74 49 6e 5f 44 61 74 61 62 61 73 65 3b 00 00 00 00 1d
03 00 00 33 00
## DEFINEDNAME: 00 00 00 16 0b 00 00 00 00 00 00 00 00 00 00 45 78 63
65 6c 5f 42 75 69 6c 74 49 6e 5f 44 61 74 61 62 61 73 65 3b 00 00 00 00 1d
03 00 00 33 00
## DEFINEDNAME: 00 00 00 16 0b 00 00 00 00 00 00 00 00 00 00 45 78 63
65 6c 5f 42 75 69 6c 74 49 6e 5f 44 61 74 61 62 61 73 65 3b 00 00 00 00 1d
03 00 00 33 00

The WPI data uses 9999.9 as a marker for missing data. We tell
read_excel about it so that these entries are correctly marked as NA
rather than being treated as valid numeric values.

In the form in which the WPI data is downloaded, each row consists of
multiple observations, with the month of each observation encoded in the
column header. We want to rearrange the data so that there is one
observation per row, with the month and the year of each observation
in variables which we can use.

First we separate out the commodity names and weights from the
data.

wpi_desc <- wpi %>%
  select(COMM_NAME,COMM_CODE,COMM_WT)
wpi_wide <- wpi %>%
  select(COMM_CODE,starts_with("INDX"))

Next we rearrange the observations.


wpi_tidy <-
  wpi_wide %>%
  gather(starts_with("INDX"),key="time",value="wpi") %>%
  separate(time,into=c("bogus","mon","year"),sep=c(4,6)) %>%
  mutate(mon=as.integer(mon),year=as.integer(year),
         date=as.POSIXct(paste(year,mon,"01",sep="-"))) %>%
  select(-bogus)

gather takes 'wide' data and turns it into a long format. We provide it
the columns containing the data. The column names become the values
of a new column called time and the data in those columns are put into a
variable called wpi. The time variable is still a composite of the text
INDX, the month (2 digits) and the year (3 digits). We use separate to
split these out. We take the month and the year, convert them to
numbers and map them to the first day of the month to ease the
analysis of the data as time series. Finally we eliminate the redundant
bogus column, which just contains the text INDX.
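As an illustration of how separate splits these names, here is a toy example on a single, hypothetical column name (the actual names in the file may have a different number of digits):

tibble(time="INDX0405") %>%
  separate(time, into=c("bogus","mon","year"), sep=c(4,6))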

The WPI commodities include both basic commodities and commodity
groups at various levels of aggregation. Suppose we want to look
at the top three commodity groups, but not the overall category "all
commodities".

major_commodity <-
wpi_desc %>%
filter(str_detect(COMM_CODE,"^1[^0]0*$"))

str_detect from the package stringr returns TRUE or FALSE depending on
whether the string provided matches the regular expression (a quick
check of the pattern is sketched below). Now suppose we want to get
the data for these commodity codes and plot the change in their WPI
from the beginning of our data set.
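A quick check of the pattern on a few made-up codes (purely illustrative):

str_detect(c("1000000000","1100000000","1101000000"), "^1[^0]0*$")
# FALSE TRUE FALSE: only the second code has a non-zero second digit
# followed by nothing but zeros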

left_join(major_commodity,wpi_tidy) %>%
group_by(COMM_CODE) %>%
arrange(date) %>%
mutate(relwpi=wpi/first(wpi)) %>%
ggplot(aes(date,relwpi))+
geom_line(aes(linetype=COMM_NAME))

## Joining, by = "COMM_CODE"

## Warning: Removed 12 rows containing missing values (geom_path).


We use left_join to pick up the data items corresponding to the major
groups, group the observations by commodity group, sort them by date,
compute for each group the relative WPI by dividing the WPI by its first
observation, and then plot.

The following example computes year-on-year inflation rates for rice

rice_wpi <-
semi_join(wpi_tidy,filter(wpi_desc,COMM_NAME=="Rice")) %>%
select(-COMM_CODE) %>%
mutate(lyear=year-1)

## Joining, by = "COMM_CODE"

inner_join(rice_wpi,rice_wpi,by=c("year"="lyear","mon")) %>%
mutate(inflation=wpi.y/wpi.x-1) %>%
select(-lyear,-date.y) %>%
arrange(year,mon) %>%
ggplot(aes(date.x,inflation)) +
geom_line()

## Warning: Removed 4 rows containing missing values (geom_path).


Resources
Hadley Wickham's website, especially his book R for Data Science,
which is freely available for online reading.
Zeileis & Kleiber. Applied Econometrics with R.