
Applied Econometrics:


An Introduction for Research & Business using R

Serge Pajak Nicolas Soulié


serge.pajak@u-psud.fr nicolas.soulie@u-psud.fr

Université Paris-Sud

Université Paris-Sud Econometrics


Course Calendar

Week 1 Introducing simple linear regression and the R software


Week 2 Data management basics, specification of the functional form
Week 3 Getting started with the assignment: data, topic and literature. Q & A.
Week 4 Regression with non-i.i.d. errors (errors that are not identically distributed or not independent)
Week 5 Instrumental variables
Week 6 Discrete choice models
Week 7 Introduction to panel data

References
o Jeffrey Wooldridge, Introductory Econometrics: A Modern Approach, 3rd Edition, 2006
o William Greene, Econometric Analysis, Prentice Hall, 6th Edition, 2008
o Florian Heiss, Using R for introductory econometrics, 2016



Purpose of this course

Economists try to understand and explain many (economic) questions or phenomena:


– Which factors affect people’s wages?
– Are startup companies more successful when the founder is older?
– What are the factors explaining house pricing? Size of house, # of rooms, local
crime, pollution, etc.
– Which factors affect people’s adoption and use of smartphones? Etc.

Dealing with such questions involves:

A theory: economic mechanism, potential explanatory factors, insights about the phenomenon, relevance of the issue, etc.
Data: a dataset including the relevant variables needed to answer your question
(according to theory/literature)
Econometric analysis: data description, modification, analysis and interpretation
This course focuses on the technical (3rd) aspect.
In real life you need all three!



Econometrics

Definition

– The main variable under study, or the explained variable (wage, location choice,
migration choice, etc.)
– which is affected by other factors, or explanatory variables (income, age, gender,
consumption, education, etc.)

Theory: links between explained and explanatory variables

These links are the consequence or result of economic, sociological, psychological, etc.
mechanisms which have been theoretically documented and/or empirically validated in
existing studies



Econometrics

Econometrics seeks to identify the nature of the relationship between an explained
variable and the explanatory variables

Nature: positive, negative, perhaps non-monotonic (positive then negative after a
threshold), etc.
Some models also make it possible to measure the strength of the relationship



Econometrics

Aim of econometrics: creating information on the relationship between variables in order
to:
Help decision-making (firms, people, public policy, etc.)
Test hypotheses
Document a question or phenomenon

For instance, does education affect wage? And if so, what is the return of an additional
year of education on wage?
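
As a rough illustration of this question, the sketch below simulates a hypothetical dataset and estimates the return to a year of education with lm(). The variable names educ and wage, the sample size and the "true" return of 0.8 are all assumptions made up for the example, not course data.

```r
# Hypothetical simulated data (not a real dataset)
set.seed(123)
n <- 500
educ <- sample(8:20, n, replace = TRUE)      # years of education
wage <- 5 + 0.8 * educ + rnorm(n, sd = 2)    # assumed true return: 0.8 per extra year

# Simple linear regression of wage on education
fit <- lm(wage ~ educ)
coef(fit)   # the coefficient on educ estimates the return to one more year
```

With real data the estimate only has this causal reading if education is unrelated to the omitted determinants of wage, which is one of the issues the course comes back to.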



Evaluation: Term Paper

Write a mini-article in applied econometrics


- Short and basic, but structured with motivation, literature review, model, results and
interpretation
- Choice of topic by October 31. Includes name, topic, dataset and references
- Detailed instructions are included on myCourse and links to known public datasets
will be provided
- Must be turned in at the end of January



Installing R

1. Download and install R software: www.r-project.org

2. Download and install RStudio (desktop, free version):


www.rstudio.com/products/rstudio/download/

3. Install the car and wooldridge packages for applied regression. Other packages will
be introduced during the course!



Data management: opening files

For importing files, use "Import dataset" in the ’Environment’ window and choose the
dataset format among csv (including text), Excel, SAS, SPSS or Stata

For example, the HousePrices3.xlsx dataset:


- Click on Import Dataset and then on "From Excel ..."
- Select the file URL
- Change dataset name
- Indicate whether the first line contains the variable names

library(readxl)
HousePrices3 <- read_excel("C:/...HousePrices3.xlsx")

read_excel opens the Excel file "HousePrices3.xlsx" in the selected folder

HousePrices3 is the name of the dataset in R



Data management and analysis: golden rules

Data management and analysis, golden rules:


- Keep an original version of the dataset, and work on a copy
- Keep record of your work using a script
- Set your working directory

Script to keep record of your work on the dataset (new var., graphics, models, etc.):
- In RStudio: File => New File => R Script
- Write the command(s), select them and click on Run
- Add comments using # at the beginning of a line

Set a working directory:


- setwd("C:/Documents and Settings/Data/")
- Display working directory: getwd()
- Display files in working directory: list.files() (ls() lists the objects in the R workspace, not the files)
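
The golden rules can be sketched as the first lines of a script. This is only an illustration: the temporary folder and the tiny data frame named original are hypothetical stand-ins for your own project folder and dataset.

```r
# Sketch of a script header (temporary folder used for illustration only)
old_wd <- getwd()        # remember the starting directory
setwd(tempdir())         # set the working directory
getwd()                  # display the working directory
list.files()             # files in the working directory
ls()                     # objects in the R workspace (not files)

original <- data.frame(price = c(100, 200))   # hypothetical dataset
work <- original                              # work on a copy
work$price_k <- work$price / 1000             # new variable created on the copy only
setwd(old_wd)                                 # restore the starting directory
```

Because all changes are made on work, the object original stays identical to what was imported.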



Data management: opening files

HousePrices3 variables:
- year: year of information collection (1978 or 1981)
- age: age of the house
- nbh: neighborhood, 0 to 6
- cbd: distance (feet) to Central Business District
- inst: distance (feet) to interstate
- price: selling price
- rooms: # rooms in house
- area: square footage of house
- land: square footage of lot
- baths: # of bathrooms
- dist: distance (feet) to incinerator



Data management: opening files

For example, the banks.txt dataset:


- Click on Import Dataset and then on "From CSV ..."
- Select the file URL
- Change dataset name
- Indicate whether the first line contains the variable names
- Fill the delimiter/separator: comma, semicolon, tab or whitespace

valbanks <- read.table("banks.txt", sep = " ", header = TRUE)

read.table opens the file "banks.txt" located in the working directory


valbanks is the name given to the dataset in R
sep= sets the separator: whitespace " ", comma ",", semicolon ";" or tab "\t"
header = TRUE (or T) reads the first line as variable names; R is case-sensitive, so True is not recognized
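
To see how sep and header interact, here is a minimal sketch that writes a small semicolon-separated file to a temporary location and reads it back. The file content and the column names bank, val2007 and val2009 are invented for the example.

```r
# Write a small hypothetical semicolon-separated file to a temporary location
tmp <- tempfile(fileext = ".txt")
writeLines(c("bank;val2007;val2009",
             "BankA;100;40",
             "BankB;80;60"), tmp)

# sep must match the separator used in the file;
# header = TRUE uses the first line as variable names
demo <- read.table(tmp, sep = ";", header = TRUE)
str(demo)   # check names and types after every import
```

If sep does not match the file, read.table typically returns a single column containing whole lines, which str() makes easy to spot.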



Data management: descriptive statistics
Using the HousePrices3 database, for descriptive statistics on quantitative variables:
install.packages("skimr")
library(skimr)
skim(HousePrices3) provides basic statistics for every variable
> skim(HousePrices3)
Skim summary statistics
n obs: 321
n variables: 27

Variable type: character

variable  missing  complete  n     min  max  empty  n_unique
roomsd    0        2247      2247  1    1    0      2

Variable type: numeric

variable  missing  complete  n    mean      sd       p0    p25    median
age       0        321       321  18.01     32.57    0     0      4
area      0        321       321  2106.73   694.96   735   1560   2056
baths     0        321       321  2.34      0.77     1     2      2
cbd       0        321       321  15822.43  8967.11  1000  9000   14000
counter   0        321       321  161       92.81    1     81     161
dist      0        321       321  20715.58  8508.18  5000  13400  19900
...



Data management: descriptive statistics

Using the HousePrices3 database, for descriptive statistics on a quantitative variable:


install.packages("psych")
library(psych)
describe(HousePrices3$price) provides basic statistics on the variable listed after $
vars n mean sd median trimmed mad min max
X1 1 321 96100.66 43223.73 85900 91630.93 37658.04 26000 3e+05
range skew kurtosis se
X1 274000 1.13 1.76 2412.51
The output reports the variable’s mean, median, minimum, maximum, range and also:
- vars: index of the variable (column number)
- sd: standard deviation
- trimmed: trimmed mean, computed without the 10% highest and the 10% lowest values
- mad: median absolute deviation
- skew: skewness index (normal distribution: skewness = 0)
- kurtosis: excess kurtosis index (normal distribution: excess kurtosis = 0, i.e. kurtosis = 3)
- se: standard error of the mean (sd divided by the square root of n)
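
To make these definitions concrete, the sketch below recomputes the same kind of statistics by hand in base R on a simulated, right-skewed vector. The data are hypothetical, and the moment formulas used for skewness and excess kurtosis are a common textbook convention assumed here; they are not guaranteed to match every option of describe().

```r
set.seed(1)
x <- rexp(200, rate = 1 / 50000)   # hypothetical right-skewed "prices"

n <- length(x)
m <- mean(x)
s <- sd(x)
trimmed <- mean(x, trim = 0.1)         # mean without the 10% lowest and 10% highest values
mad_x   <- mad(x)                      # median absolute deviation (scaled by 1.4826 by default)
se      <- s / sqrt(n)                 # standard error of the mean
skew    <- mean((x - m)^3) / s^3       # 0 for a symmetric distribution
kurt    <- mean((x - m)^4) / s^4 - 3   # excess kurtosis: 0 for a normal distribution

round(c(mean = m, sd = s, trimmed = trimmed, mad = mad_x,
        skew = skew, kurtosis = kurt, se = se), 2)
```

For a right-skewed variable such as house prices, the trimmed mean and the median sit below the mean, which is what the describe() output for price also shows.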



Data management: descriptive statistics

Descriptive statistics on a quantitative variable by subcategory of another variable:


describeBy(HousePrices3$price, group=HousePrices3$nbh) provides statistics on
price for each value of nbh

Descriptive statistics by group


group: 0
vars n mean sd median trimmed mad min max
X1 1 121 108737.4 50748.08 98000 104952.4 48925.8 26000 3e+05
range skew kurtosis se
X1 274000 0.9 1.06 4613.46
---------------------------------------------------------------
group: 1
vars n mean sd median trimmed mad min max
X1 1 27 109800 46247.02 89500 105595.6 40178.46 58000 216000
range skew kurtosis se
X1 158000 0.81 -0.65 8900.24
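
The same idea of statistics by group can be sketched in base R, without the psych package. The data frame houses and its values are hypothetical.

```r
# Hypothetical data: prices in two neighborhoods
houses <- data.frame(
  price = c(100, 120, 140, 200, 220, 240),
  nbh   = c(0, 0, 0, 1, 1, 1)
)

# Mean of price within each value of nbh
group_means <- tapply(houses$price, houses$nbh, mean)
group_means

# Several statistics at once, one row per group
aggregate(price ~ nbh, data = houses,
          FUN = function(v) c(mean = mean(v), sd = sd(v)))
```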



Data management: descriptive statistics

Descriptive statistics on a quantitative variable for a subset of observations:


describe(HousePrices3$price[HousePrices3$rooms>5]) provides statistics on price for
houses with more than 5 rooms (use rooms>=5 for 5 rooms or more)
vars n mean sd median trimmed mad min max
X1 1 287 100685.8 43191.59 90000 96725.41 40771.5 26000 3e+05
range skew kurtosis se
X1 274000 1.07 1.68 2549.52
describe(HousePrices3$price[HousePrices3$rooms>5 & HousePrices3$baths>3])
provides statistics on price for houses with more than 5 rooms and more than 3
baths
vars n mean sd median trimmed mad min max
X1 1 4 149339.2 45336.09 154750 149339.2 48449.14 98000 189857
range skew kurtosis se
X1 91857 -0.1 -2.31 22668.05
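
The difference between > and >= matters when selecting observations. A minimal sketch on hypothetical vectors:

```r
rooms <- c(4, 5, 5, 6, 7)            # hypothetical values
price <- c(80, 90, 95, 120, 150)

price[rooms > 5]                     # strictly more than 5 rooms
price[rooms >= 5]                    # 5 rooms or more
mean(price[rooms >= 5 & price > 100])   # both conditions must hold
```

Here rooms > 5 keeps two houses while rooms >= 5 keeps four, so the reported n is a quick check that the condition selects what you intended.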



Data management: descriptive statistics
For a qualitative variable: simple frequency table of age
as.data.frame(table(HousePrices3$age))



Data management: descriptive statistics
A 2-way cross-table
library(gmodels)
CrossTable(HousePrices3$rooms,HousePrices3$nbh, digits=2,
prop.r=FALSE, prop.c=TRUE,prop.t = FALSE, prop.chisq = FALSE)
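
A base R sketch of the same kind of table, on hypothetical vectors: table() gives the counts and prop.table() with margin = 2 gives column proportions, which is the information requested with prop.c=TRUE.

```r
rooms <- c(4, 4, 5, 5, 5, 6)   # hypothetical values
nbh   <- c(0, 1, 0, 0, 1, 1)

tab <- table(rooms, nbh)                  # counts for each (rooms, nbh) pair
col_prop <- prop.table(tab, margin = 2)   # proportions within each column (each nbh)
round(col_prop, 2)
```

Each column of col_prop sums to 1, so it reads as the distribution of rooms within a given neighborhood.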



Data management: operations with variables

How to create a new variable


newvariable <- newvariable_formula
The formula can be any operation on existing variables

For instance, converting distances from feet to meters:


- Creating conversion constant (1 foot = 0.3048 meter): tometers <- 0.3048
- Then, converting from feet to meters:
HousePrices3$cbdm <- HousePrices3$cbd*tometers
HousePrices3$instm <- HousePrices3$inst*tometers
HousePrices3$distm <- HousePrices3$dist*tometers
- And converting the area from square feet to square meters (1 sq ft ≈ 0.0929 sq m):
HousePrices3$areasm <- HousePrices3$area*0.0929



Data management: logical and math operators

Math operators in R:
- Usual math operators: - , + , / and *
- Less than / less than or equal / greater than / greater than or equal: <, <=, >, >=
- Is equal to (comparison operator in condition/constraint): ==

Usual Math functions:


- Power: x^2, e.g. income_square <- income^2
- Square root: sqrt(x), e.g. size_sqrt <- sqrt(size + 1)
- Log: log(x) (natural logarithm) or log10(x) (base 10), e.g. log_income <- log(income + 1)

Logical operators in R:
- AND: &, e.g. [HousePrices3$rooms>=5 & HousePrices3$baths>3] computes the
command only for obs. in HousePrices3 which have rooms>=5 AND baths>3
- OR: |, called the ’pipe’ (on a French keyboard, Mac: Shift + Alt + L, or PC: Alt Gr + 6), e.g.
[HousePrices3$rooms>=5 | HousePrices3$area>2000] computes the command only
for observations in HousePrices3 which have rooms>=5 OR area>2000
- NOT EQUAL: !=, e.g. [HousePrices3$rooms!=5] computes the command only for
observations in HousePrices3 which have variable rooms NOT EQUAL to 5
- NOT: !, placed in front of a condition to negate it, e.g. [!(HousePrices3$rooms>=5)]
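
The logical operators can be tried on a small hypothetical data frame (the name df and its values are made up for the example):

```r
df <- data.frame(rooms = c(4, 5, 6, 7),
                 baths = c(1, 4, 2, 4),
                 area  = c(1500, 1800, 2500, 3000))

df[df$rooms >= 5 & df$baths > 3, ]     # AND: both conditions hold
df[df$rooms >= 5 | df$area > 2000, ]   # OR: at least one condition holds
df[df$rooms != 5, ]                    # NOT EQUAL: rooms different from 5
df[!(df$rooms >= 5), ]                 # NOT: negation of a whole condition
```

The comma before the closing bracket matters: the condition selects rows, and leaving the column position empty keeps all columns.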



Data management: operations with variables

Creating dichotomous variable (1/0):

Build a variable isTenYearsOld = 1 if the house is 10 years old, and isTenYearsOld =
0 otherwise
HousePrices3$isTenYearsOld <- as.numeric(HousePrices3$age == 10)
Build a variable isRecent = 1 if the house is less than 10 years old, and isRecent =
0 otherwise
HousePrices3$isRecent <- as.numeric(HousePrices3$age < 10)
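
A quick way to check a dichotomous variable is to look at its values and its mean, which is the share of observations coded 1. A minimal sketch on a hypothetical age vector:

```r
age <- c(0, 3, 10, 25, 40)               # hypothetical house ages

isTenYearsOld <- as.numeric(age == 10)   # TRUE/FALSE converted to 1/0
isRecent      <- as.numeric(age < 10)

isTenYearsOld
isRecent
mean(isRecent)    # share of recent houses in the sample
table(isRecent)   # counts of 0 and 1
```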



Data management: operations with variables

How to rename variables


names(HousePrices3)[names(HousePrices3) == "areasm"] <- "areasmeters"

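Or, duplicate the variable then delete the old variable (here cbdmeters):
HousePrices3$cbdm <- HousePrices3$cbdmeters
HousePrices3$cbdmeters <- NULL

Assigning NULL to a variable deletes it from the dataset. Variables can also be excluded
by position with a negative index, e.g. HousePrices3[-c(1, 2)] drops the first two variables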
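
Renaming and deleting can be checked on a small hypothetical data frame by printing names() after each step:

```r
df <- data.frame(cbdmeters = c(300, 900), areasm = c(100, 150))   # hypothetical

names(df)[names(df) == "areasm"] <- "areasmeters"   # rename in place
df$cbdm <- df$cbdmeters                             # duplicate under the new name
df$cbdmeters <- NULL                                # delete the old variable
names(df)
```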



Graphical analysis

With a simulated (fictitious) dataset


x <- rnorm(n=100, mean=0.5, sd=0.1)
y <- 2+4*rnorm(n=100, mean=0.5, sd=0.1)

Simple histogram: hist(x)


Customized histogram
hist(x,
main="Histogram of x",
xlab="Values on the x-axis",
border="black",
col="blue",
xlim=c(-1,1),
breaks=5)

Boxplot: boxplot(x)



Graphical analysis

Plotting one variable against the other, with a regression line


plot(x,y,
xlab="1st variable", ylab="2nd variable")
abline(lm(y ~ x), col = "blue", lty="dashed")
Or using ggplot2 package:
install.packages("ggplot2")
library(ggplot2)
ggplot(data=HousePrices3, aes(y=price, x=area)) + geom_point()

aes: declare variables, and geom_point(): draw scatter plot
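
The line drawn by abline(lm(y ~ x)) is defined by the two estimated coefficients. The sketch below makes this visible on simulated data; note the assumption that y is built from x here (with a true slope of 4), so that the fitted line has something to recover.

```r
set.seed(42)
x <- rnorm(n = 100, mean = 0.5, sd = 0.1)
y <- 2 + 4 * x + rnorm(n = 100, sd = 0.1)   # assumption: y depends on x

fit <- lm(y ~ x)
coef(fit)                 # intercept and slope of the fitted line

plot(x, y, xlab = "x", ylab = "y")
abline(fit, col = "blue", lty = "dashed")   # abline() takes both coefficients from fit
```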



Graphical analysis

Plotting one variable against the other, with a regression line


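ggplot(data=HousePrices3, aes(y=price, x=area)) + geom_point() +
  stat_smooth(method = "lm", formula = y ~ x)

The + must be placed at the end of a line, not at the beginning of the next one:
otherwise R treats the first line as a complete command and the next line fails.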

stat_smooth: fitted line, here using linear model (lm).


Be careful: inside ggplot’s layers (e.g. the formula argument of stat_smooth()),
refer to y and x (as declared in aes()) and not to the variables’ names (e.g. price, area, etc.)



Graphical analysis
Or with a quadratic regression line

ggplot(data=HousePrices3, aes(y=price, x=area)) + geom_point() +
  stat_smooth(method = "lm", formula = y ~ poly(x, 2))

poly(x, 2): indicates a polynomial of x of degree 2 (i.e. a fit of the form $y = \beta_0 + \beta_1 x + \beta_2 x^2$).


Try with a polynomial of degree 3.
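
One way to judge whether a higher degree is worth it is to fit the models with lm() and compare them. The sketch below does this on simulated data whose true relationship is assumed to be quadratic; the numbers are invented for the example.

```r
set.seed(7)
x <- runif(200, 0, 10)
y <- 1 + 2 * x - 0.3 * x^2 + rnorm(200)   # hypothetical quadratic relationship

fit1 <- lm(y ~ x)            # degree 1
fit2 <- lm(y ~ poly(x, 2))   # degree 2
fit3 <- lm(y ~ poly(x, 3))   # degree 3

# R-squared never decreases when a term is added
sapply(list(deg1 = fit1, deg2 = fit2, deg3 = fit3),
       function(m) summary(m)$r.squared)

anova(fit2, fit3)   # F-test: does the cubic term improve the fit?
```

Because R-squared mechanically increases with the degree, the F-test (or a plot of the fitted curves) is the more informative comparison.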
Graphical analysis
Now using the crime database
library(haven)
crime <- read_dta(".../crime.dta")
Simple histogram + scatter plot to detect correlation between variables
hist(crime$murder)
plot(crime$pctmetro, crime$murder,
xlab="Percent of pop living in the city",
ylab="Crime per 1000s inhabitants")
text(crime$pctmetro, crime$murder, labels=crime$state, pos=2)



Graphical analysis
Plot excluding "dc":
plot(crime$pctmetro[crime$state!="dc"], crime$murder[crime$state!="dc"],
xlab="Percent of pop living in the city",
ylab="Crime per 1000s inhabitants")



Graphical analysis
Or using ggplot (excluding "dc"), and adding a fitted curve:
ggplot(data=crime[(crime$state!="dc"),], aes(y=murder, x=pctmetro)) +
  geom_point() +
  stat_smooth(method = "lm", formula = y ~ poly(x, 2))



Graphical analysis

Now using the Banks losses data


valbanks<-scan("banks.txt",
what=list(0,0,""), sep="", skip=1, comment.char="#")
valbanks
valj2007<-valbanks[[1]]
valj2009<-valbanks[[2]]
namebank<-valbanks[[3]]
percent_losses<-(valj2009-valj2007)/valj2007
percent_losses
abs_losses<-(valj2007-valj2009)
abs_losses
plot(abs_losses, percent_losses,
main="Absolute Losses vs. Relative Losses",
xlab="Losses (absolute, in billions)",
ylab="Relative losses (fraction of January 2007 value)",
col="blue", pch = 19, cex = 1, lty = "solid", lwd = 2)
# labels are added after the plot is drawn; pos = 4 places them to the right of each point
text(abs_losses, percent_losses, labels=namebank, cex= 0.7, pos = 4)
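
The two loss measures can be checked on a few made-up values (the numbers below are hypothetical, not taken from banks.txt). With this definition percent_losses is a fraction, and it is negative when the value fell between 2007 and 2009.

```r
valj2007 <- c(100, 80, 50)   # hypothetical values in January 2007
valj2009 <- c(40, 60, 45)    # hypothetical values in January 2009

abs_losses     <- valj2007 - valj2009                # positive number = loss
percent_losses <- (valj2009 - valj2007) / valj2007   # negative fraction = loss

abs_losses
round(100 * percent_losses, 1)   # expressed in percent
```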



Graphical analysis

Complex representation for categorical variables


library(vcd)
mosaic(...)



Graphical analysis

library(vcd)
# store isLarge inside the dataset so that mosaic() finds it through data =
HousePrices3$isLarge <- as.numeric(HousePrices3$areasm >= 110)
mosaic(~ isRecent + isLarge + nbh4,
data = HousePrices3, shade=TRUE, legend=TRUE )



Extra resources for graphical analysis

Commented examples:
https://www.harding.edu/fmccown/r/
https://www.statmethods.net/graphs/line.html
https://en.wikibooks.org/wiki/R_Programming/Graphics

More ’impressive’ results:


https://www.r-graph-gallery.com

