Econometrics 1 - Introduction - To - Econometrics - Using - R

Applied Econometrics: An Introduction for Research & Business using R
Serge Pajak Nicolas Soulié

serge.pajak@u-psud.fr nicolas.soulie@u-psud.fr
Université Paris-Sud
Université Paris-Sud Econometrics

Course Calendar
Week 1 Introducing simple linear regression and the R software

Week 2 Data management basics, specification of the functional form
Week 3 Getting started with the assignment: data, topic and literature. Q & A.
Week 4 Regression with non-i.i.d. errors (not identically distributed and/or not independent)
Week 5 Instrumental variables
Week 6 Discrete choice models
Week 7 Introduction to panel data
References
- Jeffrey Wooldridge, Introductory Econometrics: A Modern Approach, 3rd Edition, 2006
- William Greene, Econometric Analysis, Prentice Hall, 6th Edition, 2008
- Florian Heiss, Using R for Introductory Econometrics, 2016

Purpose of this course
Economists try to understand and explain many (economic) questions or phenomena:

– Which factors affect people’s wages?
– Are startup companies more successful when the founder is older?
– What are the factors explaining house pricing? Size of house, # of rooms, local
crime, pollution, etc.
– Which factors affect people’s adoption and use of smartphones? Etc.
Dealing with such questions involves:
A theory: economic mechanism, potential explanatory factors, insights about the phenomenon, relevance of the issue, etc.
Data: a dataset including the relevant variables needed to answer your question (according to theory/literature)
Econometric analysis: data description, modification, analysis and interpretation
This course focuses on the technical (third) aspect.
In real life you need all three!

Econometrics
Definition
– The main variable under study, or the explained variable (wage, location choice, migration choice, etc.)
– which is affected by other factors, or explanatory variables (income, age, gender, consumption, education, etc.)
Theory: links between the explained and explanatory variables

Consequence or result of economic, sociological, psychological, etc. mechanisms which have been theoretically documented and/or empirically validated in existing studies

Econometrics
Econometrics seeks to identify the nature of the relationship between an explained variable and the explanatory variables
Nature: positive, negative, perhaps non-monotonic (positive then negative after a threshold), etc.
Some models also make it possible to measure the strength of the relationship

Econometrics
Aim of econometrics: creating information on the relationship between variables in order to:
Help decision-making (firms, people, public policy, etc.)
Test hypotheses
Document a question or phenomenon
For instance, does education affect wages? And if so, what is the return to an additional year of education in terms of wage?
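
As a rough illustration of the education and wage question, here is a minimal sketch on simulated data. The variable names (educ, wage), the sample size and the "true" return of 0.8 are assumptions chosen for the example; they are not course data.

```r
# Simulated data: NOT real wages; every number here is an assumption
set.seed(123)
n    <- 200
educ <- sample(8:20, size = n, replace = TRUE)        # years of education
wage <- 5 + 0.8 * educ + rnorm(n, mean = 0, sd = 2)   # hourly wage, true return = 0.8

# Simple linear regression of wage on education
model <- lm(wage ~ educ)
coef(model)   # intercept and estimated return to one more year of education
```

The coefficient on educ is the estimated change in wage associated with one additional year of education. With real data it should be interpreted with care, since other factors also affect wages.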

Evaluation: Term Paper
Write a mini-article in applied econometrics

- Short and basic, but structured with motivation, literature review, model, results and
interpretation
- Choice of topic by October 31. Includes name, topic, dataset and references
- Detailed instructions are included on myCourse and links to known public datasets
will be provided
- Must be turned in at the end of January

Installing R
1. Download and install R: www.r-project.org
2. Download and install RStudio (desktop, free version): www.rstudio.com/products/rstudio/download/
3. Install the car and wooldridge packages for applied regression. Other packages will be introduced during the course!

Data management: opening files
For importing files, use "Import dataset" in the ’Environment’ window and choose the
dataset format among csv (including text), Excel, SAS, SPSS or Stata
For example, the HousePrices3.xlsx dataset:

- Click on Import Dataset and then on "From Excel ..."
- Select the file URL
- Change dataset name
- Indicate if first line includes variable name
library(readxl)
HousePrices3 <- read_excel("C:/...HousePrices3.xlsx")
read_excel() opens the Excel file "HousePrices3.xlsx" at the given path

HousePrices3 is the name of the dataset in R

Data management and analysis: golden rules
- Keep an original version of the dataset, and work on a copy
- Keep record of your work using a script
- Set your working directory
Use a script to keep a record of your work on the dataset (new variables, graphics, models, etc.):
- In RStudio: File => New File => R Script
- Write the command(s), select them and click on Run
- Add comments using # at the beginning of a line
Set a working directory:

- setwd("C:/Documents and Settings/Data/")
- Display working directory: getwd()
- Display files in working directory: list.files() (ls() lists the objects in the R workspace, not the files)
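
The working-directory commands can be tried safely with a short sketch like the one below. It assumes a temporary folder (tempdir()) in place of your own data folder, and the file name example.csv is made up for the illustration.

```r
# Work in a temporary folder so that nothing on your disk is overwritten
old_dir <- getwd()                 # remember the current working directory
setwd(tempdir())                   # tempdir() stands in for your own data folder

writeLines("a,b", "example.csv")   # create a small file (hypothetical name)
files_here   <- list.files()       # files in the working directory
objects_here <- ls()               # R objects in the workspace, not files

setwd(old_dir)                     # go back to the original directory
```

Here files_here contains "example.csv", while objects_here only contains names of R objects such as old_dir.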

HousePrices3 variables:
- year: year of information collection (1978 or 1981)
- age: age of the house
- nbh: neighborhood, 0 to 6
- cbd: distance (feet) to Central Business District
- inst: distance (feet) to interstate
- price: selling price
- rooms: # rooms in house
- area: square footage of house
- land: square footage of lot
- baths: # of bathrooms
- dist: distance (feet) to incinerator

For example, the banks.txt dataset:

- Click on Import Dataset and then on "From CSV ..."
- Select the file URL
- Change dataset name
- Indicate if first line includes variable name
- Fill the delimiter/separator: comma, semicolon, tab or whitespace
valbanks <- read.table("banks.txt", sep = " ", header = TRUE)
read.table opens the file "banks.txt" in the working directory

valbanks is the new name of the dataset
sep = sets the separator: whitespace " ", comma ",", semicolon ";" or tab "\t"
header = TRUE (or T) reads the first line as variable names; R is case-sensitive, so True is not recognized
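
To see what sep and header do without needing banks.txt, here is a small self-contained sketch. The file content, the bank names and the column names are invented for the example.

```r
# Build a tiny whitespace-separated file; the values are made up for illustration
path <- tempfile(fileext = ".txt")
writeLines(c("val2007 val2009 name",
             "100 40 BankA",
             "80 60 BankB"), path)

# header = TRUE: the first line holds the variable names
toybanks <- read.table(path, sep = " ", header = TRUE)
str(toybanks)   # two numeric columns and one text column
```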

Data management: descriptive statistics
Using the HousePrices3 dataset, for descriptive statistics on quantitative variables:
install.packages("skimr")
library(skimr)
skim(HousePrices3) provides basic statistics for every variable
> skim(HousePrices3)
Skim summary statistics
 n obs: 321
 n variables: 27

Variable type: character
 variable missing complete    n min max empty n_unique
   roomsd       0     2247 2247   1   1     0        2

Variable type: numeric
 variable missing complete   n     mean      sd   p0   p25 median
      age       0      321 321    18.01   32.57    0     0      4
     area       0      321 321  2106.73  694.96  735  1560   2056
    baths       0      321 321     2.34    0.77    1     2      2
      cbd       0      321 321 15822.43 8967.11 1000  9000  14000
  counter       0      321 321   161      92.81    1    81    161
     dist       0      321 321 20715.58 8508.18 5000 13400  19900
...

Using the HousePrices3 dataset, for descriptive statistics on a single quantitative variable:

install.packages("psych")
library(psych)
describe(HousePrices3$price) provides basic statistics on the variable listed after $
> describe(HousePrices3$price)
   vars   n     mean       sd median  trimmed      mad   min   max
X1    1 321 96100.66 43223.73  85900 91630.93 37658.04 26000 3e+05
    range skew kurtosis      se
X1 274000 1.13     1.76 2412.51
Along with the variable’s mean, median, minimum, maximum and range, the output shows:
- vars: index number of the variable
- sd: standard deviation
- trimmed: trimmed mean, computed without the 10% highest and 10% lowest values
- mad: median absolute deviation
- skew: skewness index (0 for a normal distribution)
- kurtosis: excess kurtosis index (0 for a normal distribution, whose raw kurtosis is 3)
- se: standard error of the mean (sd divided by the square root of n)
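
The statistics listed above can be recomputed by hand in base R, which may help to see what each one measures. This is a sketch on a simulated right-skewed variable (the data and the names are assumptions). The skewness and kurtosis here are simple moment-based versions, so other packages may report slightly different values depending on their small-sample corrections.

```r
set.seed(1)
x <- rexp(500, rate = 1 / 100)   # simulated, right-skewed values (illustrative)

n       <- length(x)
m       <- mean(x)
s       <- sd(x)
trimmed <- mean(x, trim = 0.1)          # drops the lowest and highest 10%
mad_x   <- mad(x)                       # median absolute deviation (scaled by 1.4826)
se      <- s / sqrt(n)                  # standard error of the mean
skew    <- mean((x - m)^3) / s^3        # about 0 for a symmetric distribution
ex_kurt <- mean((x - m)^4) / s^4 - 3    # excess kurtosis: about 0 for a normal

round(c(mean = m, sd = s, trimmed = trimmed, mad = mad_x,
        skew = skew, kurtosis = ex_kurt, se = se), 2)
```

For a right-skewed variable like this one, the trimmed mean is below the mean and the skewness is positive, as for price in the output above.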

Descriptive statistics on a quantitative variable by subcategory of another variable:

describeBy(HousePrices3$price, group=HousePrices3$nbh) provides statistics on price for each sub-group defined by the values of nbh
Descriptive statistics by group

group: 0
   vars   n     mean       sd median  trimmed     mad   min   max
X1    1 121 108737.4 50748.08  98000 104952.4 48925.8 26000 3e+05
    range skew kurtosis      se
X1 274000  0.9     1.06 4613.46
---------------------------------------------------------------
group: 1
   vars  n   mean       sd median  trimmed      mad   min    max
X1    1 27 109800 46247.02  89500 105595.6 40178.46 58000 216000
    range skew kurtosis      se
X1 158000 0.81    -0.65 8900.24
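
The same idea of statistics by group can be sketched in base R with tapply(). The toy data frame below is made up for the illustration and only reuses the variable names price and nbh.

```r
# Toy data frame; prices and neighborhoods are made up for illustration
toy <- data.frame(price = c(100, 120, 90, 200, 220, 150),
                  nbh   = c(0, 0, 0, 1, 1, 1))

# Mean and standard deviation of price within each value of nbh
mean_by_nbh <- tapply(toy$price, toy$nbh, mean)
sd_by_nbh   <- tapply(toy$price, toy$nbh, sd)
mean_by_nbh
```

tapply() returns one value per group, named after the group, which is convenient for a quick check before using a richer function such as describeBy().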

Descriptive statistics on a quantitative variable for a subset of observations:

describe(HousePrices3$price[HousePrices3$rooms>5]) provides statistics on price for houses with more than 5 rooms (use >= 5 for 5 rooms or more)
   vars   n     mean       sd median  trimmed     mad   min   max
X1    1 287 100685.8 43191.59  90000 96725.41 40771.5 26000 3e+05
    range skew kurtosis      se
X1 274000 1.07     1.68 2549.52
describe(HousePrices3$price[HousePrices3$rooms>5 & HousePrices3$baths>3]) provides statistics on price for houses with more than 5 rooms and more than 3 baths
   vars n     mean       sd median  trimmed      mad   min    max
X1    1 4 149339.2 45336.09 154750 149339.2 48449.14 98000 189857
   range skew kurtosis       se
X1 91857 -0.1    -2.31 22668.05
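
The difference between > and >= when selecting observations is easy to check on a small example. The data frame below is invented for the illustration.

```r
# Toy data, made up for illustration
toy <- data.frame(price = c(100, 120, 90, 200, 220, 150),
                  rooms = c(4, 5, 6, 7, 5, 8),
                  baths = c(1, 2, 2, 4, 3, 4))

n_more_than_5 <- sum(toy$rooms > 5)    # strictly more than 5 rooms
n_5_or_more   <- sum(toy$rooms >= 5)   # 5 rooms or more

# Mean price for houses with more than 5 rooms AND more than 3 baths
mean_big <- mean(toy$price[toy$rooms > 5 & toy$baths > 3])
```

In this toy data, 3 houses have more than 5 rooms but 5 houses have 5 rooms or more, so the choice of operator changes the subset.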

For a qualitative or discrete variable: simple frequency table of age
as.data.frame(table(HousePrices3$age))

A two-way cross-table (rooms in rows, nbh in columns, with column proportions)
library(gmodels)
CrossTable(HousePrices3$rooms, HousePrices3$nbh, digits = 2,
           prop.r = FALSE, prop.c = TRUE, prop.t = FALSE, prop.chisq = FALSE)
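
If gmodels is not installed, a similar cross-table can be sketched with base R only, using table() and prop.table(). The toy data below is made up. The margin = 2 argument gives column proportions, which is roughly what prop.c = TRUE requests.

```r
# Toy data, made up for illustration
toy <- data.frame(rooms = c(4, 5, 5, 6, 6, 6),
                  nbh   = c(0, 0, 1, 1, 1, 0))

freq      <- table(toy$rooms)                # one-way frequency table
cross     <- table(toy$rooms, toy$nbh)       # two-way table: rooms (rows) by nbh (columns)
col_share <- prop.table(cross, margin = 2)   # proportions within each column
round(col_share, 2)
```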

Data management: operations with variables
How to create a new variable

newvariable <- newvariable_formula
The formula can be any operation on existing variables
For instance, converting distances from feet to meters:

- Creating the conversion constant (1 foot = 0.3048 meter): tometers <- 0.3048
- Then, converting from feet to meters:
HousePrices3$cbdm <- HousePrices3$cbd*tometers
HousePrices3$instm <- HousePrices3$inst*tometers
HousePrices3$distm <- HousePrices3$dist*tometers
- Converting an area from square feet to square meters (1 square foot = 0.0929 square meter):
HousePrices3$areasm <- HousePrices3$area*0.0929

Data management: logical and math operators
Math operators in R:
- Usual math operators: - , + , / and *
- Less than / less than or equal / greater than / greater than or equal: <, <=, >, >=
- Is equal to (comparison operator in a condition/constraint): ==
Usual math functions:

- Power: x^2, e.g. income_square <- income^2
- Square root: sqrt(x), e.g. size_sqrt <- sqrt(size + 1)
- Log: log(x) (natural log) or log10(x) (base 10), e.g. log_income <- log(income + 1)
Logical operators in R:
- AND: &, e.g. [HousePrices3$rooms>=5 & HousePrices3$baths>3] runs the command only for observations in HousePrices3 which have rooms>=5 AND baths>3
- OR: |, called ’pipe’ or vertical bar (on an AZERTY keyboard, Mac: Shift + Alt + L, or PC: Alt Gr + 6), e.g. [HousePrices3$rooms>=5 | HousePrices3$area>2000] runs the command only for observations in HousePrices3 which have rooms>=5 OR area>2000
- NOT EQUAL: !=, e.g. [HousePrices3$rooms!=5] runs the command only for observations in HousePrices3 which have rooms NOT EQUAL to 5; more generally, ! negates a condition, e.g. !(HousePrices3$rooms==5)
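
A short sketch of these operators on a toy data frame (values made up for the illustration) shows that each condition is simply a vector of TRUE/FALSE values, one per observation.

```r
# Toy data, made up for illustration
toy <- data.frame(rooms = c(4, 5, 6, 7),
                  baths = c(1, 4, 2, 4),
                  area  = c(1500, 1800, 2500, 1900))

both    <- toy$rooms >= 5 & toy$baths > 3     # AND
either  <- toy$rooms >= 5 | toy$area > 2000   # OR
not_5   <- toy$rooms != 5                     # not equal
negated <- !(toy$rooms == 5)                  # ! negates any condition

toy[both, ]   # rows meeting both conditions
```

Placing such a logical vector inside square brackets keeps only the observations for which it is TRUE.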

Creating a dichotomous (dummy) variable (1/0):
Build a variable isTenYearsOld = 1 if the house is 10 years old, and isTenYearsOld = 0 otherwise
HousePrices3$isTenYearsOld <- as.numeric(HousePrices3$age == 10)
Build a variable isRecent indicating that the house is less than 10 years old
HousePrices3$isRecent <- as.numeric(HousePrices3$age < 10)
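
The creation of new variables and dummies can be checked on a toy data frame before applying it to the real dataset. The values below are made up and only the variable names follow the slides.

```r
# Toy data, made up for illustration
toy <- data.frame(age = c(0, 4, 10, 35),
                  cbd = c(1000, 9000, 14000, 20000))   # distances in feet

toy$cbdm          <- toy$cbd * 0.3048                  # feet to meters
toy$isTenYearsOld <- as.numeric(toy$age == 10)         # TRUE/FALSE becomes 1/0
toy$isRecent      <- as.numeric(toy$age < 10)
toy
```

as.numeric() turns the logical result of the comparison into 1 (TRUE) or 0 (FALSE), which is the usual coding for a dummy variable in a regression.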

How to rename variables

names(HousePrices3)[names(HousePrices3) == "areasm"] <- "areasmeters"
Or, duplicate the variable, then delete the old one (here cbdmeters):
HousePrices3$cbdm <- HousePrices3$cbdmeters
HousePrices3 <- HousePrices3[ , !(names(HousePrices3) %in% c("cbdmeters"))]
!(... %in% c(...)): excludes the variables whose names are listed inside c()
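
Renaming and dropping columns can be rehearsed on a toy data frame. This is a sketch with made-up values; assigning NULL to a column is one common base R way to delete it, shown here as an alternative.

```r
# Toy data, made up for illustration
toy <- data.frame(price = c(100, 200), areasm = c(90, 150), cbdmeters = c(300, 4000))

# Rename by matching on the current name
names(toy)[names(toy) == "areasm"] <- "areasmeters"

# Copy under a new name, then drop the old column
toy$cbdm      <- toy$cbdmeters
toy$cbdmeters <- NULL          # assigning NULL removes the column
names(toy)
```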

Graphical analysis
With a fictitious created dataset

x <- rnorm(n=100, mean=0.5, sd=0.1)
y <- 2+4*rnorm(n=100, mean=0.5, sd=0.1)
Simple histogram: hist(x)

Customized histogram
hist(x,
     main="Histogram of x",
     xlab="Values on the x-axis",
     border="black",
     col="blue",
     xlim=c(-1,1),
     breaks=5)
Boxplot: boxplot(x)

Graphical analysis
Plotting one variable against the other, with a regression line

plot(x,y,
xlab="1st variable", ylab="2nd variable")
abline(lm(y ~ x), col = "blue", lty="dashed")
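
In the fictitious dataset above, y is drawn independently of x, so the fitted line is expected to be roughly flat. The sketch below is a variant in which y is assumed to depend on x (slope 4 plus noise), so that the regression line has something to pick up. The seed and the noise level are arbitrary choices.

```r
set.seed(42)
x <- rnorm(n = 100, mean = 0.5, sd = 0.1)
y <- 2 + 4 * x + rnorm(n = 100, mean = 0, sd = 0.1)   # assumption: y depends on x

fit <- lm(y ~ x)
plot(x, y, xlab = "x", ylab = "y")
abline(fit, col = "blue", lty = "dashed")   # abline() reads intercept and slope from fit
coef(fit)
```

Storing the model in fit makes it possible to both draw the line and inspect the estimated coefficients.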
Or using ggplot2 package:
install.packages("ggplot2")
library(ggplot2)
ggplot(data=HousePrices3, aes(y=price, x=area)) + geom_point()
aes: declare variables, and geom_point(): draw scatter plot

Graphical analysis
Plotting one variable against the other, with a regression line

ggplot(data=HousePrices3, aes(y=price, x=area)) + geom_point() +
  stat_smooth(method = "lm", formula = y ~ x)
stat_smooth: fitted line, here using a linear model (lm).

Be careful: inside ggplot’s layers (e.g. the formula argument of stat_smooth()), refer to y and x (as declared in aes()) and not to the variables’ names (e.g. price, area, etc.)

Graphical analysis
Or with a quadratic regression line

ggplot(data=HousePrices3, aes(y=price, x=area)) + geom_point() +
  stat_smooth(method = "lm", formula = y ~ poly(x, 2))
poly(x, 2): indicates a polynomial of degree 2 in x (i.e. $y = \beta_0 + \beta_1 x + \beta_2 x^2$).

Try with a polynomial of degree 3.
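
Outside ggplot, the same polynomial fits can be sketched with lm() on simulated data (the coefficients 1, 2 and -0.3 and the noise level are assumptions for the example). poly(x, 2) uses orthogonal polynomial terms by default, so its coefficients differ from those of x and x^2 written out, but the fitted curve is the same.

```r
set.seed(7)
x <- runif(100, min = 0, max = 10)
y <- 1 + 2 * x - 0.3 * x^2 + rnorm(100, sd = 0.5)   # simulated quadratic relationship

fit_poly <- lm(y ~ poly(x, 2))    # orthogonal polynomial terms
fit_raw  <- lm(y ~ x + I(x^2))    # raw terms: y = b0 + b1*x + b2*x^2
fit_cub  <- lm(y ~ poly(x, 3))    # degree 3, as suggested in the exercise

all.equal(fitted(fit_poly), fitted(fit_raw))   # same fitted curve
```

The coefficients of fit_raw are the ones that can be read directly as b0, b1 and b2.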
Graphical analysis
Now using the crime dataset
library(haven)
crime <- read_dta(".../crime.dta")
Simple histogram + scatter plot to detect correlation between variables
hist(crime$murder)
plot(crime$pctmetro, crime$murder,
xlab="Percent of pop living in the city",
ylab="Crime per 1000s inhabitants")
text(crime$pctmetro, crime$murder, labels=crime$state, pos=2)

Graphical analysis
Plot excluding "dc":
plot(crime$pctmetro[crime$state!="dc"], crime$murder[crime$state!="dc"],
xlab="Percent of pop living in the city",
ylab="Crime per 1000s inhabitants")

Graphical analysis
Or using ggplot (excluding "dc"), and adding a fitted curve:
# the + must end each line, otherwise R treats the first line as a complete command
ggplot(data=crime[(crime$state!="dc"),], aes(y=murder, x=pctmetro)) +
  geom_point() +
  stat_smooth(method = "lm", formula = y ~ poly(x, 2))

Graphical analysis
Now using the Banks losses data

valbanks<-scan("banks.txt",
what=list(0,0,""), sep="", skip=1, comment.char="#")
valbanks
valj2007<-valbanks[[1]]
valj2009<-valbanks[[2]]
namebank<-valbanks[[3]]
percent_losses<-100*(valj2007-valj2009)/valj2007 # loss in % of the January 2007 value
percent_losses
abs_losses<-(valj2007-valj2009)
abs_losses
plot(abs_losses, percent_losses,
main="Absolute Losses vs. Relative Losses (in %)",
xlab="Absolute losses (in billions)",
ylab="Relative losses (in % of January 2007 value)",
col="blue", pch = 19, cex = 1, lty = "solid", lwd = 2)
text(abs_losses, percent_losses, labels=namebank, cex= 0.7, offset = 10)
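
The loss calculations and the labelled scatter plot can be tried without banks.txt on a few hypothetical values. The bank names and figures below are invented. Note that text() is called after plot(), because labels can only be added once the plot exists.

```r
# Hypothetical market values for three made-up banks
val2007 <- c(100, 80, 50)
val2009 <- c(40, 60, 10)
bank    <- c("BankA", "BankB", "BankC")

abs_loss <- val2007 - val2009          # absolute loss
pct_loss <- 100 * abs_loss / val2007   # loss in % of the 2007 value

plot(abs_loss, pct_loss, pch = 19, col = "blue",
     xlab = "Absolute loss", ylab = "Relative loss (%)")
text(abs_loss, pct_loss, labels = bank, pos = 4, cex = 0.7)   # labels to the right of each point
```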

Graphical analysis
Complex representation for categorical variables

library(vcd)
mosaic(...)

Graphical analysis
library(vcd)
isLarge <- as.numeric(HousePrices3$areasm >= 110)
mosaic(~ isRecent + isLarge + nbh4,
data = HousePrices3, shade=TRUE, legend=TRUE )

Extra resources for graphical analysis
Commented examples:
https://www.harding.edu/fmccown/r/
https://www.statmethods.net/graphs/line.html
https://en.wikibooks.org/wiki/R_Programming/Graphics
More ’impressive’ results:

https://www.r-graph-gallery.com
