
Department of Economics

ECONOMICS 481: Economics Research Paper and Seminar


Data Analysis (Program Learning Objective 3.1)
The approach taken will depend a lot on the topic and your creativity. Often the hardest part is
not doing empirical analysis, but explaining what it means. Writing up your results will take you
longer than you think.

1. Find data
What data set you use depends on your question.

There are a lot of Statistics Canada and other survey data sets available at ODESI, which has over 5,600 datasets searchable by broad subject, though the browse feature is generally better than the search function. The data interface system is likely to be revamped in the relatively near future. For international social science data, look into ICPSR, a repository that maintains over 250,000 datasets.

For macroeconomic data, CANSIM is a good source, accessible either through the University of
Toronto’s CHASS or StatCan directly, as is the Bank of Canada. For US data, there is FRED
(also sometimes an easier place to get Canadian data).

For Canadian public finance data, there is Finances of the Nation.

The OECD also has a lot of public finance and other data on its members, but the interface is
not that great.

If you want data on prices of commodities, you might want to use one of the Bloomberg
terminals in the PRISM Finance lab.

For cross-country data, the World Development Indicators are a good starting point, as are
IMF’s International Financial Statistics (https://www.imf.org/en/Data), the Penn-World Tables
(https://www.rug.nl/ggdc/productivity/pwt/), and The Global Health Observatory.

If you want extra help trying to find relevant data, the library can help – please contact Matthew
Rohweder (mrohweder@wlu.ca).

2. Importing/opening data
Most Statistics Canada microdata files can be downloaded directly into Stata format. In that
case, to open it, it’s just:

use "filename", clear


The biggest problem students run into here is making sure that they have the data saved in the
correct directory. It’s often helpful to tell Stata the directory you’re working in at the start. To see
what directory Stata is currently working from, type:

pwd

To change the working directory, you can use:

cd c:\users\myname\ec481\data

A trick here is that you don't have to use quotation marks if there are no spaces in your directory or file names, which makes coding neater. Good practice is to never use spaces in directory or file names.

Suppose you want to keep all your files for a project together in one folder, with data in one sub-directory and code and log files in another. You could use what's called a 'global' to make the main directory easy to refer to, then point at the sub-directories. E.g.:

global workdir "c:\users\myname\EC481"

use "$workdir\data\LFS_2000.dta", clear
log using "$workdir\code\EC481_output.log", replace

Many of the data sets online can be downloaded in Stata format. If not, you may have to build a
data set yourself. In that case, it’s probably easiest to do it in Excel first then import it into Stata:

import excel "filename", sheet(sheetname) firstrow

The firstrow option specifies that the first row in the spreadsheet is the variable names. You can
also copy and paste directly into the data browser in Stata (the bit that looks like a spreadsheet),
but that is less elegant. Remember to point Stata to the right directory to find the file.

3. Understanding your data set


A first thing you need to do when looking at your data is to make sure you understand:

1. The unit of observation


a. In a well-organized data set for statistical analysis, each row is a unit of
observation. For instance, in the public use microdata files of the Canadian
Census, which is cross-sectional data, each row is a single person. For
Canadian CPI data, which is time series data, an observation is a month.
b. Often these observations can be grouped – e.g. a person in the Census is part of
a household often with more than one person; that household lives in a Census
metropolitan area; and that CMA is in a province. We often write these types of
groups into the subscripts of our variables, so we might get E_ihcp as a variable equal to 1 if person i in household h in CMA c and province p is employed, and 0 otherwise.
c. Panel data is a combination of cross-sectional and time-series data. Each row is a different combination of individual and time. We write this something like E_it, where E = 1 if person i is employed in year t and zero


otherwise. You could have i = 100 and t = 2020; if E_100,2020 = 1, then person number 100 in the data set was employed in 2020.
2. Where your variables are
a. You want each variable to be in a single column
b. Typically the first couple of columns in your data set are used to uniquely define
the unit of observation.
i. For individual level data you would likely have an individual identification
number.
ii. For time series data, you might have the first variable being month and
the second year, so that together they identify a unique month in a
particular year.
iii. For panel data, you might have the first column being the individual
identifier and the second being the time period identifier.
c. Sometimes you might have panel data in what's called 'wide' form, like the schematic below. The first column identifies a country (so each row is a country), the second column is GDP in 2010, the third population in 2010, then GDP and population in 2011, and so on:

Country   GDP2010   Pop2010   GDP2011   Pop2011   ...

But to do a panel analysis, you want each row to be a single unit of observation, which means each row should be a unique combination of country and year. So you want a new column identifying the year, and then GDP in one column and population in another. To do that you're going to have to reshape the data from wide to 'long'. In this case (note that reshape expects the wide variables to be named GDP2010, GDP2011, Pop2010, and so on), you'd use:

reshape long GDP Pop, i(Country) j(Year)

To get one row per country-year:

Country   Year   GDP   Pop


d. In Stata, you can type ‘inspect varname’ and get some summary information on
the variable. ‘des’ and ‘sum’ also work, with a specific variable or just by
themselves for the whole data set (you might not want to do that if you have
hundreds of variables in your data set). You can also type ‘codebook’ to get a
different type of summary. But for any data set there will be more detailed
information on the exact definition of the variables. For example, with ODESI,
each data set has its own codebook. You should read these.
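
For example, a quick first pass over a new data set might look like this (the variable name age is just an illustration):

des             // variable names, types and labels
sum             // obs, mean, sd, min and max for each numeric variable
inspect age     // a mini-histogram plus counts of negative, zero and missing values
codebook age    // values, value labels and missing counts for one variable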

4. Putting different data sets together.


Sometimes you may need to add two or more data sets together. There are two different styles
of this:

Append: the Labour Force Survey microdata files have data on individuals’ employment
outcomes in rows, but each month is a different file. The variables (columns) are always the
same each month, but the individuals change (kind of – it’s a bit more complicated than this,
and talk to your professor if you want to use the LFS). So you can just put all the observations for, e.g., February 2022 under the data for January 2022. It is very easy. If you have to append multiple months, you probably want to use a loop. (If you actually do want to do this for the LFS, email Christine Neill: cneill@wlu.ca). But here's an example:

use $workdir\data\LFS_2000.dta, clear


forvalues i = 2001(1)2020 {
    append using $workdir\data\LFS_`i'.dta
    save $workdir\data\LFS_all.dta, replace
}

This opens the file LFS_2000.dta, then for the first value of i, which is 2001, runs the two commands inside the curly brackets – so it adds the data in LFS_2001, then saves the file – then goes to the next value of i, which is 2001+1 = 2002, and does the two commands inside the curly brackets, and so on until it gets to 2020.

Merge: suppose you have one data set that has NBA player statistics by each player’s name in
2020, and another data set that has the salary of each NBA player by each player’s name in
2021. You want to get these two variables together so you can run a regression. In this case
you want to use a merge command. Here, because each player appears in each data set only once, you would first sort each data set by player name and save it, and then use merge 1:1. Something like:

use $workdir\NBA_stats_2020.dta, clear
sort name
save $workdir\NBA_stats_2020.dta, replace
use $workdir\NBA_salary_2021.dta, clear
sort name
merge 1:1 name using $workdir\NBA_stats_2020.dta

You might have another data set that has each NBA player's statistics for each game in 2020. So an observation there would be a combination of player and date of game – panel data!


Each player has multiple observations. To merge that with the salary data, you'll have to use a 1:m ('one-to-many') type of merge. Type help merge in Stata to see the other forms of merge.

Make sure you check the _merge variable afterwards – it tells you whether the particular player name was in only the first data set (1), only the second data set (2), or in both (3). This can sometimes alert you to data problems. If you need to do multiple merges, you'll have to either rename or delete this variable in between.
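
For example, a minimal check after the 1:1 merge above might look like:

tab _merge             // 1 = in first data set only, 2 = in second only, 3 = matched
keep if _merge == 3    // if dropping unmatched players is appropriate for your analysis
drop _merge            // drop (or rename) before any later merge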

5. Save original data safely, and don’t overwrite; and use do files
It is really important that you keep the original datasets that you downloaded (or especially
created yourself!) somewhere safe. Perhaps create a ‘data’ folder for your project, and don’t
write anything else to that folder. Save interim data sets in a different folder. Using a do file that
keeps track of all the steps you do is also important to making your work replicable.

6. Understanding your data types


Before you can do anything with your data, you need to know what they are. Make sure you check any codebooks. Summarize the data to see if they fit the definitions and the logical restrictions on the numbers (e.g. if this is annual income data for Canada in 2020, are most people in the thousands of dollars? If it's age, are there any negative numbers? If it's meant to be a data set of school-aged children, is anyone listed as over 25?)

Remember back in EC285 you learned about numerical and categorical data? This is really
important!

Numerical data are really truly numbers, where the numbers mean something measurable. So
for instance age is a number and someone with age = 5 is 5 years younger than someone with
age = 10. It is measured in years. Temperature is a number that also means something,
although we have lots of different scales for measuring it. If you have temperature data you
need to know the scale / units of measurement. Same with height. If something is in dollars, it’s
numerical – but you need to know whether it’s in nominal terms (probably) or real. Often we
want to use real data in our analysis, not nominal, so you’re going to have to get a price index
and deflate any nominal data you have.
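
As a minimal sketch, assuming you have merged on a CPI series (indexed to 100 in your base year) by year, and the variable names are illustrative:

gen real_income = nominal_income * 100 / cpi    // deflate nominal income to base-year dollars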

Categorical data are not numbers. They are data that describe some characteristic of the
observation that cannot be measured numerically. So, for instance, eye colour. Or gender. Or
level of education. The variable might be stored as words – so you could have a variable called
eye_colour that has data in the cells like “blue” “brown” and “grey”. In Stata, these are called
‘string’ variables.

If eye_colour is a string variable and you type:

sum eye_colour

you will get output that says that there are no observations/values. This is because Stata
absolutely cannot provide a mean for something that is stored as letters not numbers.

But often it’s more efficient to store these data as numbers. And it is really important that you
understand that this is just how the data are stored – the numbers don’t actually mean anything.
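
One way to make the conversion is Stata's encode command; a minimal sketch (note that encode assigns the numeric codes in alphabetical order of the strings, which need not match the ordering described below):

encode eye_colour, gen(eye_colour_n)    // string categories become labelled numbers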


You could for instance have a variable called eye_colour_n that takes a value of 1 for anyone
with blue eyes, 2 for anyone with brown eyes, 3 for grey, 4 for green, 5 for other, and . for
unknown. In this case if you type

sum eye_colour_n

. sum eye_colour_n

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
eye_colour_n |        128      2.4375    .9027517          1          5

you will get back summary statistics, with a mean, a minimum of 1, a maximum of 5, etc. But again these summary statistics don't mean anything. A better approach to summarising categorical data is to use tabulate or the user-written command fre. If you type:

tab eye_colour_n

you will get a table back something like this:


. tab eye_colour_n

eye_colour_n |      Freq.     Percent        Cum.
-------------+-----------------------------------
        Blue |          5        3.91        3.91
       Brown |         91       71.09       75.00
        Grey |          5        3.91       78.91
       Green |         25       19.53       98.44
       Other |          2        1.56      100.00
-------------+-----------------------------------
       Total |        128      100.00

So this tells you that there are 5 people with blue eyes in the data set, and 128 people with
known eye colour.

Here’s the fre command (you have to install it first):


. ssc install fre
checking fre consistency and verifying not already installed...
installing into c:\ado\plus\...
installation complete.

. fre eye_colour_n

eye_colour_n
-----------------------------------------------------------------
                  |      Freq.    Percent      Valid        Cum.
------------------+----------------------------------------------
Valid   1 Blue    |          5       3.70       3.91        3.91
        2 Brown   |         91      67.41      71.09       75.00
        3 Grey    |          5       3.70       3.91       78.91
        4 Green   |         25      18.52      19.53       98.44
        5 Other   |          2       1.48       1.56      100.00
        Total     |        128      94.81     100.00
Missing .         |          7       5.19
Total             |        135     100.00


It tells you that there are 135 observations in the data set in total, with 7 missing values. It also
tells you that the “Blue” eye colour is stored as the number 1, 2 is “Brown” and so on. So it’s a
better option than tab, if you are less sure of what’s going on with your data.

The classic variable in Statistics Canada data that is categorical but stored as a number is
province. Province in the Census is stored as a number, with 10=Newfoundland, 11=PEI,
12=Nova Scotia, 13=New Brunswick, 24=Quebec, 35=Ontario, 46=Manitoba,
47=Saskatchewan, 48=Alberta and 59=BC. But the numbers don’t mean anything themselves.
They just represent the category. You can tell this because you can’t identify units, or talk
sensibly about the meaning of a one unit increase in Province.

You should never, ever just throw a categorical variable with more than two categories into a
regression. Or in fact do summary statistics on the numbers. The mean of such a variable
doesn’t mean anything. If you do the mean of Province, you might get something like 23.8.
This doesn’t mean anything. To summarize these sorts of variables you need to describe the
percentage of the sample that is in each category.

There is one exception to the rule that a mean of a categorical variable doesn’t mean anything:
and that’s the case of an indicator or dummy variable, which is a categorical variable with only
two categories, and which takes a value of 0 for one category and 1 for the other. The classic
example of this is gender – at least if we define that as a binary which is sadly the case in a lot
of our data sets. We might get gender = 1 for women and 0 for men. In this case, if we get a
mean(gender) = 0.4, that means our data set has 40% women. But imagine if we added a
‘nonbinary’ category with a value equal to 2. Then the mean of this new variable would not
mean anything.

There’s another type of categorical variable that you might want to think about – ordered
categories. For instance, a Likert scale may have a value of 5 meaning “very happy”, 4
meaning “happy”, 3 meaning “neither happy nor unhappy”, 2 “unhappy” and 1 “very unhappy”.
It is still not valid to take the mean of such a variable. And since regressions are really about
finding (conditional) means, you can’t use it like that in a regression. But the median does mean
something (why?).

To include a categorical variable in a regression, create a full set of dummy variables, one for
each category, and include them in the regression as fixed effects. To actually estimate each
coefficient, though, you will need to drop one of the categories, and have that be the base or
excluded category (why? Think about multicollinearity). We might in the case of provinces
create dummy variables for each province, but leave the Ontario dummy variable out of our
regression. This is called ‘including a full set of province fixed effects’.
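
In Stata, factor-variable notation will create the dummies and drop a base category for you; a minimal sketch (y and x are illustrative variable names, and the province codes are those listed above):

reg y x i.prov       // full set of province fixed effects, base category chosen automatically
reg y x ib35.prov    // make Ontario (code 35) the omitted base category explicitly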

Interval data are data that were originally numeric, but that have been grouped into intervals and so are now categorical. For example, you might have an "income range" variable, where 1 = under $10,000, 2 = $10,000 to under $20,000, etc. You can do a bit more with interval data than you can with regular categorical data – in particular, you can calculate an approximate mean, depending on how narrow the interval ranges are. But you still can't use the numbers for the categories as if they are true numerical data.
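
One common approximation is to replace each category with its interval midpoint; a minimal sketch, assuming the income_range coding above:

recode income_range (1 = 5000) (2 = 15000), gen(income_mid)    // one rule per interval; midpoints assumed
sum income_mid    // approximate mean income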


Equally, you need to think about what sort of variable can be your dependent variable and how
that might change what type of model you estimate. Any time you have a variable for which the
mean actually means something, you can use a regular regression model. So that means if you
have an indicator variable, you can use that as the dependent (y) variable in a regular
regression – that is called a linear probability model. (It turns out there are some cases where that is not a good idea.) You can always use a probit or logit model instead, which takes account of the fact that the error terms cannot be normally distributed if your dependent variable can only take values of zero or one.

For any other type of categorical variable, the mean doesn’t mean anything, so you can’t use it
as a dependent variable in a regular regression. For ordered categorical data you can estimate
an ordered probit or logit model. For regular categorical data you can estimate a multinomial
logit or probit model. But at this stage it’s probably easier just to transform it into a bunch of
indicator variables and estimate a standard linear probability model with that.
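
A minimal sketch of the corresponding Stata commands (all variable names are illustrative):

reg employed x1 x2, r       // linear probability model for a 0/1 outcome
probit employed x1 x2       // probit for a 0/1 outcome
logit employed x1 x2        // logit for a 0/1 outcome
oprobit happiness x1 x2     // ordered categorical outcome
mlogit occupation x1 x2     // unordered categorical outcome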

Generally speaking, if you can’t clearly explain what each variable you’re planning to use is,
how it is calculated (or derived), what values it can logically take, etc., then there’s no point
continuing. Not doing this step properly is the most common way of having the ‘Garbage
In/Garbage Out’ problem.

7. Cleaning data
Once you think you have all the data together, and safely saved somewhere, you need to think
about what it all means. Often there will be errors in the data you have to fix. Other times you
just want to create new versions of particular variables – taking a log, or a lag, or calculating
percentage change, or changing the categories.

Probably the first thing to do is to just manually look at what you have using one or more of the
following commands:

• des
• sum
• codebook, compact
• browse

After that, what you do depends on what you’ve got and what you want to do. Make sure you
know what your data currently are, and where you want to end up.

This is a good guide to data cleaning with Stata from the University of Toronto:
https://mdl.library.utoronto.ca/technology/tutorials/cleaning-data-stata

You’ve learned all the key commands before, starting in EC285, but here’s a list:

• duplicates
• drop if X>=1 & (Y==1|Y==2)
• drop if missing(AGE)
• keep if year!=1999
• recode varlist (old value(s)=new value)


• destring – if you have a variable that is really numeric that is stored as a string variable
• gen
• egen
• label
• rename
• correlate
• mvencode/mvdecode – you REALLY need to know if your data have missing values, and especially if those missing values are stored with a numeric missing value code, because that can be really bad (see the sketch after this list). Often census income data are top coded (i.e. incomes over a certain value aren't reported, but instead given some maximum value) – that's also important to know.
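
For example, a minimal sketch, assuming missing incomes are stored with the numeric code 9999 in the raw data:

mvdecode income, mv(9999)    // convert the numeric missing-value code to Stata's missing (.)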

Dealing with outliers – use sum, histograms and scatterplots to make sure your data don’t
contain any nonsensical numbers. (Note, though, that while negative age is nonsensical,
negative income is not; think carefully about your data!)

Always make sure that your logical statements did what you wanted them to do. Suppose you
want to keep people who are immigrants and people who are 50 years of age or more. Then
you’d want to use:

keep if immigrant==1 | age>= 50

not

keep if immigrant==1 & age>= 50

The latter keeps only people who are immigrants AND aged 50 or more, which is not what the original statement said. Unfortunately, the English "and" is a bit imprecise – the logic is precise. So don't just automatically translate the English "and" into the logical "&".

8. Working with Time Series Data


You will want to make sure that you tell your statistical program that this is time series data. To
do this, use tsset in Stata. Similarly, if you have panel data, you might want to use xtset to
let Stata know which variable/s identify individuals and which the time period.

You also want to make sure you’ve got your date format right. There are a few different ways
Stata handles dates. Read up on it for your case.
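
A minimal sketch for monthly data, assuming separate year and month variables (names illustrative):

gen mdate = ym(year, month)    // Stata monthly date
format mdate %tm               // display as, e.g., 2020m3
tsset mdate                    // declare time series data
xtset id mdate                 // for panel data: individual identifier, then time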

9. Visualizing Data
While graphs of your data are not strictly required, they are often extremely helpful in setting the
scene, helping you to clean your data and to determine an appropriate empirical approach.

Types of graphs:

1. Time series graphs – with time on the horizontal axis and the value of the variable on the vertical. This is very easy to do in Stata if you have already tsset your data: tsline varname.


2. Box plots – a popular way of visualizing the distribution of a particular variable, but really
these give you little more than 5 numbers – median, 75th percentile, 25th percentile,
minimum and maximum. They may be helpful for outliers, though.
3. Histograms and kernel densities – these are two different ways of seeing how your variable is distributed overall. They are different ways of representing the probability density function. A kernel density is basically a smoothed histogram. (See the sketch after this list for the commands.)
o These are helpful because sometimes they will show you obvious problems with
your data. For instance, you might have data that are top coded – e.g. in the
Census, incomes over a certain value aren’t reported, but are instead given
some maximum value. If you do a density plot, you’ll be alerted to problems like
that really quickly. It’s hard to see them just from a table of summary statistics.
4. Scatterplots – these are graphs of key dependent and explanatory variables.
o A scatterplot tells you about the correlation between variables of interest.
twoway scatter y x
o To add a line of best fit:
twoway (scatter y x) (lfit y x)
o A scatterplot might also suggest a functional form for your model. Perhaps the relationship doesn't look linear – might a quadratic be appropriate? If so, you want to add a squared term for the explanatory variable, so first create that and then include both in your regression:
gen x2 = x^2
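
Putting these together, a minimal sketch of the commands for each graph type above (variable names illustrative; tsline requires the data to be tsset first):

tsline y                           // time series plot
graph box y                        // box plot
histogram y                        // histogram
kdensity y                         // kernel density
twoway (scatter y x) (lfit y x)    // scatterplot with line of best fit
gen x2 = x^2
reg y x x2                         // quadratic specification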

10. Samples and Sample Weights


We almost never get a dataset that contains the income of each individual (or family) included in
the entire population we are interested in. Most data comes from surveys of a sample of the
population. For various reasons, these samples are partly randomized, and partly stratified.

Randomized means that every person in the population has a chance of being selected for the survey, but not everyone is surveyed. Thus, each person has some probability, call it p_i, of being included in the survey.

The simplest possible sampling scheme is what is called a simple random sample. In that case,
each person in the population has an equal chance of being selected into the sample. In that
case, we can take the simple unweighted mean of the sample, and that will be an unbiased
estimator of the population mean.

However, most of the surveys that are run are not simple random samples – they are usually
either stratified or clustered, or some combination of the two.

Figuring out what probability of selection to have for any given individual is the task of Statistics Canada (or whoever is running the survey). They balance the increased accuracy of a bigger sample – measured in terms of the expected variance of the sample estimate around the actual population average – against the cost of surveying more people.

But there are other considerations too. For instance, suppose Statistics Canada wants to run a
survey – the Labour Force Survey – to be able to estimate with a reasonable degree of
accuracy the unemployment rate in Ontario and in PEI. As we know, the accuracy of a survey
is mostly related to the number of individuals sampled. To get an equally reliable figure, they


will probably want to survey a larger number of people in Ontario, but a smaller percentage of
the Ontarian population. And indeed, this is exactly what they do. This is called stratifying the
sample - you split the population up into different groups (based on province, in this case), and
for each stratum, you assign a different probability of selecting into the sample. Here, Ontarians
have a smaller probability of being selected into the sample than do PE Islanders.

What this means is that if you want to get a good estimate of, say, the average income in Canada, you have to take these different sampling probabilities into account – otherwise, you will get an average that includes too many PE Islanders and not enough Ontarians to be representative of the population as a whole. Statistics Canada helps with this by including in the survey data set what are called population weights. These population weights tell us the inverse of the probability of being selected into the sample. Suppose that Ontarians had a 1% chance of being surveyed and PE Islanders had a 20% chance of being surveyed. The sample weights for all Ontarians would be 100, while the sample weights for all PE Islanders would be 5. These weights, then, represent the number of households in the population which are being represented by the particular household being sampled. Each Ontarian family in the sample represents itself and 99 other Ontarian families. Each PE Islander family surveyed represents itself and 4 other families.

In reality, Statistics Canada stratifies on a number of characteristics other than province of residence when designing its samples, so the weights are a bit more complicated in practice. But the idea is the same.

Given these weights, it's easy to calculate a weighted average. The formula for a weighted average is very straightforward: multiply each value y_i by its associated weight w_i, sum over all I observations, and divide by the sum of the weights:

$$\bar{y} = \frac{\sum_{i=1}^{I} w_i y_i}{\sum_{i=1}^{I} w_i}$$

In Stata, you can identify the data as survey data with appropriate weights using the svy
commands.
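
A minimal sketch, assuming the weight variable is called PUMFWT as in the example below:

svyset [pweight = PUMFWT]    // declare the survey weights
svy: mean income             // weighted mean (income is an illustrative variable name)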

When running regressions, if the goal is to get an estimate of a mean, then we’d want to use the
survey weights. You can do this using the svy commands, or you can use a normal regress
command with the aweight option:

reg y x [aweight=PUMFWT]

11. Standard errors: groups, clustering, time series


You should always use robust standard errors, to deal with the basic types of heteroskedasticity
that are always present. This affects the standard errors, not the point estimates. If there is no
important heteroskedasticity, this won’t change much:

reg y x [aweight=PUMFWT], r


There are cases where you likely want to take account of groups in your data. For instance,
everyone in a particular province may be hit by a particular labour market shock in 1999 while
people in other provinces were not. You might want to cluster by a combination of (province x
year). Other times, you might want to take account of the idea that everyone in a province is
similar and cluster at the province level. Doing this involves using the , cluster(group) option
after a regress command. How exactly you do this is going to depend on the particular data you
have and what you are trying to identify – so again, think about how your data are structured (or
in other terminology what the data generating process is).

reg y x [aweight=PUMFWT], cluster(prov)

Serial correlation is a similar problem – observations that are close to each other in time are
likely to have correlated errors. You can deal with this using time series models.
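
One standard fix, assuming the data have been tsset, is Newey-West standard errors; the lag length here is purely illustrative:

newey y x, lag(4)    // HAC standard errors allowing serial correlation up to 4 periods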


12. Key Stata Commands


Create a new numeric variable:
    gen a = 1

Create a new string variable:
    gen s = "s"

Create a blank numeric variable:
    gen a = .

Create a dummy variable:
    gen agegt50 = age > 50

Create a series of dummy variables based on a single categorical variable:
    tabulate x, gen(cat_x)

Rename a variable:
    rename var2 GDPpc

Replace values:
    replace a = 5 if age >= 50
    replace a = 2 if age == 20
    replace a = 3 if age >= 21 & age < 31
    replace a = 4 if inrange(age, 31, 50)

Replace based on a string variable:
    replace a = 2 if inlist(eye_colour, "blue", "green")

Make a numeric variable from a string variable:
    split make2, parse(" ")
    tab make21
    encode make21, gen(Make_numeric)

See values of categorical variables without labels:
    tab catvar
    tab catvar, nolabel

Destring:
    destring stringvar, replace force
    // be careful to check there aren't any non-numeric values, or if there are,
    // that it's OK for them to be missing after the destring
    destring stringvar, gen(destrung)

Recode:
    recode Sex (1=0 "Men") (2=1 "Women") (9=.)

Rescale:
    sum price
    gen pindex = price/r(mean)*100
    gen p_z = (price - r(mean))/r(sd)
    egen p_z = std(price)
    gen ln_p = ln(price)

Summarizing:
    codebook, compact
    labelbook varname
    ssc install fre, replace
    fre varname

Install packages:
    ssc install packagename

Append, merge, reshape and sort (worked examples earlier in this guide):
    append using filename
    merge 1:1 name using filename
    reshape long GDP Pop, i(Country) j(Year)
    sort name

Dropping, keeping and recoding (see section 7):
    duplicates
    drop if X>=1 & (Y==1|Y==2)
    drop if missing(AGE)
    keep if year != 1999
    recode varlist (old value(s)=new value)

Correlations:
    correlate

Label a variable's values, and label the variable itself:
    label values varname labelname
    label variable varname "description"


13. Some other resources:


Kane, John V. (2023) "Cleaning Variables in Stata: Replacing, Recoding, and Rescaling." Online notes: https://drive.google.com/file/d/1bBVAHJMDSpDOPiNs_o8noRw8jtIIMTUg/view

• a guide to how to do a lot of this in Stata, using the auto.dta data set.

Data cleaning with Stata from the University of Toronto:


https://mdl.library.utoronto.ca/technology/tutorials/cleaning-data-stata

Creating a log file in Stata:


https://sociology.fas.harvard.edu/files/sociology/files/creating_a_log_file.pdf

• If you don't specify the file extension as .log, Stata will save the log file in .smcl format. A .smcl file can only be opened in Stata, but usually the whole point of a log file is to be able to read it without using Stata. So you almost certainly want to save it as a .log file. Just always do that.
• You can save a .smcl log file as a pdf using the following in Stata:
o ssc install smcl2do
o translate logfilename.smcl pdflog.pdf, translator(smcl2pdf)

More on Stata: https://www.johanosterberg.com/getting-started-with-stata/

© C. Neill, January 2023
