Norhisam Bulot
Faculty of Business and Management
Universiti Teknologi MARA Perlis
PEDAGOGY
To achieve the above objectives, this half-day (or one-day) session will focus on hands-on training in the
use of Stata. It assumes some basic knowledge of the research process and statistics.
1. Types of data
1.1 Cross-Sectional Data
● A sample of individuals, households, firms, countries, cities, or any other type of unit, observed at a specific point in time
● Denoted by the subscript i
● Example: Data on share price (N = 100 companies)
● Example: Determinants of students’ academic performance (2019)
● Model: Yi = β0 + β1X1i + β2X2i + β3X3i + β4X4i + β5X5i + εi
1.3 Panel Data
1.3.1 Balanced
● Each unit i is observed for the same number of time periods t
COMPANY YEAR Y X1 X2 X3 X4
ABRAR CORPORATION BERHAD 5 307.56 103.2 25.43 0.56 74.22
ABRAR CORPORATION BERHAD 4 170.42 -4.83 88.03 0.55 -32.25
ABRAR CORPORATION BERHAD 3 120.56 -111.5 -57.81 -0.52 -292.7
ABRAR CORPORATION BERHAD 2 102.73 -151.3 70.97 -1.4 -541.9
ABRAR CORPORATION BERHAD 1 470 96 111.07 -2.93 3.78
ACTACORP HOLDINGS BERHAD 5 672.69 98.49 9 0.6 54.27
ACTACORP HOLDINGS BERHAD 4 511.85 78.19 45.41 0.72 24.53
ACTACORP HOLDINGS BERHAD 3 67.64 -134.5 104.8 -0.29 -448.3
ACTACORP HOLDINGS BERHAD 2 42.94 -172.9 -39.43 -1.07 -758.5
ACTACORP HOLDINGS BERHAD 1 32.42 -196.5 8.21 -1.59 -900.2
AKN TECHNOLOGY BHD 5 396.55 176.07 -31.53 0.87 108.76
AKN TECHNOLOGY BHD 4 626.4 265.66 -47.19 0.82 89.73
AKN TECHNOLOGY BHD 3 465.34 217.29 14.97 0.82 106.69
AKN TECHNOLOGY BHD 2 494.34 109.31 88.88 0.79 89.94
AKN TECHNOLOGY BHD 1 495.98 111.16 -3.53 0.81 65.99
1.3.2 Unbalanced
● Each unit i may be observed for a different number of time periods t
COMPANY YEAR Y X1 X2 X3 X4
ABRAR CORPORATION BERHAD 4 307.56 103.2 25.43 0.56 74.22
ABRAR CORPORATION BERHAD 3 120.56 -111.5 -57.81 -0.52 -292.7
ABRAR CORPORATION BERHAD 2 102.73 -151.3 70.97 -1.4 -541.9
ABRAR CORPORATION BERHAD 1 470 96 111.07 -2.93 3.78
ACTACORP HOLDINGS BERHAD 3 67.64 -134.5 104.8 -0.29 -448.3
ACTACORP HOLDINGS BERHAD 2 42.94 -172.9 -39.43 -1.07 -758.5
ACTACORP HOLDINGS BERHAD 1 32.42 -196.5 8.21 -1.59 -900.2
AKN TECHNOLOGY BHD 5 396.55 176.07 -31.53 0.87 108.76
AKN TECHNOLOGY BHD 4 626.4 265.66 -47.19 0.82 89.73
AKN TECHNOLOGY BHD 3 465.34 217.29 14.97 0.82 106.69
AKN TECHNOLOGY BHD 2 494.34 109.31 88.88 0.79 89.94
AKN TECHNOLOGY BHD 1 495.98 111.16 -3.53 0.81 65.99
Research Topic?
2. Data collection & preparation: Important steps
a. Determine the variables to be included in your study:
● Dependent variable
● Independent variables
● Moderating variables
● Intervening / mediating variables
b. Determine the “most appropriate” proxy / formula for each variable.
c. Determine the sources of data.
● Osiris
● Datastream
● Bank Negara Malaysia (BNM)
● Worldbank
● Bankscope
● Department of Statistics Malaysia
● Etc...
d. Collecting the data
e. Checking your data (completeness, consistency and accuracy).
f. Data transformation (if needed … please refer to section 4.4)
3. Data analysis steps: Short Panel (Micro panel)

Static model:
Variable selection (vselect)
Step S1: Panel specification tests (POLS vs. FE vs. RE)
Step S2: Diagnostic tests (static model): multicollinearity, heteroskedasticity, serial correlation, normality*
Step S3: Strategy to rectify the problem(s)
Step S4: Report the findings

Dynamic model:
Variable selection (vselect)
Step D1: DGMM vs. SGMM
Step D2: Diagnostic tests (dynamic model): autocorrelation, Sargan test, number of instruments, normality*
Step D3: Strategy to rectify the problem(s)
Step D4: Report the findings
4. Introduction to Stata (Stata 14)
4.1 How to open Stata: run StataMP.exe
[Screenshot of the Stata interface: the Review window displays the list of commands that you have already entered; the Variables window displays the list of variables; the Results window displays each entered command, followed by its corresponding output/results.]
4.1.2.2 Using an existing log file*
If you want to add more Stata output to an EXISTING log file, use the “append” option.
Stata command: log using yourfilename.log, append
4.1.3.3 Importing data from Excel
Stata expects a single matrix or table of data from a single sheet, with at most one line of text at the start
defining the contents of the columns. Using your Windows computer:
● Start Excel.
● Open the dataset file (example: Data).
● Highlight the data of interest, then pull down Edit and choose Copy (Ctrl+C).
● Click “Data Editor” or type “edit” in the Command window.
● Select the first column, paste the data & choose “Variable names”.
● Save the data.
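As an alternative to copy-and-paste, the workbook can be read directly with Stata’s built-in import excel command (a sketch; the file name, sheet name and the assumption that variable names sit in the first row are illustrative, so adjust them to your file):

Stata commands:
import excel using "Data.xlsx", sheet("Sheet1") firstrow clear
describe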
● Use a Stata command to transform the string variable into a numeric one: encode company, gen(ccode)
4.4 Data transformation (Optional)
In data analysis, data transformation is the replacement of a variable by a function of that variable: for
example, replacing a variable “x” by the square root of “x” or the logarithm of “x”. In a stronger sense, a
transformation is a replacement that changes the shape of a distribution or relationship.
You should consider these things before transforming your variable:
● Reasons for using transformations
● Transformations for proportions and percents
● Transformations for variables that are both positive and negative
Common transformations (replace “varname” with your variable):
● Standardize
● Log (please be aware of the conditions for transforming your variable into log)
● First difference
● Example: to transform lev, prof, io and intang into logs, apply the log transformation to each of the variables in turn.
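These transformations can be sketched with the following commands (a sketch, not the only way; the generated variable names are assumptions, and the first-difference operator requires the data to be xtset or tsset first):

Stata commands:
* Log (only valid for strictly positive variables)
generate lvarname = ln(varname)
* Standardize (mean 0, standard deviation 1)
egen zvarname = std(varname)
* First difference (requires xtset / tsset data)
generate dvarname = D.varname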
Further readings:
a) Log transformation and its implications for data analysis
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4120293/
4.5 Declare data set to be panel data
The next step in working with panel data is to declare the panel structure of your data. Once your
dataset is declared as panel data, you can take advantage of the Stata tools for working with
panel data, and for estimating equation specifications that use the panel structure.
Stata command: xtset ccode year
5. Data analysis
5.1 STEP 1: Descriptive analysis
Objective: to describe your data / getting a feel for your data
Stata command (here and elsewhere, “dviv” is shorthand for your dependent variable followed by the independent variables):
sum dviv
xtsum dviv
Notes for xtsum
Stata lists three different types of statistics: “overall”, “between”, and “within”.
(1) Overall statistics are ordinary statistics based on all of your observations & are the same as those in the sum output above.
(2) Between statistics are calculated from the summary statistics of the entities (e.g., companies or countries), regardless of time.
(3) Within statistics are calculated from the summary statistics of the time periods, regardless of entities.
“… The summary statistics of the dependent and independent variables over the sample period are
presented in Table 1…”
5.2 STEP 2: Variable selection
Stata command:
vselect dviv, best
Decision rule: Choose the model with the lowest AICC (Yang, 2005).
“… As shown in Table 2, the optimal number of variables is xxx. The chosen variables are xxx…”
Table 2: Variable Selection
Models R2ADJ C AIC AICC BIC # IVs Optimal Model
Model 1
Model 2
5.3 STEP 3: Choosing the most appropriate panel data model (optional)
Options:
a) Consider both Static and Dynamic
b) Choose Static (please proceed to section 5.3.2)
c) Choose Dynamic (please proceed to section 5.3.3)
Decision rule: p-value of the lagged dependent variable < 0.05 = dynamic; > 0.05 = static (Brañas-Garza et al., 2011)*
*For the static model, please proceed to section 5.3.2*
*For the dynamic model, please proceed to section 5.3.3*
5.3.2 Static Model
Tests:
F-test, Breusch and Pagan Lagrangian multiplier test (BP-LM test) & Hausman test
i. F-test: To choose between POLS and FE
ii. BP-LM Test: To choose between POLS and RE
iii. Hausman Test: To choose between FE & RE
Three options:
i. Option 1: Perform all tests: F-test, BP-LM test & Hausman test (Park, 2011)
ii. Option 2: Perform the Hausman test only
iii. Option 3: The choice of technique (between FE & RE) is determined by whether you think there
are unobserved, individual-specific (and time-invariant) factors that influence your outcome and
are correlated with the X variable(s) of interest. If there are, use FE; if not, use RE. (Please refer
to the basic assumptions for POLS, FE & RE below.)
Assumptions:
i. POLS: Basic assumption: Pooled Ordinary Least Squares (POLS) is a pooled linear
regression without fixed and/or random effects. It assumes that the SLOPES AND INTERCEPTS
OF FIRMS ARE CONSTANT across individuals and time. In general, OLS ignores the
individual and time effects.
ii. RE: Basic assumption: there is no omitted variable (OV), or the omitted variable is UNCORRELATED with
the explanatory variables (IVs) in the model.
iii. FE: Basic assumption: there is an omitted variable, and it is CORRELATED with the
explanatory variables (IVs). FE provides a means of controlling for the effect of the OV; that is,
whatever effect the OV has on the subject at one time, it will have the same effect at a later
time (fixed effect).
“… The result of the panel specification tests as presented in Table 3 suggests that xxxx model is the
most appropriate data analysis technique…”
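The S1 tests can be run as the following sequence (a sketch; replace dviv with your dependent and independent variables):

Stata commands:
* S1(a) F-test (POLS vs. FE): reported at the bottom of the FE output as "F test that all u_i=0"
xtreg dviv, fe
estimates store fixed
* S1(b) BP-LM test (POLS vs. RE)
quietly xtreg dviv, re
xttest0
* S1(c) Hausman test (FE vs. RE), comparing the stored FE estimates with the current RE estimates
hausman fixed, sigmamore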
5.3.2.1.2 S1(b): BP-LM test
● Stata command:
quietly xtreg dviv, re
xttest0
● Objective: Choosing between POLS and RE
● Decision rule: p-value > 0.05 = POLS; p-value < 0.05 = RE
● Stata command (the FE estimates must first be stored as “fixed”, with the RE model estimated last):
quietly xtreg dviv, fe
estimates store fixed
quietly xtreg dviv, re
hausman fixed, sigmamore
● Objective: Choosing between FE and RE
● Decision rule: p-value < 0.05 = FE; p-value > 0.05 = RE
Outcome 3: F-test p > 0.05 (POLS); BP-LM test p < 0.05 (RE); Hausman test not required; conclusion: RE
5.3.2.2 S2: Diagnostic tests (static model)
Objective: to investigate the existence of severe multicollinearity, heteroskedasticity & serial correlation problems.
Importance of diagnostic tests: without verifying that your data meet the assumptions of linear
regression, your results may be misleading.
“… As presented in Table 4, the diagnostic tests on the baseline model indicate the presence of xxx
problem(s). Following the suggestion by Hoechle (2007), the remedial procedure has been carried
out by using the xxxxx…”
5.3.2.2.1 Multicollinearity
Multicollinearity exists when two or more of the predictors in a regression model are moderately or
highly correlated.
Potential solutions to the problem: (1) do nothing; (2) remove some of the highly correlated independent
variables; (3) linearly combine the independent variables, for example by adding them together; (4) perform an
analysis designed for highly correlated variables, such as principal components analysis or partial least
squares regression; or (5) another relevant strategy.
(http://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/)
Techniques: (1) Variance Inflation Factors (VIF) & (2) Pearson Correlation.
5.3.2.2.1.1 Variance Inflation Factors (VIF)
Stata command:
quietly regress dviv
vif
Decision rule: VIF > 10 = severe multicollinearity problem; VIF < 10 = no multicollinearity problem.
The cut-off point of 10 is the one most commonly used by practitioners (O’Brien, 2007).
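The second technique listed above, the Pearson correlation, can be obtained with the built-in pwcorr command (iv1 iv2 iv3 below are placeholders for your independent variables):

Stata command:
pwcorr iv1 iv2 iv3, sig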
Decision rule (Pearson correlation): coefficient > 0.90 = very high correlation, i.e. a severe multicollinearity
problem exists; coefficient < 0.90 = no multicollinearity problem.
The cut-off point of 0.90 is suggested and used by many researchers (e.g., Hinkle et al., 2003;
Mukaka, 2012).
5.3.2.2.2 Homoscedasticity
The variance of the error term is constant. The assumption of homoscedasticity fails when the variance
changes (variance is not constant).
“…homoscedasticity of the regression error term: its variance is assumed to be constant in the
population, conditional on the explanatory variables. The assumption of homoscedasticity fails when the
variance changes in different segments of the population: for instance, if the variance of the unobserved
factors influencing individuals' saving increases with their level of income. In such a case, we say that the
error process is heteroskedastic...” (Wooldridge, 2015)
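For a fixed-effects panel model, one common test of this assumption is the Modified Wald test for groupwise heteroskedasticity, available through the user-written xttest3 command (an assumption: any other suitable test may be used instead; install it with "ssc install xttest3"):

Stata commands:
quietly xtreg dviv, fe
xttest3

Decision rule: p-value < 0.05 = heteroskedasticity problem; p-value > 0.05 = no heteroskedasticity problem.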
Serial correlation: the relationship between a given variable and itself over various time intervals, i.e. the
error in one period influences the error in a subsequent period – the next period or beyond.
Objective: To check the existence of serial correlation or auto correlation in the residuals.
*Decision rule: p-value > 0.05 = No serial correlation problem; p-value < 0.05 = serial correlation
problem exist*
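The serial correlation check above can be carried out with, for example, the Wooldridge test via the user-written xtserial command (an assumption: other tests are possible; install it with "ssc install xtserial"; dv iv1 iv2 are placeholders for your variables):

Stata command:
xtserial dv iv1 iv2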
5.3.2.2.3 Normality*
Two options available: (1) test the assumption & (2) skip the test.
Option (1): Test the assumption using the Stata command xtsktest or any other “suitable” test(s).
Decision Rule:
p-value < 0.05 = we can REJECT the data is normally distributed (your data is not normally
distributed).
p-value > 0.05 = we cannot reject the hypothesis that the data is normally distributed (Your data
is normally distributed).
Option (2): Skip the test (normality of the residuals will not be tested).
The residual is assumed to be normally distributed (minimum observations required: 30 or 100). Kindly
read the following explanation on the “Central Limit Theorem”.
Central Limit Theorem.
Central limit theorem (CLT) is used to justify the action of not testing the normality assumptions
a) The CLT basically states that as your sample size increases, the distribution of the sample mean of a
parameter becomes normally distributed, regardless of the underlying distribution of data. (This is a
bit of an oversimplification, but I believe it's suitable for this discussion).
“…Therefore, it is reasonable to “assume” that if your sample is 30 or greater, your mean has a
normal distribution with sample variance equal to population variance divided by sample size
(sigma^2/n)…”
“…Please note that this is still an “assumption”. It is not "ensured", and it is not testable in any
specific instance without repeatedly sampling a population. But while it is an assumption, it is a very
reasonable assumption based on previous statistical research…”
“…Other misinterpretations, misapplications, or misuses of the central limit theorem (specifically here,
or generally elsewhere) may be the product of ignorance or sloth on the part of the researcher, BUT
assuming that the mean of a sample with n ≥ 30 has an approximately normal distribution is not the product
of ignorance or sloth…”
Possible remedies.
a) Remove the outliers (extreme values)
b) Transform the data (example: log transformation)
c) Other strategy.
You may remove / exclude the outliers from your analysis by using the following Stata commands.
Step 1: Identify the outliers (the conventional cut-off for Cook’s distance is 4/N, where N is the number of observations; the condition “d1 < .” prevents missing values from being flagged):
regress dviv
predict d1, cooksd
quietly generate cutoff = d1 > 4/_N & d1 < .
list ccode d1 if cutoff
Step 2: Re-estimate the model without the outliers.
*Original command: xtreg dviv, fe
*Without outliers: xtreg dviv if cutoff~=1, fe
Decision rules for selecting the most appropriate strategy to rectify the problem(s), following the
suggestion by Hoechle (2007):
[Table excerpt — row f: RE* with a serial correlation problem: Random-effects GLS regression with the cluster option — xtreg dviv, re cluster(ccode)]
Explanation:
"Economic theory rarely gives any reason to believe that the errors are homoscedastic. It is therefore
prudent to assume that the errors might be heteroskedastic unless you have compelling reasons to believe
otherwise. (...) If the homoscedasticity-only and heteroskedasticity-robust standard errors are the same,
nothing is lost by using the heteroskedasticity-robust standard errors; if they differ, however, then you
should use the more reliable ones that allow for heteroskedasticity. The simplest thing, then, is always to
use the heteroskedasticity-robust standard errors…” (Stock & Watson, 2007)
5.3.3 DYNAMIC MODEL
The Difference and System GMM estimators are designed for panel analysis and embody the following
assumptions about the data-generating process (Baltagi & Giles, 1998):
● The process may be dynamic, with current realizations of the dependent variable influenced by
past ones
● There may be arbitrarily distributed fixed individual effects. This argues against cross-section
regressions, which must essentially assume fixed effects away, and in favor of a panel setup,
where variation over time can be used to identify parameters.
● Some regressors may be endogenous.
● The idiosyncratic disturbances (those apart from the fixed effects) may have individual-specific
patterns of heteroskedasticity and serial correlation.
● The idiosyncratic disturbances are uncorrelated across individuals.
Also, some secondary concerns shape the design:
● Some regressors may be predetermined but not strictly exogenous: although independent of current
disturbances, they can be influenced by past ones. The lagged dependent variable is an example.
● The number of time periods of available data, T, may be small. (The panel is “small T, large N”.)
“…Another important step in the data analysis process is to choose the most appropriate dynamic
panel data estimator. The two available estimators are difference GMM and System GMM…”
“…The dynamic structure of the model makes the OLS estimator upward biased and inconsistent,
since the lagged dependent variable is correlated with the error term. The within transformation does not
solve the problem because of a downward bias (Nickell, 1981). A possible solution is represented by the
Generalized Method of Moments (GMM) technique (Presbitero, 2008). Dynamic panel data models are
useful when the dependent variable depends on its own past realizations (Brañas-Garza et al., 2011)…”
“…Although the General Method of Moments (GMM) is the method of estimation of dynamic
panel models that provides consistent estimates, one still has to decide whether to use Difference GMM
(DGMM) or System GMM (SGMM) (Efendic, Pugh, & Adnett, 2009)…”
“…The SGMM estimator has an advantage over DGMM for variables that are random walks or
close to random walks (Efendic et al., 2009). One way to identify this is by looking at the
coefficient of the lagged dependent variable and the p-value of the Sargan test. The closer the value to one
(more than 0.9), the stronger the indication that the SGMM estimator is needed, so SGMM is preferable to
DGMM. The DGMM was found to be inefficient since it did not make use of all available moment
conditions (Ahn & Schmidt, 1995). In addition, it also had very poor finite-sample properties in
dynamic panel data models with highly autoregressive series and a small number of time-series
observations. Due to these limitations, following the suggestion by Arellano and Bover (1995), Blundell
and Bond (1998) proposed taking into consideration extra moment conditions from the level equations that
rely on certain restrictions on the initial observations. The resulting SGMM has been shown to perform
much better than the DGMM estimator in terms of finite-sample bias and mean squared error, as well as with
regard to coefficient standard errors, since the instruments used for the level equation remain
informative as the autoregressive coefficient approaches unity (see Blundell and Bond, 1998). As a
result, the SGMM estimator has been widely used for estimation (Sun and Ashley, 2014)…”
“…The choice of an appropriate estimator is very important, as the dynamic structure of the
model will make the static estimator upward biased and inconsistent, since the lagged level of the dependent
variable is correlated with the error term. The within transformation does not solve the problem because
of a downward bias (Nickell, 1981) and inconsistency (Presbitero, 2008). A possible solution is
represented by the Generalized Method of Moments (GMM) technique (Presbitero, 2008)…”
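In Stata, the two estimators can be sketched with the official commands below (dv, iv1 and iv2 are placeholders for your variables; the user-written xtabond2 command is a popular alternative that implements both):

Stata commands:
* Difference GMM (Arellano-Bond)
xtabond dv iv1 iv2, lags(1)
estat sargan
estat abond
* System GMM (Blundell-Bond)
xtdpdsys dv iv1 iv2, lags(1)
estat sargan
estat abond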
D2(b) Serial Correlation: Arellano-Bond test for zero autocorrelation in first-differenced errors:*
*Serial correlation is when error terms from different (usually adjacent) time periods (or cross-section
observations) are correlated.*
*Arellano and Bond (1991) propose a test to detect serial correlation in the disturbances.*
*The test for AR(2) in first differences is more important because it will detect autocorrelation in
levels (Mileva, 2007).*
*Decision rule: p-value < 0.05 = Serial Correlation problem; p-value > 0.05 = no serial correlation
problem*
*Strategy to rectify the problem: If the diagnostic checks indicate serial correlation problem, DGMM or
SGMM with the robust option will be used to rectify the problem*
estat abond
Model 1 Model 2
SIZE -19.90*** -23.33***
(-6.53) (-5.90)
INTANG -0.03* -0.02
(-1.90) (-1.39)
IO 6.08 6.85**
(1.62) (2.05)
CINV -0.01 -0.01
(-0.69) (-0.52)
TID 126.43*** .
(24.61) .
LA 0.06* 0.05
(1.72) (1.42)
Constant -141.23*** 120.40***
(-10.99) (7.77)
N 952 632
R2 0.43 0.41
R2_a 0.27 0.24
R2_o - -
F 31.18 24.15
Chi2 - -
p-value 0.00 0.00
Notes:
(1) t statistics in parentheses, (2) * p < 0.1, ** p < 0.05, *** p < 0.01, (3) TID = time in distress, CINV = change in investment, IO = investment
opportunities, INTANG = intangible assets, TANG = tangible assets, LA = liquid assets and SIZE = firm size.
(4) N = number of observations, R2 = R-squared, R2_a = adjusted R-squared, R2_o = Overall R-squared
(5) Model 1 = Overall Sample, Model 2 = Mixed Sector
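A regression table in the format above can be produced with the user-written estout/esttab package (an assumption: esttab is only one of several table tools; install it with "ssc install estout"; the stored model names m1 and m2 are illustrative):

Stata commands:
quietly xtreg dviv, fe
estimates store m1
quietly xtreg dviv if cutoff~=1, fe
estimates store m2
esttab m1 m2, t star(* 0.10 ** 0.05 *** 0.01) r2 ar2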
6. Regression Table.
a) Descriptive statistics
b) Optimal Model
c) Panel specification tests (p-value of the lagged DV, F-test, BP-LM test & Hausman test)
d) Diagnostic tests
e) Regression result: goodness-of-fit measures & Parameter Estimates of Regressors
i) Number of observations
ii) Number of groups / number of entities (firms/countries)
iii) Prob > chi2 or Prob > F (significance of the model).
Model Significance: Independent variables should be jointly significant to influence or
explain the dependent variable.
Decision rule: p-value < 0.05 = model is statistically significant; p-value > 0.05 = model
is not statistically significant.
iv) R-squared (R2)
R2 shows the amount of variance in the dependent variable explained by the predictors.
Adjusted R-squared: a modified version of R-squared that has been adjusted for the number
of predictors in the model. The adjusted R-squared increases only if a new term improves the
model more than would be expected by chance; it decreases when a predictor improves the
model by less than expected by chance.
Which R2 to report?
POLS: Adjusted R2
RE: Overall R2
FE: Adjusted R2. Following the suggestion by Torres-Reyna (2007), the adjusted R2
from the “areg dviv, absorb(ccode)” command should be used.
*Explanation on R2: The Stata command “xtreg” fits various panel data models,
including fixed and random effects models. For the fixed-effects model, “xtreg”
estimates within-group variation by computing the differences between observed
values and their means. This model produces correct parameter estimates (Torres-
Reyna, 2007); however, due to the larger degrees of freedom, its standard errors
and, consequently, R-squared statistic are incorrect. That is, the R-squared statistic
labelled “R-sq: within =” is not correct. As a remedy, please use the Stata command
“areg”, which will produce a more accurate estimate of the R-squared (Torres-Reyna,
2007). As for the random-effects model, the “overall” R-squared should be
reported.*
References (selected)
Hinkle, D. E., Wiersma, W., & Jurs, S. G. (2003). Applied statistics for the behavioral sciences (5th ed.).
Hoechle, D. (2007). Robust standard errors for panel regressions with cross-sectional dependence. Stata Journal, 7(3), 281.
O’Brien, R. M. (2007). A caution regarding rules of thumb for variance inflation factors. Quality & Quantity, 41(5), 673–690.
Park, H. M. (2011). Practical guides to panel data modeling: A step-by-step analysis using Stata. Public Management and Policy Analysis Program, Graduate School of International Relations, International University of Japan.
Wooldridge, J. M. (2015). Introductory econometrics: A modern approach. Nelson Education.