Norhisam Bulot
Faculty of Business and Management
Universiti Teknologi MARA Perlis
PEDAGOGY
To achieve the above objectives, this half-day (or one-day) session will focus on hands-on training in the
use of Stata. It assumes some basic knowledge of the research process and statistics.
1. Types of data
1.1 Cross-Sectional Data
● A sample of individuals, households, firms, countries, cities, or any other type of unit, observed at a specific point in time
● Denoted by the subscript i
● Example: Data on share price (N = 100 companies)
● Example: Determinants of students’ academic performance (2019)
● Model: Yi = β0 + β1X1i + β2X2i + β3X3i + β4X4i + β5X5i + εi
1.3 Panel Data
1.3.1 Balanced
● Each unit i is observed for the same number of time periods t
COMPANY YEAR Y X1 X2 X3 X4
ABRAR CORPORATION BERHAD 5 307.56 103.2 25.43 0.56 74.22
ABRAR CORPORATION BERHAD 4 170.42 -4.83 88.03 0.55 -32.25
ABRAR CORPORATION BERHAD 3 120.56 -111.5 -57.81 -0.52 -292.7
ABRAR CORPORATION BERHAD 2 102.73 -151.3 70.97 -1.4 -541.9
ABRAR CORPORATION BERHAD 1 470 96 111.07 -2.93 3.78
ACTACORP HOLDINGS BERHAD 5 672.69 98.49 9 0.6 54.27
ACTACORP HOLDINGS BERHAD 4 511.85 78.19 45.41 0.72 24.53
ACTACORP HOLDINGS BERHAD 3 67.64 -134.5 104.8 -0.29 -448.3
ACTACORP HOLDINGS BERHAD 2 42.94 -172.9 -39.43 -1.07 -758.5
ACTACORP HOLDINGS BERHAD 1 32.42 -196.5 8.21 -1.59 -900.2
AKN TECHNOLOGY BHD 5 396.55 176.07 -31.53 0.87 108.76
AKN TECHNOLOGY BHD 4 626.4 265.66 -47.19 0.82 89.73
AKN TECHNOLOGY BHD 3 465.34 217.29 14.97 0.82 106.69
AKN TECHNOLOGY BHD 2 494.34 109.31 88.88 0.79 89.94
AKN TECHNOLOGY BHD 1 495.98 111.16 -3.53 0.81 65.99
1.3.2 Unbalanced
● Each unit i may be observed for a different number of time periods t
COMPANY YEAR Y X1 X2 X3 X4
ABRAR CORPORATION BERHAD 4 307.56 103.2 25.43 0.56 74.22
ABRAR CORPORATION BERHAD 3 120.56 -111.5 -57.81 -0.52 -292.7
ABRAR CORPORATION BERHAD 2 102.73 -151.3 70.97 -1.4 -541.9
ABRAR CORPORATION BERHAD 1 470 96 111.07 -2.93 3.78
ACTACORP HOLDINGS BERHAD 3 67.64 -134.5 104.8 -0.29 -448.3
ACTACORP HOLDINGS BERHAD 2 42.94 -172.9 -39.43 -1.07 -758.5
ACTACORP HOLDINGS BERHAD 1 32.42 -196.5 8.21 -1.59 -900.2
AKN TECHNOLOGY BHD 5 396.55 176.07 -31.53 0.87 108.76
AKN TECHNOLOGY BHD 4 626.4 265.66 -47.19 0.82 89.73
AKN TECHNOLOGY BHD 3 465.34 217.29 14.97 0.82 106.69
AKN TECHNOLOGY BHD 2 494.34 109.31 88.88 0.79 89.94
AKN TECHNOLOGY BHD 1 495.98 111.16 -3.53 0.81 65.99
Research Topic?
2. Data collection & preparation: Important steps
a. Determine the variables to be included in your study:
● Dependent variable
● Independent variables
● Moderating variables
● Intervening / mediating variables
b. Determine the “most appropriate” proxy / formula for each variable.
c. Determine the sources of data.
● Osiris
● Datastream
● Bank Negara Malaysia (BNM)
● Worldbank
● Bankscope
● Department of Statistics Malaysia
● Etc...
d. Collecting the data
e. Checking your data (completeness, consistency and accuracy).
f. Data transformation (if needed … please refer to section 4.4)
3. Data analysis steps: Short Panel (Micro panel)

Static model:
Variable selection (vselect)
Step S1: Panel specification tests (POLS vs. FE vs. RE)
Step S2: Diagnostic tests (static model): multicollinearity, heteroskedasticity, serial correlation, normality*
Step S3: Strategy to rectify the problem(s)
Step S4: Report the findings

Dynamic model:
Variable selection (vselect)
Step D1: DGMM vs. SGMM
Step D2: Diagnostic tests (dynamic model): autocorrelation, Sargan test, number of instruments, normality*
Step D3: Strategy to rectify the problem(s)
Step D4: Report the findings
4. Introduction to Stata (Stata 14)
4.1 How to open Stata: run StataMP.exe
[Screenshot of the Stata interface: the Review window displays the list of commands that you have already entered; the Variables window displays the list of variables; the Results window displays each entered command, followed by its corresponding output/results.]
4.1.2.2 Using an existing log file*
If you want to add more Stata output to an EXISTING log file, use the “append” option.
Stata command: log using yourfilename.log, append
4.1.3.3 Importing data from Excel
Stata expects a single matrix or table of data from a single sheet, with at most one line of text at the start
defining the contents of the columns. Using your Windows computer:
● Start Excel.
● Open the dataset file (example: Data).
● Highlight the data of interest, then pull down Edit and choose Copy (Ctrl+C).
● Click “Data Editor” or type “edit” in the Command window.
● Select the first column, paste the data & choose “Variable names”.
● Save the data.
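As an alternative to copy-and-paste, the workbook can be read directly with Stata’s built-in import excel command (a sketch; the file name, sheet name and the assumption that variable names sit in the first row are illustrative, so adjust them to your file):

Stata commands:
import excel using "Data.xlsx", sheet("Sheet1") firstrow clear
describe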
● Use a Stata command to transform the string variable into a numeric one: encode company, gen(ccode)
4.4 Data transformation (Optional)
In data analysis, data transformation is the replacement of a variable by a function of that variable: for
example, replacing a variable “x” by the square root of “x” or the logarithm of “x”. In a stronger sense, a
transformation is a replacement that changes the shape of a distribution or relationship.
You should consider these things before transforming your variable:
● Reasons for using transformations
● Transformations for proportions and percents
● Transformations for variables that are both positive and negative
Common transformations (replace “varname” with your variable):
● Standardize
● Log (please be aware of the conditions for transforming your variable into log)
● First difference
● Example: to transform lev, prof, io and intang into logs, apply the log transformation to each of the variables in turn.
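These transformations can be sketched with the following commands (a sketch, not the only way; the generated variable names are assumptions, and the first-difference operator requires the data to be xtset or tsset first):

Stata commands:
* Log (only valid for strictly positive variables)
generate lvarname = ln(varname)
* Standardize (mean 0, standard deviation 1)
egen zvarname = std(varname)
* First difference (requires xtset / tsset data)
generate dvarname = D.varname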
Further readings:
a) Log transformation and its implications for data analysis
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4120293/
4.5 Declare data set to be panel data
The next step in working with panel data is to declare the panel structure of your data. Once your
dataset is declared as panel data, you can take advantage of the Stata tools for working with
panel data, and for estimating equation specifications that use the panel structure.
Stata command: xtset ccode year
5. Data analysis
5.1 STEP 1: Descriptive analysis
Objective: to describe your data / getting a feel for your data
Stata command (here and elsewhere, “dviv” is shorthand for your dependent variable followed by the independent variables):
sum dviv
xtsum dviv
Notes for xtsum
Stata lists three different types of statistics: “overall”, “between”, and “within”.
(1) Overall statistics are ordinary statistics based on all of your observations & are the same as those in the sum output above.
(2) Between statistics are calculated from the summary statistics of the entities (e.g., companies or countries), regardless of time.
(3) Within statistics are calculated from the summary statistics of the time periods, regardless of entities.
“… The summary statistics of the dependent and independent variables over the sample period are
presented in Table 1…”
5.2 STEP 2: Variable selection
Stata command:
vselect dviv, best
Decision rule: Choose the model with the lowest AICC (Yang, 2005).
“… As shown in Table 2, the optimal number of variables is xxx. The chosen variables are xxx…”
Table 2: Variable Selection
Models R2ADJ C AIC AICC BIC # IVs Optimal Model
Model 1
Model 2
5.3 STEP 3: Choosing the most appropriate panel data model (optional)
Options:
a) Consider both Static and Dynamic
b) Choose Static (please proceed to section 5.3.2)
c) Choose Dynamic (please proceed to section 5.3.3)
Decision rule: p-value of the lagged dependent variable < 0.05 = dynamic; > 0.05 = static (Brañas-Garza et al., 2011)*
*For the static model, please proceed to section 5.3.2*
*For the dynamic model, please proceed to section 5.3.3*
5.3.2 Static Model
Tests:
F-test, Breusch and Pagan Lagrangian multiplier test (BP-LM test) & Hausman test
i. F-test: To choose between POLS and FE
ii. BP-LM Test: To choose between POLS and RE
iii. Hausman Test: To choose between FE & RE
Three options:
i. Option 1: Perform all tests: F-test, BP-LM test & Hausman test (Park, 2011)
ii. Option 2: Perform the Hausman test only
iii. Option 3: The choice of technique (between FE & RE) is determined by whether you think there
are unobserved, individual-specific (and time-invariant) factors that influence your outcome and
are correlated with the X variable(s) of interest. If there are, use FE; if not, use RE. (Please refer
to the basic assumptions for POLS, FE & RE below.)
Assumptions:
i. POLS: Basic assumption: Pooled Ordinary Least Squares (POLS) is a pooled linear
regression without fixed and/or random effects. It assumes that the SLOPES AND INTERCEPTS
OF FIRMS ARE CONSTANT across individuals and time. In general, OLS ignores the
individual and time effects.
ii. RE: Basic assumption: there is no omitted variable (OV), or the omitted variable is UNCORRELATED with
the explanatory variables (IVs) in the model.
iii. FE: Basic assumption: there is an omitted variable, and it is CORRELATED with the
explanatory variables (IVs). FE provides a means of controlling for the effect of the OV; that is,
whatever effect the OV has on the subject at one time, it will have the same effect at a later
time (fixed effect).
“… The result of the panel specification tests as presented in Table 3 suggests that xxxx model is the
most appropriate data analysis technique…”
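The S1 tests can be run as the following sequence (a sketch; replace dviv with your dependent and independent variables):

Stata commands:
* S1(a) F-test (POLS vs. FE): reported at the bottom of the FE output as "F test that all u_i=0"
xtreg dviv, fe
estimates store fixed
* S1(b) BP-LM test (POLS vs. RE)
quietly xtreg dviv, re
xttest0
* S1(c) Hausman test (FE vs. RE), comparing the stored FE estimates with the current RE estimates
hausman fixed, sigmamore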
5.3.2.1.2 S1(b): BP-LM test
● Stata command:
quietly xtreg dviv, re
xttest0
● Objective: Choosing between POLS and RE
● Decision rule: p-value > 0.05 = POLS; p-value < 0.05 = RE
● Stata command (the FE estimates must first be stored as “fixed”, with the RE model estimated last):
quietly xtreg dviv, fe
estimates store fixed
quietly xtreg dviv, re
hausman fixed, sigmamore
● Objective: Choosing between FE and RE
● Decision rule: p-value < 0.05 = FE; p-value > 0.05 = RE
Outcome 3: F-test p > 0.05 (POLS); BP-LM test p < 0.05 (RE); Hausman test not required; conclusion: RE
5.3.2.2 S2: Diagnostic tests (static model)
Objective: to investigate the existence of severe multicollinearity, heteroskedasticity & serial correlation problems.
Importance of diagnostic tests: without verifying that your data meet the assumptions of linear
regression, your results may be misleading.
“… As presented in Table 4, the diagnostic tests on the baseline model indicate the presence of xxx
problem(s). Following the suggestion by Hoechle (2007), the remedial procedure has been carried
out by using the xxxxx…”
5.3.2.2.1 Multicollinearity
Multicollinearity exists when two or more of the predictors in a regression model are moderately or
highly correlated.
Potential solutions to the problem: (1) do nothing; (2) remove some of the highly correlated independent
variables; (3) linearly combine the independent variables, for example by adding them together; (4) perform an
analysis designed for highly correlated variables, such as principal components analysis or partial least
squares regression; or (5) another relevant strategy.
(http://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/)
Techniques: (1) Variance Inflation Factors (VIF) & (2) Pearson Correlation.
5.3.2.2.1.1 Variance Inflation Factors (VIF)
Stata command:
quietly regress dviv
vif
Decision rule: VIF > 10 = severe multicollinearity problem; VIF < 10 = no multicollinearity problem.
The cut-off point of 10 is the one most commonly used by practitioners (O’Brien, 2007).
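The second technique listed above, the Pearson correlation, can be obtained with the built-in pwcorr command (iv1 iv2 iv3 below are placeholders for your independent variables):

Stata command:
pwcorr iv1 iv2 iv3, sig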
Decision rule (Pearson correlation): coefficient > 0.90 = very high correlation, i.e. a severe multicollinearity
problem exists; coefficient < 0.90 = no multicollinearity problem.
The cut-off point of 0.90 is suggested and used by many researchers (e.g., Hinkle et al., 2003;
Mukaka, 2012).
5.3.2.2.2 Homoscedasticity
The variance of the error term is constant. The assumption of homoscedasticity fails when the variance
changes (variance is not constant).
“…homoscedasticity of the regression error term: its variance is assumed to be constant in the
population, conditional on the explanatory variables. The assumption of homoscedasticity fails when the
variance changes in different segments of the population: for instance, if the variance of the unobserved
factors influencing individuals' saving increases with their level of income. In such a case, we say that the
error process is heteroskedastic...” (Wooldridge, 2015)
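For a fixed-effects panel model, one common test of this assumption is the Modified Wald test for groupwise heteroskedasticity, available through the user-written xttest3 command (an assumption: any other suitable test may be used instead; install it with "ssc install xttest3"):

Stata commands:
quietly xtreg dviv, fe
xttest3

Decision rule: p-value < 0.05 = heteroskedasticity problem; p-value > 0.05 = no heteroskedasticity problem.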
Serial correlation: the relationship between a given variable and itself over various time intervals, i.e. the
error in one period influences the error in a subsequent period – the next period or beyond.
Objective: To check the existence of serial correlation or auto correlation in the residuals.
*Decision rule: p-value > 0.05 = No serial correlation problem; p-value < 0.05 = serial correlation
problem exist*
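The serial correlation check above can be carried out with, for example, the Wooldridge test via the user-written xtserial command (an assumption: other tests are possible; install it with "ssc install xtserial"; dv iv1 iv2 are placeholders for your variables):

Stata command:
xtserial dv iv1 iv2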
5.3.2.2.3 Normality*
Two options available: (1) test the assumption & (2) skip the test.
Option (1): Test the assumption using the Stata command xtsktest or any other “suitable” test(s).
Decision Rule:
p-value < 0.05 = we can REJECT the data is normally distributed (your data is not normally
distributed).
p-value > 0.05 = we cannot reject the hypothesis that the data is normally distributed (Your data
is normally distributed).
Option (2): Skip the test (normality of the residuals will not be tested).
The residual is assumed to be normally distributed (minimum observations required: 30 or 100). Kindly
read the following explanation on the “Central Limit Theorem”.
Central Limit Theorem.
Central limit theorem (CLT) is used to justify the action of not testing the normality assumptions
a) The CLT basically states that as your sample size increases, the distribution of the sample mean of a
parameter becomes normally distributed, regardless of the underlying distribution of data. (This is a
bit of an oversimplification, but I believe it's suitable for this discussion).
“…Therefore, it is reasonable to “assume” that if your sample is 30 or greater, your mean has a
normal distribution with sample variance equal to population variance divided by sample size
(sigma^2/n)…”
“…Please note that this is still an “assumption”. It is not "ensured", and it is not testable in any
specific instance without repeatedly sampling a population. But while it is an assumption, it is a very
reasonable assumption based on previous statistical research…”
“…Other misinterpretations, misapplications, or misuses of the central limit theorem (specifically here,
or generally elsewhere) may be the product of ignorance or sloth on the part of the researcher, BUT
assuming that the mean of a sample with n ≥ 30 has an approximately normal distribution is not the product
of ignorance or sloth…”
Possible remedies.
a) Remove the outliers (extreme values)
b) Transform the data (example: log transformation)
c) Other strategy.
You may remove / exclude the outliers from your analysis by using the following Stata commands.
Step 1: Identify the outliers (the conventional cut-off for Cook’s distance is 4/N, where N is the number of observations; the condition “d1 < .” prevents missing values from being flagged):
regress dviv
predict d1, cooksd
quietly generate cutoff = d1 > 4/_N & d1 < .
list ccode d1 if cutoff
Step 2: Re-estimate the model without the outliers.
*Original command: xtreg dviv, fe
*Without outliers: xtreg dviv if cutoff~=1, fe
Decision rules for selecting the most appropriate strategy to rectify the problem(s), following the
suggestion by Hoechle (2007):
[Table excerpt — row f: RE* with a serial correlation problem: Random-effects GLS regression with the cluster option — xtreg dviv, re cluster(ccode)]
Explanation:
"Economic theory rarely gives any reason to believe that the errors are homoscedastic. It is therefore
prudent to assume that the errors might be heteroskedastic unless you have compelling reasons to believe
otherwise. (...) If the homoscedasticity-only and heteroskedasticity-robust standard errors are the same,
nothing is lost by using the heteroskedasticity-robust standard errors; if they differ, however, then you
should use the more reliable ones that allow for heteroskedasticity. The simplest thing, then, is always to
use the heteroskedasticity-robust standard errors…” (Stock & Watson, 2007)
5.3.3 DYNAMIC MODEL
The Difference and System GMM estimators are designed for panel analysis and embody the following
assumptions about the data-generating process (Baltagi & Giles, 1998):
● The process may be dynamic, with current realizations of the dependent variable influenced by
past ones
● There may be arbitrarily distributed fixed individual effects. This argues against cross-section
regressions, which must essentially assume fixed effects away, and in favor of a panel setup,
where variation over time can be used to identify parameters.
● Some regressors may be endogenous.
● The idiosyncratic disturbances (those apart from the fixed effects) may have individual-specific
patterns of heteroskedasticity and serial correlation.
● The idiosyncratic disturbances are uncorrelated across individuals.
Also, some secondary concerns shape the design:
● Some regressors may be predetermined but not strictly exogenous: although independent of current
disturbances, they can be influenced by past ones. The lagged dependent variable is an example.
● The number of time periods of available data, T, may be small. (The panel is “small T, large N”.)
“…Another important step in the data analysis process is to choose the most appropriate dynamic
panel data estimator. The two available estimators are difference GMM and System GMM…”
“…The dynamic structure of the model makes the OLS estimator upward biased and inconsistent,
since the lagged dependent variable is correlated with the error term. The within transformation does not
solve the problem because of a downward bias (Nickell, 1981). A possible solution is represented by the
Generalized Method of Moments (GMM) technique (Presbitero, 2008). Dynamic panel data models are
useful when the dependent variable depends on its own past realizations (Brañas-Garza et al., 2011)…”
“…Although the General Method of Moments (GMM) is the method of estimation of dynamic
panel models that provides consistent estimates, one still has to decide whether to use Difference GMM
(DGMM) or System GMM (SGMM) (Efendic, Pugh, & Adnett, 2009)…”
“…The SGMM estimator has an advantage over DGMM for variables that are random walks or
close to random walks (Efendic et al., 2009). One way to identify this is by looking at the
coefficient of the lagged dependent variable and the p-value of the Sargan test. The closer the value to one
(more than 0.9), the stronger the indication that the SGMM estimator is needed, so SGMM is preferable to
DGMM. The DGMM was found to be inefficient since it did not make use of all available moment
conditions (Ahn & Schmidt, 1995). In addition, it also had very poor finite-sample properties in
dynamic panel data models with highly autoregressive series and a small number of time-series
observations. Due to these limitations, following the suggestion by Arellano and Bover (1995), Blundell
and Bond (1998) proposed taking into consideration extra moment conditions from the level equations that
rely on certain restrictions on the initial observations. The resulting SGMM has been shown to perform
much better than the DGMM estimator in terms of finite-sample bias and mean squared error, as well as with
regard to coefficient standard errors, since the instruments used for the level equation remain
informative as the autoregressive coefficient approaches unity (see Blundell and Bond, 1998). As a
result, the SGMM estimator has been widely used for estimation (Sun and Ashley, 2014)…”
“…The choice of an appropriate estimator is very important, as the dynamic structure of the
model will make the static estimator upward biased and inconsistent, since the lagged level of the dependent
variable is correlated with the error term. The within transformation does not solve the problem because
of a downward bias (Nickell, 1981) and inconsistency (Presbitero, 2008). A possible solution is
represented by the Generalized Method of Moments (GMM) technique (Presbitero, 2008)…”
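In Stata, the two estimators can be sketched with the official commands below (dv, iv1 and iv2 are placeholders for your variables; the user-written xtabond2 command is a popular alternative that implements both):

Stata commands:
* Difference GMM (Arellano-Bond)
xtabond dv iv1 iv2, lags(1)
estat sargan
estat abond
* System GMM (Blundell-Bond)
xtdpdsys dv iv1 iv2, lags(1)
estat sargan
estat abond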
D2(b) Serial Correlation: Arellano-Bond test for zero autocorrelation in first-differenced errors:*
*Serial correlation is when error terms from different (usually adjacent) time periods (or cross-section
observations) are correlated.*
*Arellano and Bond (1991) propose a test to detect serial correlation in the disturbances.*
*The test for AR(2) in first differences is more important because it will detect autocorrelation in
levels (Mileva, 2007).*
*Decision rule: p-value < 0.05 = Serial Correlation problem; p-value > 0.05 = no serial correlation
problem*
*Strategy to rectify the problem: If the diagnostic checks indicate serial correlation problem, DGMM or
SGMM with the robust option will be used to rectify the problem*
estat abond
Model 1 Model 2
SIZE -19.90*** -23.33***
(-6.53) (-5.90)
INTANG -0.03* -0.02
(-1.90) (-1.39)
IO 6.08 6.85**
(1.62) (2.05)
CINV -0.01 -0.01
(-0.69) (-0.52)
TID 126.43*** .
(24.61) .
LA 0.06* 0.05
(1.72) (1.42)
Constant -141.23*** 120.40***
(-10.99) (7.77)
N 952 632
R2 0.43 0.41
R2_a 0.27 0.24
R2_o - -
F 31.18 24.15
Chi2 - -
p-value 0.00 0.00
Notes:
(1) t statistics in parentheses, (2) * p < 0.1, ** p < 0.05, *** p < 0.01, (3) TID = time in distress, CINV = change in investment, IO = investment
opportunities, INTANG = intangible assets, TANG = tangible assets, LA = liquid assets and SIZE = firm size.
(4) N = number of observations, R2 = R-squared, R2_a = adjusted R-squared, R2_o = Overall R-squared
(5) Model 1 = Overall Sample, Model 2 = Mixed Sector
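A regression table in the format above can be produced with the user-written estout/esttab package (an assumption: esttab is only one of several table tools; install it with "ssc install estout"; the stored model names m1 and m2 are illustrative):

Stata commands:
quietly xtreg dviv, fe
estimates store m1
quietly xtreg dviv if cutoff~=1, fe
estimates store m2
esttab m1 m2, t star(* 0.10 ** 0.05 *** 0.01) r2 ar2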
6. Regression Table.
a) Descriptive statistics
b) Optimal Model
c) Panel specification tests (p-value of the lagged DV, F-test, BP-LM test & Hausman test)
d) Diagnostic tests
e) Regression result: goodness-of-fit measures & Parameter Estimates of Regressors
i) Number of observations
ii) Number of groups / number of entities (firms/countries)
iii) Prob > chi2 or Prob > F (significance of the model).
Model Significance: Independent variables should be jointly significant to influence or
explain the dependent variable.
Decision rule: p-value < 0.05 = model is statistically significant; p-value > 0.05 = model
is not statistically significant.
iv) R-squared (R2)
R2 shows the amount of variance in the dependent variable explained by the predictors.
Adjusted R-squared: a modified version of R-squared that has been adjusted for the number
of predictors in the model. The adjusted R-squared increases only if a new term improves the
model more than would be expected by chance; it decreases when a predictor improves the
model by less than expected by chance.
Which R2 to report?
POLS: Adjusted R2
RE: Overall R2
FE: Adjusted R2. Following the suggestion by Torres-Reyna (2007), the adjusted R2
from the “areg dviv, absorb(ccode)” command should be used.
*Explanation on R2: The Stata command “xtreg” fits various panel data models,
including fixed and random effects models. For the fixed-effects model, “xtreg”
estimates within-group variation by computing the differences between observed
values and their means. This model produces correct parameter estimates (Torres-
Reyna, 2007); however, due to the larger degrees of freedom, its standard errors
and, consequently, R-squared statistic are incorrect. That is, the R-squared statistic
labelled “R-sq: within =” is not correct. As a remedy, please use the Stata command
“areg”, which will produce a more accurate estimate of the R-squared (Torres-Reyna,
2007). As for the random-effects model, the “overall” R-squared should be
reported.*
References (selected)
Hinkle, D. E., Wiersma, W., & Jurs, S. G. (2003). Applied statistics for the behavioral sciences (5th ed.).
Hoechle, D. (2007). Robust standard errors for panel regressions with cross-sectional dependence. Stata Journal, 7(3), 281.
O’Brien, R. M. (2007). A caution regarding rules of thumb for variance inflation factors. Quality & Quantity, 41(5), 673–690.
Park, H. M. (2011). Practical guides to panel data modeling: A step-by-step analysis using Stata. Public Management and Policy Analysis Program, Graduate School of International Relations, International University of Japan.
Wooldridge, J. M. (2015). Introductory econometrics: A modern approach. Nelson Education.