
DATA ANALYSIS

Dr. Shahab Aziz
Bahria University, Islamabad

Outline
• Regression
• Assumptions of regressions
• Time series analysis
• Descriptive Analysis
• Diagnostic tests
• Graphs
• Panel Data Analysis
• Importing results
• Download data from WDI and perform Panel Data Analysis
Types of Data
• Primary Data [Survey]
  • Collected by the researcher
• Secondary Data [GDP, inflation, etc.]
  • Available in published sources and databases
  • IMF/WB/SB etc.

Time series Data

Cross sectional Data

Panel Data

Cross-sectional Data
Cross-sectional data, or a cross section of a study population, in statistics and econometrics is a type of data collected by observing many subjects (such as individuals, firms, countries, or regions) at one point or period of time.
Example: GDP of three developing countries in 1990.

Regression and Correlation

Regression
• One of the most frequently used techniques in research.
• Analyses the relationship between independent and dependent variables.
• The dependent variable is the outcome variable.
• The independent variables are used to explain that outcome.

Benefits of Regression
• Determine whether one independent variable, or a set of independent variables, has a significant relationship with the dependent variable.
• Estimate the relative strength of the effects of different independent variables on a dependent variable.
• Make predictions.

• The 'independent' variable X is usually called the regressor (there may be one or more of these); the 'dependent' variable Y is the response variable.

• Least Squares Method:
• The regression equation of X on Y is: X = a + bY
• Where,
• X = Dependent variable and Y = Independent variable

• The regression equation of Y on X is:
• Y = a + bX
• Where,
• Y = Dependent variable
• X = Independent variable
• a and b are parameters whose values need to be found.
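
A minimal Python sketch of the least squares method for Y = a + bX with one independent variable. The function name fit_line and the sample numbers are illustrative assumptions, not taken from the slides; the data are constructed so that Y = 1 + 2X exactly.

```python
def fit_line(x, y):
    """Return least-squares estimates (a, b) for Y = a + bX."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # b = sum of cross-deviations / sum of squared deviations of X
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    # the fitted line passes through the point of means
    a = mean_y - b * mean_x
    return a, b


# Made-up illustrative data lying exactly on Y = 1 + 2X
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]
a, b = fit_line(x, y)
print(a, b)  # intercept 1.0 and slope 2.0
```

Swapping the two arguments, fit_line(y, x), gives the regression of X on Y instead.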

Assumptions of the Classical Linear Regression Model

1. The model should be linear in parameters.
2. No independent variable has a linear relationship with any other independent variable. [Violation: multicollinearity]
3. No independent variable is correlated with the error term e.
4. Error term observations are independent of each other. [Violation: autocorrelation]
5. The mean of the error term is zero.
6. The error term has constant variance. [Violation: heteroscedasticity]
7. The error term is normally distributed.

1. Model should be linear in parameters

Y = a + bX

Y and X are variables.

a and b are parameters [whose values we need to find].

Y = a + (1/b)X or Y = a + b²X [cannot be estimated this way]

The model can be non-linear in variables [variables can be raised to a power].

Linear regression cannot estimate non-linear parameters.
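
To illustrate "linear in parameters but non-linear in variables", here is a short sketch assuming NumPy is available. The model Y = a + bX² is non-linear in X, yet ordinary least squares still applies because a and b enter linearly. The numbers are made up for illustration.

```python
import numpy as np

# Made-up data generated from Y = 2 + 3 * X^2 (non-linear in the variable X)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x**2

# Treat X^2 as the regressor: the model is still linear in a and b
design = np.column_stack([np.ones_like(x), x**2])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
a, b = coef
print(a, b)  # close to 2 and 3
```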

3. None of the independent variables is correlated with e.

Suppose Y = a + bX + e.

If there is a positive correlation between X and e, then as X increases e also increases. The coefficient [b] will then be overestimated because it is also picking up the effect of the error.

This is also called endogeneity.

5. Mean of error Term is zero.

7. The error term is normally distributed.

Gauss-Markov Theorem: If the first 6 assumptions hold, the estimates of a and b will be BLUE (best linear unbiased estimators).

Central Limit Theorem: As the sample size increases, the sampling distribution of the estimates approaches a normal distribution.

Normality is not required for estimation, as the first 6 assumptions are enough. However, it is important for this assumption to hold when checking significance.

How to Enter Time Series Data

How to Enter Panel Data

Three Countries

Command for entering variables

1 is for country 1
2 for country 2
3 for country 3

Importing Data from Excel

Multicollinearity: Its Detection & Removal

Independent variables should not be highly correlated.

Y = a + bX1 + cX2

Multicollinearity Explained
SC = a + b1 PM + b2 FI
(Student Consumption = a + b1 × Pocket Money + b2 × Father's Income)

If father's income increases by 1 unit, student consumption will increase by b2, keeping other factors constant.

Pocket money is also related to father's income. [When income increases, pocket money also increases, hence SC increases.] A change in father's income therefore affects SC not only through b2 but also through b1, so the two effects cannot be cleanly separated.

Violation of Assumption

When the IVs are correlated, it is called multicollinearity.

This issue makes the coefficients inefficient.

We measure it through the Variance Inflation Factor (VIF).

VIF = 1/(1 − R²)
VIF is related to the independent variables only.
Y = a + b1X + b2Z + b3A

X = a + b2Z + b3A → R² [VIF of X]

Z = a + b1X + b3A → R² [VIF of Z]

We have to calculate it for each independent variable.
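
The VIF calculation can be sketched in Python as follows, assuming NumPy is available. Each independent variable is regressed on the others, and VIF = 1/(1 − R²) is taken from that auxiliary regression. The helper names r_squared and vif_table and the simulated data are illustrative assumptions; X is deliberately built to be highly correlated with Z.

```python
import numpy as np


def r_squared(target, others):
    """R-squared from regressing `target` on `others` plus a constant."""
    design = np.column_stack([np.ones(len(target)), others])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    ss_res = float(residuals @ residuals)
    ss_tot = float(((target - target.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


def vif_table(data):
    """data maps a variable name to a 1-D array; returns name -> VIF."""
    names = list(data)
    result = {}
    for name in names:
        others = np.column_stack([data[o] for o in names if o != name])
        result[name] = 1.0 / (1.0 - r_squared(data[name], others))
    return result


# Simulated (made-up) independent variables
rng = np.random.default_rng(0)
z = rng.normal(size=200)
a_var = rng.normal(size=200)
x = 0.9 * z + 0.1 * rng.normal(size=200)  # X is nearly a copy of Z

vifs = vif_table({"X": x, "Z": z, "A": a_var})
for name, value in vifs.items():
    print(name, round(value, 2))
```

In this simulated case X and Z should show large VIFs while A stays near 1, which is the pattern the cut-offs (10, 5, or 3.3) are meant to flag.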

Removal

• Leave the model as it is if VIF is below the chosen cut-off (10, 5, or 3.3).
• Remove the variable [if it is a control variable]. The main/focus variable cannot be removed.
• Change the measure, e.g. student performance [CGPA, final marks, quiz, assignment].
• The variable can be taken in growth form, etc.
• Increase the sample size.

Autocorrelation: Detection and Removal

Autocorrelation
• The assumption is that error term observations are independent of each other, i.e. they are not correlated with each other.
• When two variables move closely together, we call it correlation.

• When one variable is correlated with its own lagged value, we call it "autocorrelation".
• GDP, savings, income, consumption, investment, etc. will mostly be correlated with their previous values.

X     X(t-1)
10    8
20    18
30    27
40    38
50    47

• Y = a + b1X + b2Z + b3A + e
• e is the error term: it contains the variables which are not a part of the model.
• Error means a mistake; it is not intentional and should not be consistent.
• The error should be random; there should not be any trend in the error term.
• For example:
• Y = a + b1X + e [There was a variable related to Y, i.e. Z, and we have not included it in the model. Z is omitted and is now a part of the error term. Z was important for Y.]

Autocorrelation
• If variables are autocorrelated, it is not an issue.
• Z is in the error term.
• Z is a time series and has autocorrelation.
• Z is consistent, which creates a trend; as Z is captured in the error term, we see the error term correlated with its own lagged values: AUTOCORRELATION.
• There is an issue when error terms are correlated. This violates the assumption that the error should be random.
• However, the variables themselves can be autocorrelated.

Problem it Creates

• A variable which was significant before can become insignificant due to autocorrelation, hence the results are not reliable.
• Y − Ŷ = error [actual − estimated = error]
• The error is consistent, showing the same sign again and again (e = +, e = +, e = +).
• Autocorrelation can be positive [+ auto] or negative [− auto].

How to Detect

• Durbin-Watson range [0 to 4]
• Value of 2 = no autocorrelation
• 0 to 2 [+ auto]
• 2 to 4 [− auto]
• If the value of DW is less than or greater than 2, then we check the severity of the autocorrelation through the serial correlation LM test.

Serial Correlation LM Test
H0: No autocorrelation [P > 0.05: do not reject H0]
H1: Autocorrelation exists [P < 0.05: reject H0]
If the DW value differs from 2 and the LM test rejects H0, then autocorrelation exists.

Range of DW: 0 to 4

If DW = 2: there is no autocorrelation.

If DW < 2: there is + autocorrelation.

If DW > 2: there is − autocorrelation.

In the example there is + autocorrelation.

To check whether it is severe, we can apply the serial correlation LM test.

The null hypothesis (H0) of the serial correlation LM test is that there is no autocorrelation.

If prob value > 0.05, we cannot reject H0. This shows no autocorrelation.

If prob value < 0.05, we can reject H0. This shows autocorrelation in the regression model.
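
The Durbin-Watson statistic can be computed directly from the regression residuals as the sum of squared successive differences divided by the sum of squared residuals. The sketch below uses made-up residual series; the function names durbin_watson and read_dw are illustrative, and the reading rule is the simple one stated on the slide, not a formal test with critical values.

```python
def durbin_watson(residuals):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2); lies between 0 and 4."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2
        for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


def read_dw(dw):
    """Slide rule of thumb: below 2 is + auto, above 2 is - auto."""
    if dw < 2:
        return "positive autocorrelation"
    if dw > 2:
        return "negative autocorrelation"
    return "no autocorrelation"


# Made-up residuals: long runs of the same sign
same_sign_runs = [1, 1, 1, -1, -1, -1]
# Made-up residuals: sign flips every period
alternating = [1, -1, 1, -1, 1, -1]

dw_runs = durbin_watson(same_sign_runs)   # 4/6, about 0.67
dw_alt = durbin_watson(alternating)       # 20/6, about 3.33
print(dw_runs, read_dw(dw_runs))
print(dw_alt, read_dw(dw_alt))
```

The serial correlation LM test would still be needed to judge whether the departure from 2 is statistically significant.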

Removal

1. Bring the omitted variable within the model.

Z is to be added to the model.

This will make the error random, with no trend, so the autocorrelation can be removed.

Theory and the literature will guide you on which variables to include in the model.

2. Cochrane-Orcutt method / GLS

3. AR(1)

[GLS and AR(1) remove autocorrelation, but the form of the variable changes and so does the interpretation.]

4. If the sample size is large and no more variables are being added, then use HAC [heteroscedasticity and autocorrelation consistent] standard errors.

HAC adjusts for the autocorrelation, so the prob values and results can be relied on.

P > 0.05: accept H0, no autocorrelation.

Removal

• Addition of a relevant variable

• HAC [click Estimate, then Options]

Assumption # 6
Heteroscedasticity
• It mostly happens in cross-sectional data.
• The assumption is that the error term has a constant variance.
• What is constant variance?
• For example, we have nominal data
• Income of different groups
• Low, medium, and high income
• There will be many individuals in one group
• Multiple consumption points in one group

There are different income groups and we observe their consumption. Then we draw a regression line. There will be some errors, i.e. differences between actual and estimated values. If we plot the errors for each group, they will have a certain distribution. Plot them vertically. If the variance is constant for all groups, this is called HOMOSCEDASTICITY.

If the variances are not constant, the assumption is violated; this is called HETEROSCEDASTICITY.

For example: there are three cities with low to high populations. Each city will have different tax revenues. A larger population brings more variance; with a smaller population the deviation will be less. So there must be a factor due to which the variance is not constant. The factor may or may not be part of the model, and it is called the "PROPORTIONAL FACTOR".

Coefficient b has a distribution with some mean and standard error. A changing variance in the error term doesn't change the mean value, but it changes the deviation. Beta remains the same; however, its variation might increase or decrease, which changes its distribution, as with autocorrelation. So there will be a problem with the significance of a variable, the way it happens with AUTOCORRELATION. Estimates are not reliable.

Take cross-sectional data: for example, there are 100 countries and we want to find the relationship between consumption and income. Every entity will have an error, just as in time series there is an error for every time period. If the changing variance is due to population, a plot of the errors against population will fan out: as population increases, the error increases. The factor can be anything, e.g. income.

CORRELATION

REGRESSION

Plot Data / Draw Graphs
• plot gdp fdi inf smc to [then press Enter]

Summary statistics
• cor gdp fdi inf smc to [then press Enter]

Histogram
• hist gdp [then press Enter]

Regression
• ls gdp c fdi inf smc to [then press Enter]

Regression

Stata

Importing The Data

REGRESSION
• regress GDP INF TO SMC FDI [then press Enter]

Summary Statistics

Summarize

Correlation

Panel Data Analysis in Stata

• Basics about panel Data
• Model estimation
• Diagnostic check
• Selection of appropriate model

Models of Panel Data

Copy and Paste Directly
• You can copy from Excel and paste into Stata.
• First copy from Excel, then click on the Data Editor in Stata and simply paste the data.

Analysis

Pooled OLS

Normality Test

Skewness and kurtosis should lie between −1 and +1.
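
A rough Python sketch of this rule of thumb. It assumes the ±1 range refers to excess kurtosis (kurtosis minus 3, so a normal distribution gives 0); that reading is an assumption, since some packages report raw kurtosis, for which a normal distribution gives 3. The function name and the sample values are made up for illustration.

```python
def skewness_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis of a sample."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance (population form)
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3.0  # assumption: the rule uses excess kurtosis
    return skew, excess_kurt


# Made-up symmetric sample
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5]
skew, excess_kurt = skewness_and_excess_kurtosis(sample)
within_rule = -1 < skew < 1 and -1 < excess_kurt < 1
print(skew, excess_kurt, within_rule)  # 0.0, about -0.75, True
```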

Multicollinearity Test

Multicollinearity
VIF should be below the chosen cut-off (3.3, 5, or 10).

Heteroscedasticity

If the probability value > 0.05, we can't reject the null hypothesis, which means that there is no heteroscedasticity.

• Every cross section has some unique characteristics; therefore pooled regression is not a suitable choice.

• Moving towards:
  • Fixed effect models
  • Random effect models

• The first step is to declare the data as panel data in Stata.

Can use syntax command

xtreg Profitabilty Liquidity Leverage Activity, fe

We need to store the results first before moving on to the random effect model.

Random Effect Model

Can use syntax command

xtreg Profitabilty Liquidity Leverage Activity, re

• Final Model Selection Criterion

• Fixed or random

Hausman Test

If the probability value < 0.05, the fixed effect model is the appropriate model: we reject the null hypothesis that the random effect model is appropriate.

How to Import Results

Install asdoc using this command

Click on this and the command will be generated; add asdoc at the beginning of the command and press Enter.

Command for correlation

Command for correlation in Word file

Graphs

Analysis in R

Interface

Import Data

Attach data

• attach(Time_Series_Data)

Draw Graph
plot(gdp,fdi,main="graph")

Correlation
cor(gdp,fdi)

Multiple Regression

Download Data from WDI & Perform Panel Data Analysis
CLEANING OF THE DATA
Replace double dots with a single dot. Double dots mark missing values, and Stata doesn't recognise them.
The data is in string form, so we need to "destring" all the variables.
Run the "destring" command for all the variables.
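
The same cleaning idea can be sketched in Python, for readers who want to see the logic outside Stata. This is only an illustration: the function name destring mirrors the Stata command but is hypothetical here, the rows are made-up placeholders rather than real WDI figures, and missing values are stored as None instead of Stata's single dot.

```python
def destring(cell):
    """Convert one text cell to a number; '..' or blank becomes missing."""
    text = cell.strip()
    if text in ("..", ""):
        return None  # missing value (Stata would show a single dot)
    return float(text)


# Made-up placeholder rows in the string form a WDI download arrives in
raw_rows = [
    {"country": "Country A", "year": "2019", "gdp_growth": "2.5"},
    {"country": "Country A", "year": "2020", "gdp_growth": ".."},
    {"country": "Country B", "year": "2019", "gdp_growth": "4.1"},
]

numeric_columns = ["year", "gdp_growth"]  # columns to convert from text
cleaned = []
for row in raw_rows:
    new_row = dict(row)
    for column in numeric_columns:
        new_row[column] = destring(row[column])
    cleaned.append(new_row)

for row in cleaned:
    print(row)
```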
PANEL DATA ANALYSIS
Setting the data as panel
Fixed effect
Save Results
Random Effect
Save Results
To choose between fixed or random

If the probability value < 0.05, the fixed effect model is the appropriate model: we reject the null hypothesis that the random effect model is appropriate.
The End