
FINANCIAL

MODELLING AND
DERIVATIVES
PRACTICAL FILE
NAME – MUSKAN ARORA

ROLL NUMBER – 2020UBA9015

COURSE – BACHELOR OF BUSINESS ADMINISTRATION

SEMESTER - 6

INDEX

S.NO    NAME OF THE EXPERIMENT                                                                TEACHER'S REMARK

1       To write the regression model and its advantages
2       To make an ANOVA table and calculate all missing values
3       To calculate the alpha and beta values in any regression model
4       What do you understand by R-square and adjusted R-square? Explain them using formulas and practicals shared in class
5       Write down the assumptions of regression analysis
6       Using any statistical tool, justify the linearity assumption of regression analysis
7       Using any statistical tool, justify the multicollinearity assumption of regression analysis
8       Using any statistical tool, justify the autocorrelation assumption of regression analysis
9       Using any statistical tool, justify the heteroscedasticity assumption of regression analysis
EXPERIMENT – 1

AIM – To write the regression model and its advantages


Introduction/Theory
A regression model is a statistical method used to analyze the relationship between
a dependent variable and one or more independent variables. The purpose is to find
a mathematical equation that can be used to predict the value of the dependent
variable based on the values of the independent variables.

The advantages of a regression model include:

1. Predictive Power: A regression model can be used to make predictions about future outcomes. For example, a regression model could be used to predict how much a company's revenue will increase in the next quarter based on historical data.

2. Quantifiable Results: Regression models produce quantifiable results, which can be useful for decision-making. The model provides an equation that can be used to make predictions, and it also provides measures of the uncertainty associated with those predictions.

3. Statistical Significance: Regression models test for statistical significance, which means they can determine whether the relationship between the independent and dependent variables is real or just due to chance.
STEPS
1. Enter your data into Excel.
2. Enable the Data Analysis ToolPak add-in.
3. Open "Data Analysis" to reveal the dialog box.
4. Enter variable data.
5. Select output options.
6. Analyze your results.
7. Create a scatter plot.
8. Add regression trendline.

In Excel, we use regression analysis to estimate the relationships between two or more variables. There are two basic terms that you need to be familiar with:

The Dependent Variable is the factor you are trying to predict.

The Independent Variable is the factor that might influence the dependent variable.

Consider the following data where we have a number of COVID cases and masks
sold in a particular month.

 Go to the Data tab > Analysis group > Data Analysis.
 Select Regression and click OK.

The following argument window will open. Select the Input Y Range as the number of masks sold and the Input X Range as the COVID cases. Check Residuals and click OK.

You will get the following output:


Summary Output

The summary output tells you how well the calculated linear regression equation
fits your data source.

The Multiple R is the Correlation Coefficient that measures the strength of a linear
relationship between two variables. The larger the absolute value, the stronger is
the relationship.
 1 means a strong positive relationship
 -1 means a strong negative relationship
 0 means no relationship at all
R Square signifies the Coefficient of Determination, which shows the goodness of fit, i.e., how closely the data points fall around the regression line. In our example, the value of R Square is 0.96, which is an excellent fit. In other words, 96% of the variation in the dependent variable (y-values) is explained by the independent variable (x-values).

Adjusted R Square is the modified version of R Square that adjusts for the number of predictors in the model, penalising predictors that do not add explanatory power.

Standard Error is another goodness-of-fit measure that shows the precision of your
regression analysis.

ANOVA

ANOVA stands for Analysis of Variance. It gives information about the levels of
variability within your regression model.

 df is the number of degrees of freedom associated with each source of variance.
 SS is the sum of squares. The smaller the Residual SS relative to the Total SS, the better the model fits the data.
 MS is the mean square.
 F is the F statistic, used to test the null hypothesis of overall model significance.
 Significance F is the p-value of F.
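For readers who want to cross-check the Excel output, here is a minimal R sketch of the same analysis; the column names covid_cases and masks_sold are hypothetical stand-ins for the data described above.

# hypothetical data standing in for the Excel columns
covid_cases <- c(120, 150, 180, 210, 260, 310, 370, 450)
masks_sold  <- c(300, 360, 430, 520, 640, 760, 900, 1100)

fit <- lm(masks_sold ~ covid_cases)
summary(fit)   # gives Multiple R-squared, Adjusted R-squared, the standard error,
               # and the F statistic with its p-value (Significance F)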
Graph of Regression Analysis
EXPERIMENT – 2

AIM – To make an ANOVA table and calculate all missing values


Introduction/Theory

The analysis of variance, or ANOVA, is among the most popular methods for
analyzing how an outcome variable differs between groups, for example, in
observational studies or in experiments with different conditions.

But how do we conduct the ANOVA when there are missing data? In this post, I
show how to deal with missing data in between- and within-subject designs using
multiple imputation (MI) in R.

The ANOVA model

In the one-factorial ANOVA, the goal is to investigate whether two or more groups
differ with respect to some outcome variable y. The statistical model can be written
as

y_ij = μ_j + e_ij,

where y_ij denotes the value of y for person i in group j, and μ_j is the mean in group j. The (omnibus) null hypothesis of the ANOVA states that all groups have identical population means. For three groups, this would mean that

μ_1 = μ_2 = μ_3.

This hypothesis is tested by looking at whether the differences between groups are
larger than what could be expected from the differences within groups. If this is the
case, then we reject the null, and the group means are said to be “significantly”
different from one another.

In the following, we will look at how this hypothesis can be tested when the
outcome variable contains missing data. Let’s illustrate this with an example.
Example 1: between-subjects ANOVA

For this example, I simulated some data according to a between-subjects design with three groups, n = 50 subjects per group, and a "medium" effect size of f = .25, which roughly corresponds to an R² = 6.8% (Cohen, 1988).

You can download the data from this post if you want to reproduce the results
(CSV, Rdata). Here are the first few rows.

The three variables mean the following:

 group: the grouping variable
 y: the outcome variable (with 20.7% missing data)
 x: an additional covariate

In this example, cases with lower values in x had a higher chance of missing data in y. Because x is also positively correlated with y, this means that smaller y values are missing more often than larger ones.

Listwise deletion

Let's see what happens if we run the ANOVA only with those cases that have y observed (i.e., listwise deletion). This is the default setting in most statistical software.

In R, the ANOVA can be conducted with the lm() function as follows.
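The call itself is shown below as a minimal sketch, assuming the data set has been loaded into a data frame named dat (the same name used for the imputation step later).

# ANOVA via lm(); rows with missing y are dropped by default (listwise deletion)
fit1 <- lm(y ~ 1 + group, data = dat)
summary(fit1)   # the F-test of the group effect appears at the bottom of the output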

In this example, the F-test at the bottom of the output indicates that the group means are not significantly different from one another, F(2, 116) = 1.361 (p = 0.26). In addition, the effect size (R² = 0.023) is quite a bit smaller than what was used to generate the data.

In fact, this result is a direct consequence of how the missing data were simulated.
Fortunately, there are statistical methods that can account for the missing data and
help us obtain more trustworthy results.

Multiple imputation

One of the most effective ways of dealing with missing data is multiple imputation
(MI). Using MI, we can create multiple plausible replacements of the missing data,
given what we have observed and a statistical model (the imputation model).

In the ANOVA, using MI has the additional benefit that it allows covariates to be taken into account that are relevant for the missing data but not for the analysis. In this example, x is a direct cause of missing data in y. Therefore, we must take x into account when making inferences about y in the ANOVA.

Running MI consists of three steps. First, the missing data are imputed multiple
times. Second, the imputed data sets are analyzed separately. Third, the parameter
estimates and hypothesis tests are pooled to form a final set of estimates and
inferences.

For this example, we will use the mice and mitml packages to conduct MI.

library(mice)
library(mitml)

Specifying an imputation model is very simple here. With the following command, we generate 100 imputations for y on the basis of a regression model with both group and x as predictors and a normal error term.
# run MI
imp <- mice(data = dat, method = "norm", m = 100)

The imputed data sets can then be saved as a list, containing 100 copies of the
original data, in which the missing data have been replaced by different
imputations.

# create a list of completed data sets
implist <- mids2mitml.list(imp)

Finally, we fit the ANOVA model to each of the imputed data sets and pool the results. The analysis part is done with the with() command, which applies the same linear model, lm(), to each data set. The pooling is then done with the testEstimates() function.
# fit the ANOVA model
fit2 <- with(implist, lm(y ~ 1 + group))
# pool the parameter estimates
testEstimates(fit2)
#
# Call:
#
# testEstimates(model = fit2)
#
# Final parameter estimates and inferences obtained from 100 imputed data sets.
#
# Estimate Std.Error t.value df P(>|t|) RIV FMI
# (Intercept) 0.027 0.144 0.190 3178.456 0.850 0.214 0.177
# groupB 0.207 0.198 1.044 5853.312 0.297 0.149 0.130
# groupC -0.333 0.208 -1.600 2213.214 0.110 0.268 0.212
#
# Unadjusted hypothesis test as appropriate in larger samples.

Notice that we did not need to actually include x in the ANOVA. Rather, it was enough to include x in the imputation model, after which the analyses proceeded as usual.

We have now estimated the regression coefficients in the ANOVA model (i.e., the differences between group means), but we have yet to decide whether the means are all equal or not. To this end, we use a pooled version of the F-test above, which consists of a comparison of the full model (the ANOVA model) with a reduced model that does not contain the coefficients we wish to test.

In this case, we wish to test the coefficients pertaining to the differences between groups, so the reduced model does not contain group as a predictor.
# fit the reduced ANOVA model (without 'group')
fit2.reduced <- with(implist, lm(y ~ 1))

The full and the reduced model can then be compared with the pooled version of the F-test (i.e., the Wald test), which is known in the literature as D1.

# compare the two models with pooled Wald test
testModels(fit2, fit2.reduced, method = "D1")
#
# Call:
#
# testModels(model = fit2, null.model = fit2.reduced, method = "D1")
#
# Model comparison calculated from 100 imputed data sets.
# Combination method: D1
#
# F.value df1 df2 P(>F) RIV
# 3.635 2 7186.601 0.026 0.195
#
# Unadjusted hypothesis test as appropriate in larger samples.

In contrast with listwise deletion, the F-test under MI indicates that the groups are significantly different from one another.

This is because MI makes use of all the observed data, including the covariate x, and uses this information to generate replacements for the missing y values that take their relation with x into account. To see this, it is worth looking at a comparison of the observed and the imputed data.

The difference is not extreme, but it is easy to see that the imputed data tend to have more mass at the lower end of the distribution of y (especially in groups A and C).

This is again a result of how the data were simulated: lower y values, through their relation with x, are missing more often, which is accounted for using MI. Conversely, using listwise deletion placed the group means more closely together than they should be, and this affected the results in the ANOVA.
EXPERIMENT – 3

AIM – To calculate the alpha and beta values in any regression model

Introduction/Theory
A regression model is a statistical method used to analyze the relationship between
a dependent variable and one or more independent variables. The purpose is to find
a mathematical equation that can be used to predict the value of the dependent
variable based on the values of the independent variables.

STEPS
1) Enter the data into Excel.
2) Select the data and use the SLOPE() function to find the beta, as shown in the figure below.
3) You will get the slope (beta).
4) Use the INTERCEPT() function to find the alpha, as shown below.

5) You will get the slope (beta) and the intercept (alpha).

Alternative Method

1) Enter your data in Excel.
2) Click on Data Analysis (from the Analysis ToolPak).
3) A small window appears as shown.
4) Select Regression and click OK. This brings up a window for you to fill in the regression parameters and options.
5) Select the input and output ranges (and a new worksheet, if you want) and click OK.
6) You will get the following results. The alpha is the Intercept coefficient and the beta is the X Variable 1 coefficient.
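The same values can be cross-checked outside Excel. Below is a minimal R sketch using two hypothetical vectors in place of the worksheet columns.

# hypothetical data standing in for the Excel columns
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

fit <- lm(y ~ x)          # simple linear regression y = alpha + beta * x
coef(fit)[1]              # intercept (alpha), equivalent to Excel's INTERCEPT()
coef(fit)[2]              # slope (beta), equivalent to Excel's SLOPE()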
EXPERIMENT – 4

AIM – To understand the concepts of R-square and adjusted R-square and to explain them using the formulae and practicals shared in class.

Introduction/ Theory

The R-squared statistic, or coefficient of determination, is a scale-invariant statistic that gives the proportion of variation in the target variable explained by the linear regression model.

This might seem a little complicated, so let me break it down. In order to determine the proportion of target variation explained by the model, we first need to determine the following:

1. Total Sum of Squares

The total variation in the target variable is the sum of squares of the differences between the actual values and their mean.

TSS, or the Total Sum of Squares, gives the total variation in Y. It is very similar to the variance of Y: while the variance is the average of the squared differences between the actual values and their mean, TSS is the total of those squared differences.

Now that we know the total variation in the target variable, how do we determine the proportion of this variation explained by our model? We turn to RSS.

2. Residual Sum of Squares

RSS gives us the total of the squared distances of the actual points from the regression line. If we focus on a single residual, we can say that it is the part of the variation that is not captured by the regression line. Therefore, RSS as a whole gives us the variation in the target variable that is not explained by our model.

3. Calculate R-Squared

Now, if TSS gives us the total variation in Y, and RSS gives us the variation in Y not explained by X, then TSS - RSS gives us the variation in Y that is explained by our model. We can simply divide this value by TSS to get the proportion of variation in Y that is explained by the model. And this is our R-squared statistic!

R-squared = (TSS - RSS) / TSS
          = Explained variation / Total variation
          = 1 - Unexplained variation / Total variation
So R-squared gives the degree of variability in the target variable that is explained
by the model or the independent variables. If this value is 0.7, then it means that the
independent variables explain 70% of the variation in the target variable.

R-squared value always lies between 0 and 1. A higher R-squared value indicates a
higher amount of variability being explained by our model and vice-versa.

If we had a really low RSS value, it would mean that the regression line was very
close to the actual points. This means the independent variables explain the majority
of variation in the target variable. In such a case, we would have a really high R-
squared value.
On the contrary, if we had a really high RSS value, it would mean that the regression
line was far away from the actual points. Thus, independent variables fail to explain
the majority of variation in the target variable. This would give us a really low R-
squared value.

So, this explains why the R-squared value gives us the proportion of variation in the target variable that is explained by the variation in the independent variables.

Problems with R-squared statistic


The R-squared statistic isn’t perfect. In fact, it suffers from a major flaw. Its value
never decreases no matter the number of variables we add to our regression model.
That is, even if we are adding redundant variables to the data, the value of R-squared
does not decrease. It either remains the same or increases with the addition of new
independent variables. This clearly does not make sense because some of the
independent variables might not be useful in determining the target variable.
Adjusted R-squared deals with this issue.

Adjusted R-squared statistic

The Adjusted R-squared takes into account the number of independent variables used for predicting the target variable. In doing so, we can determine whether adding new variables to the model actually increases the model fit.

Let's have a look at the formula for adjusted R-squared to better understand how it works:

Adjusted R-squared = 1 - [(1 - R²)(n - 1) / (n - k - 1)]

Here,

 n represents the number of data points in our dataset,
 k represents the number of independent variables, and
 R² represents the R-squared value determined by the model.

So, if R-squared does not increase significantly on the addition of a new independent
variable, then the value of Adjusted R-squared will actually decrease.

On the other hand, if on adding the new independent variable we see a significant
increase in R-squared value, then the Adjusted R-squared value will also increase.

We can see the difference between R-squared and Adjusted R-squared values if we
add a random independent variable to our model.
As you can see, adding a random independent variable did not help in explaining the variation in the target variable. Our R-squared value remains the same, giving us a false indication that this variable might be helpful in predicting the output. However, the Adjusted R-squared value decreased, which indicates that this new variable is actually not capturing the trend in the target variable.

Clearly, it is better to use Adjusted R-squared when there are multiple variables in
the regression model. This would allow us to compare models with differing
numbers of independent variables.
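A minimal R sketch of this behaviour, on simulated data (all variable names here are hypothetical): adding an irrelevant predictor leaves R-squared unchanged or slightly higher, while the adjusted R-squared drops.

set.seed(1)
n  <- 50
x1 <- rnorm(n)
y  <- 2 + 3 * x1 + rnorm(n)      # y genuinely depends on x1
x2 <- rnorm(n)                   # a random, irrelevant predictor

fit1 <- lm(y ~ x1)
fit2 <- lm(y ~ x1 + x2)

summary(fit1)$r.squared; summary(fit1)$adj.r.squared
summary(fit2)$r.squared; summary(fit2)$adj.r.squared
# R-squared never decreases when x2 is added,
# but the adjusted R-squared typically decreases because x2 adds nothing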

Step 1: Create the Data

For this example, we’ll create a dataset that contains the following variables for 12
different students:

 Exam Score
 Hours Spent Studying
 Current Grade
Step 2: Fit the Regression Model

Next, we’ll fit a multiple linear regression model using Exam Score as the response
variable and Study Hours and Current Grade as the predictor variables.

To fit this model, click the Data tab along the top ribbon and then click Data
Analysis:

If you don’t see this option available, you need to first load the Data Analysis
ToolPak.

In the window that pops up, select Regression. In the new window that appears,
fill in the following information:
Once you click OK, the output of the regression model will appear:
Step 3: Interpret the Adjusted R-Squared

The adjusted R-squared of the regression model is the number next to Adjusted R
Square:
The adjusted R-squared for this model turns out to be 0.946019.

This value is extremely high, which indicates that the predictor variables Study
Hours and Current Grade do a good job of predicting Exam Score.
Calculation of R Square

Suppose we have the values below for x and y, and we want to add the R-squared value to the regression.

Figure 3. Sample data for R squared value

How to find the R² value

There are two methods to find the R-squared value:

 Calculate r using CORREL, then square the value
 Calculate R-squared directly using RSQ

Enter the following formulas into our worksheets:

 In cell G3, enter the formula =CORREL(B3:B7,C3:C7)


 In cell G4, enter the formula =G3^2
 In cell G5, enter the formula =RSQ(C3:C7,B3:B7)

Figure 4. Output: How to find R squared value


The results in G4 and G5 show that both methods have the same result for R
squared which is 0.100443671. With Excel, adding the R squared value is very
easy with the help of the functions CORREL and RSQ.
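The equivalence of the two methods can also be checked outside Excel; a minimal R sketch with two hypothetical vectors standing in for columns B3:B7 and C3:C7:

x <- c(4, 7, 3, 9, 5)            # hypothetical values standing in for B3:B7
y <- c(6, 2, 8, 5, 7)            # hypothetical values standing in for C3:C7

r <- cor(x, y)                   # equivalent to Excel's CORREL()
r^2                              # squaring r gives R-squared
summary(lm(y ~ x))$r.squared     # same value, equivalent to Excel's RSQ()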

R-squared is relevant in various fields, such as the stock market and mutual funds, because it quantifies the relationship between two variables and tells us how much of the movement of one variable can be explained by the movement of another.
EXPERIMENT – 5

AIM – To write down the assumptions of regression analysis


Introduction/Theory

Regression is a parametric approach. 'Parametric' means it makes assumptions about the data for the purpose of analysis. Due to its parametric nature, regression is restrictive: it fails to deliver good results with data sets that do not fulfil its assumptions. Therefore, for a successful regression analysis, it is essential to validate these assumptions.
Assumptions in Regression Analysis
1. The Regression Model should be Linear in its Coefficients as well as the Error Term
In our case, the model takes the form
Y = β0 + β1X1 + β2X2 + β3X3 + ε, where ε is the error term.

2. The Error Term should have a Population Mean of Zero

The error term is critical because it accounts for the variation in the dependent variable that the independent variables do not explain. Therefore, the average value of the error term should be as close to zero as possible for the model to be unbiased.

3. All the Independent Variables in the Equation are Uncorrelated with the Error Term
If there is a correlation between an independent variable and the error term, it becomes easy to predict the error term. This violates the principle that the error term represents an unpredictable random error. Therefore, none of the independent variables should correlate with the error term.
4. Observations of the Error Term should also have No Relation with each other
The rule is such that one observation of the error term should not allow us to
predict the next observation.

5. The Error Term should be Homoscedastic (it should have a constant variance)

This assumption of the classical linear regression model entails that the variation of the error term should be consistent for all observations. Plotting the residuals versus the fitted values enables us to check this assumption.

6. None of the Independent Variables should be a Linear Function of the other Variables

When two variables move in a fixed proportion, it is referred to as perfect correlation. For example, any change in the Centigrade value of the temperature will bring about a corresponding change in the Fahrenheit value. This assumption of the classical linear regression model states that the independent variables should not have a direct linear relationship amongst themselves.

7. There Should be No Autocorrelation in the Data


One of the critical assumptions of multiple linear regression is that there should be no autocorrelation in the data. When the residuals are dependent on each other, there is autocorrelation. This is visible in the case of stock prices, where the price of a stock is not independent of its previous price.
Plotting the variables on a graph like a scatterplot allows you to check for autocorrelation, if any. Another way to verify the existence of autocorrelation is the Durbin-Watson test.
8. There Should be No Multicollinearity in the Data
Another critical assumption of multiple linear regression is that there should not be
much multicollinearity in the data. Such a situation can arise when the independent
variables are too highly correlated with each other.
In our example, the variable data has a relationship, but they do not have much
collinearity. There could be students who would have secured higher marks in spite
of engaging in social media for a longer duration than the others.
Similarly, there could be students with lesser scores in spite of sleeping for lesser
time. The point is that there is a relationship but not a multicollinear one.
If you still find some amount of multicollinearity in the data, the best solution is to
remove the variables that have a high variance inflation factor.
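Several of these assumptions can be checked together in one pass. The following is a minimal R sketch on simulated data; the lmtest and car packages used here are one common choice, not the only one.

library(lmtest)   # for dwtest()
library(car)      # for vif()

# hypothetical data
set.seed(42)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

plot(fitted(fit), resid(fit))   # homoscedasticity/linearity: look for random scatter
dwtest(fit)                     # autocorrelation: Durbin-Watson test
vif(fit)                        # multicollinearity: variance inflation factors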
EXPERIMENT – 6

Aim – To use any statistical tool to justify the linearity assumption of regression analysis

Introduction/ Theory

The linearity test is one of the assumption tests in linear regression using the ordinary least squares (OLS) method. The objective of the linearity test is to determine whether the distribution of the data of the dependent variable and the independent variable forms a linear pattern.

The linearity assumption must be fulfilled because the regression used is linear regression. In the linearity assumption test, you examine the distribution of the data between the dependent variable and the independent variable.

STEPS

The data we use for this exercise can be seen in the table below.

In STATA, you will find several icons. Select the table icon with a pencil drawing (the Data Editor). In the next step, input all the data shown above: data for the rice consumption variable (Y) goes in the first column, and data for the income (X1) and population (X2) variables go in the second and third columns.

To test linearity in linear regression, I will use a scatter plot graph. In creating a
scatter plot graph between rice consumption (Y) and income (X1), you type in the
command in STATA as follows:

twoway (scatter Y X1)

Next, you can press enter, and the scatter plot results of the linearity test between
rice consumption (Y) and income (X1) can be seen below:
In creating a scatter plot graph between rice consumption (Y) and population (X2),
type in the command in STATA as follows:

twoway (scatter Y X2)

You can press enter, and the scatter plot results of the linearity test between rice consumption (Y) and population (X2) can be seen below:
Results

Based on the scatter plot graph for the rice consumption variable with the income
variable, we can see that the data distribution forms a linear trend line. The linear
line is formed from the bottom left to the top right (positive linear line).

The same thing also happens for the scatter plot graph for the rice consumption
variable with the population variable. We can see that the data distribution forms a
positive linear trend.

Based on the results of the linearity test using a scatter plot, we can conclude that
the regression model has fulfilled the linearity assumption. Therefore, it is correct
that we choose to use linear regression.
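For readers working outside STATA, the same visual check can be sketched in R; the data below are hypothetical stand-ins for Y, X1 and X2.

# hypothetical data: Y = rice consumption, X1 = income, X2 = population
set.seed(7)
X1 <- runif(20, 2, 10)
X2 <- runif(20, 1, 5)
Y  <- 1 + 0.8 * X1 + 1.5 * X2 + rnorm(20)

plot(X1, Y); abline(lm(Y ~ X1))   # look for an upward, roughly linear cloud
plot(X2, Y); abline(lm(Y ~ X2))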
EXPERIMENT – 7

Aim – To use any statistical tool to justify the multicollinearity assumption of regression analysis

Introduction/Theory

Multicollinearity occurs when independent variables in a regression model are correlated. This correlation is a problem because independent variables should be independent. If the degree of correlation between variables is high enough, it can cause problems when you fit the model and interpret the results.

STEPS

1) Open the Microsoft Excel file where the data is stored.
2) If you do not have the Data Analysis option in your Microsoft Excel, navigate to the highlighted menu (File > Options > Add-ins).
3) Click on Add-ins, navigate to Analysis ToolPak and click GO.
4) Tick Analysis ToolPak, Analysis ToolPak - VBA and Solver Add-in, then click OK.

You will now observe the Data Analysis option under the Data tab in the Microsoft Excel ribbon.

Click on the Data tab; this would appear:


5) Arrange the data in the order in which it will be regressed in the Microsoft Excel file, with each arrangement in a separate Excel sheet.

INFR, UNEMP, EXR and FDI are the explanatory variables in this study.

INFR = f(UNEMP, EXR, FDI) .................... (1)
UNEMP = f(EXR, FDI, INFR) .................... (2)
FDI = f(INFR, UNEMP, EXR) .................... (3)
EXR = f(INFR, UNEMP, FDI) .................... (4)

EQUATION 1:

Equation 1 examines the degree of multicollinearity between INFR and the other explanatory variables. INFR is used as the dependent variable while the other variables are retained as explanatory variables.

INFR would come first in the Microsoft Excel file, followed by the other explanatory variables, as below:

EQUATION 2:

Equation 2 examines the degree of multicollinearity between UNEMP and the other explanatory variables. UNEMP is used as the dependent variable while the other variables are retained as explanatory variables.

UNEMP would come first in the Microsoft Excel file, followed by the other explanatory variables, as below:

EQUATION 3:

Equation 3 examines the degree of multicollinearity between FDI and the other explanatory variables. FDI is used as the dependent variable while the other variables are retained as explanatory variables.

EQUATION 4:

Equation 4 examines the degree of multicollinearity between EXR and the other explanatory variables. EXR is used as the dependent variable while the other variables are retained as explanatory variables.

EXR would come first in the Microsoft Excel file, followed by the other explanatory variables, as below:


6) Regress each explanatory variable on the other explanatory variables. That is, one explanatory variable is used as the dependent variable while the remaining explanatory variables are retained as regressors, following equations (1) to (4) above.

Navigate: Data >> Data Analysis >> Regression

Click OK, and this window displays:
7) EQUATION 1: INFR = f(UNEMP, EXR, FDI)

Put the mouse cursor in the Input Y Range and highlight the Y variable, which is the INFR column; do the same for the X variables, which are UNEMP, EXR and FDI.

Click OK and the regression result shows. The same procedure is followed for the other variables.

EQUATION 2

EQUATION 3
EQUATION 4

8) Collect the R² of each equation/variable and put it in a table.

EQUATION   VARIABLE   R²
1          INFR       0.363227848
2          UNEMP      0.930691032
3          FDI        0.208836755
4          EXR        0.937053465

9) Compute 1 - R².

EQUATION   VARIABLE   R²            1 - R²
1          INFR       0.363227848   0.636772
2          UNEMP      0.930691032   0.069309
3          FDI        0.208836755   0.791163
4          EXR        0.937053465   0.062947

10) Compute the Variance Inflation Factor (VIF) for each variable and interpret.

VIF = 1 / (1 - R²)

EQUATION   VARIABLE   R²            1 - R²     VIF
1          INFR       0.363227848   0.636772   1.57042
2          UNEMP      0.930691032   0.069309   14.42815
3          FDI        0.208836755   0.791163   1.263962
4          EXR        0.937053465   0.062947   15.8865
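The same VIF computation can be reproduced directly from the R² values collected above; a minimal R sketch:

r2  <- c(INFR = 0.363227848, UNEMP = 0.930691032,
         FDI  = 0.208836755, EXR  = 0.937053465)
vif <- 1 / (1 - r2)    # VIF = 1 / (1 - R^2) for each auxiliary regression
round(vif, 5)
# VIFs above 10 (here UNEMP and EXR) indicate high multicollinearity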

Decision:

EQUATION   VARIABLE   R²            1 - R²     VIF        Decision
1          INFR       0.363227848   0.636772   1.57042    VIF < 5: there is little or no evidence of multicollinearity of INFR with the other explanatory variables.
2          UNEMP      0.930691032   0.069309   14.42815   VIF > 10: there is evidence of high multicollinearity of UNEMP with the other explanatory variables.
3          FDI        0.208836755   0.791163   1.263962   VIF < 5: there is little or no evidence of multicollinearity of FDI with the other explanatory variables.
4          EXR        0.937053465   0.062947   15.8865    VIF > 10: there is evidence of high multicollinearity of EXR with the other explanatory variables.
EXPERIMENT – 8

AIM – Using any statistical tool, justify the autocorrelation assumption of regression analysis.
Introduction/Theory

Autocorrelation is a mathematical representation of the degree of similarity between a given time series and a lagged version of itself over successive time intervals. It is conceptually similar to the correlation between two different time series, but autocorrelation uses the same time series twice: once in its original form and once lagged by one or more time periods.

For example, if it's rainy today, the data suggests that it's more likely to rain
tomorrow than if it's clear today. When it comes to investing, a stock might have a
strong positive autocorrelation of returns, suggesting that if it's "up" today, it's
more likely to be up tomorrow, too.

Naturally, autocorrelation can be a useful tool for traders, particularly technical analysts.

STEPS

1. Select one column of data. The sample data shown above is found in QI Macros Help > Open QIMacros Sample Data > XmRChart.xlsx > Autocorrelation tab.
2. Next, click on the QI Macros menu and choose Statistical Tools >
Regression & Other Statistics > AutoCorrelation:
3. Evaluate the autocorrelation results.

NOTE: The maximum number of correlations (lags) is set to 25.

Correlations for lags:
 If a correlation is outside of the confidence limits (e.g., lag 1 = 0.8), the data is autocorrelated at that lag.

In this example:
 Lag 1 has a positive autocorrelation (0.8)
 Lags 5 and 6 have a negative autocorrelation (-0.67 and -0.79)
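The same lag correlations and confidence limits can be reproduced with base R's acf() function; a minimal sketch on a simulated series standing in for the selected column.

# hypothetical series with strong positive autocorrelation at lag 1
set.seed(3)
x <- as.numeric(arima.sim(model = list(ar = 0.8), n = 60))

res <- acf(x, lag.max = 25)   # plots lag correlations with dashed confidence limits
round(res$acf[2], 2)          # autocorrelation at lag 1 (res$acf[1] is lag 0)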
EXPERIMENT – 9
Aim – Using any statistical tool, justify the heteroscedasticity assumption of regression analysis.

Introduction/Theory

The concept of heteroscedasticity - the opposite being homoscedasticity - is used in statistics, especially in the context of linear regression or time series analysis, to describe the case where the variance of the errors of the model is not the same for all observations, whereas one of the basic assumptions in modelling is that the variances are homogeneous and the errors of the model are identically distributed.

In regression analysis, heteroscedasticity (sometimes spelled heteroskedasticity) refers to the unequal scatter of residuals or error terms. Specifically, it refers to the case where there is a systematic change in the spread of the residuals over the range of measured values.

Heteroscedasticity is a problem because ordinary least squares (OLS) regression assumes that the residuals come from a population that has homoscedasticity, which means constant variance.

When heteroscedasticity is present in a regression analysis, the results of the analysis become hard to trust. Specifically, heteroscedasticity increases the variance of the regression coefficient estimates, but the regression model doesn't pick up on this.

This makes it much more likely for a regression model to declare that a term in the
model is statistically significant, when in fact it is not.

The simplest way to detect heteroscedasticity is with a fitted value vs. residual plot.

Once you fit a regression line to a set of data, you can then create a scatterplot that shows the fitted values of the model vs. the residuals of those fitted values.

STEPS

1) Select the data.
2) Go to Insert.
3) Select Charts.
4) Create a scatter plot of the fitted values vs. the residuals.
5) You will see the following results.

Notice how the residuals become much more spread out as the fitted values get
larger. This “cone” shape is a telltale sign of heteroscedasticity.
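The same diagnostic can be sketched in R on simulated heteroscedastic data; the bptest() call from the lmtest package is an optional formal check, not part of the Excel procedure above.

library(lmtest)                        # for bptest()

set.seed(10)
x <- runif(200, 1, 10)
y <- 2 + 3 * x + rnorm(200, sd = x)    # error spread grows with x (heteroscedastic)

fit <- lm(y ~ x)
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")   # look for the "cone" shape
bptest(fit)                            # a small p-value suggests heteroscedasticity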

Heteroscedasticity occurs naturally in datasets where there is a large range of observed data values. For example:

 Consider a dataset that includes the annual income and expenses of 100,000 people across the United States. For individuals with lower incomes, there will be lower variability in the corresponding expenses, since these individuals likely only have enough money to pay for the necessities. For individuals with higher incomes, there will be higher variability in the corresponding expenses, since these individuals have more money to spend if they choose to. Some higher-income individuals will choose to spend most of their income, while some may choose to be frugal and only spend a portion of it, which is why the variability in expenses among these higher-income individuals will inherently be higher.
 Consider a dataset that includes the populations and the count of flower shops in 1,000 different cities across the United States. For cities with small populations, it may be common for only one or two flower shops to be present. But in cities with larger populations, there will be much greater variability in the number of flower shops. These cities may have anywhere between 10 and 100 shops. This means that when we run a regression analysis and use population to predict the number of flower shops, there will inherently be greater variability in the residuals for the cities with higher populations.

How to Fix Heteroscedasticity

There are three common ways to fix heteroscedasticity:

1. Transform the dependent variable

One way to fix heteroscedasticity is to transform the dependent variable in some way. One common transformation is to simply take the log of the dependent variable.

For example, if we are using population size (independent variable) to predict the
number of flower shops in a city (dependent variable), we may instead try to use
population size to predict the log of the number of flower shops in a city.

Using the log of the dependent variable, rather than the original dependent
variable, often causes heteroskedasticity to go away.

2. Redefine the dependent variable

Another way to fix heteroscedasticity is to redefine the dependent variable. One common way to do so is to use a rate for the dependent variable, rather than the raw value.
For example, instead of using the population size to predict the number of flower
shops in a city, we may instead use population size to predict the number of flower
shops per capita.

In most cases, this reduces the variability that naturally occurs among larger
populations since we’re measuring the number of flower shops per person, rather
than the sheer amount of flower shops.

3. Use weighted regression

Another way to fix heteroscedasticity is to use weighted regression. This type of regression assigns a weight to each data point based on the variance of its fitted value.

Essentially, this gives small weights to data points that have higher variances,
which shrinks their squared residuals. When the proper weights are used, this can
eliminate the problem of heteroscedasticity.
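A minimal R sketch of the log-transform and weighted-regression fixes, on hypothetical heteroscedastic data; the weights shown (inverse squared fitted values) are one common choice, not the only one.

set.seed(11)
x <- runif(200, 1, 10)
y <- exp(0.3 + 0.2 * x + rnorm(200, sd = 0.3))   # spread of y grows with x

fit_raw <- lm(y ~ x)                  # heteroscedastic residuals
fit_log <- lm(log(y) ~ x)             # fix 1: transform the dependent variable

w       <- 1 / fitted(fit_raw)^2      # fix 3: weights derived from fitted values
fit_wls <- lm(y ~ x, weights = w)     # weighted regression

plot(fitted(fit_log), resid(fit_log)) # spread should now look roughly constant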

Conclusion

Heteroscedasticity is a fairly common problem when it comes to regression analysis, because so many datasets are inherently prone to non-constant variance.

However, by using a fitted value vs. residual plot, it can be fairly easy to spot heteroscedasticity.

And through transforming the dependent variable, redefining the dependent variable, or using weighted regression, the problem of heteroscedasticity can often be eliminated.
THANK YOU
