MODELLING AND
DERIVATIVES
PRACTICAL FILE
NAME – MUSKAN ARORA
SEMESTER - 6
INDEX
The Independent Variable is the factor that might influence the dependent variable.
Consider the following data, where we have the number of COVID cases and the number of masks sold in a particular month.
The summary output tells you how well the calculated linear regression equation
fits your data source.
The Multiple R is the Correlation Coefficient, which measures the strength of a linear
relationship between two variables. The larger its absolute value, the stronger the
relationship:
1 means a perfect positive relationship
-1 means a perfect negative relationship
0 means no linear relationship at all
R Square signifies the Coefficient of Determination, which shows the goodness of
fit. It indicates how closely the data points fall to the regression line. In our example, the value
of R Square is 0.96, which is an excellent fit. In other words, 96% of the variation in the dependent
variable (y-values) is explained by the independent variables (x-values).
Adjusted R Square is a modified version of R Square that adjusts for the number of predictors
in the model, so that predictors which add nothing to the model do not inflate the statistic.
Standard Error is another goodness-of-fit measure that shows the precision of your
regression analysis.
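As a concrete illustration, here is a small sketch in R (with made-up monthly cases and masks-sold figures, not the numbers from the Excel example) showing where each of these statistics comes from:
# illustrative data: monthly COVID cases (x) and masks sold (y); values are made up
cases <- c(120, 250, 400, 560, 700, 900, 1100, 1300)
masks <- c(1500, 2900, 4100, 5600, 7200, 8800, 11000, 12900)
fit <- lm(masks ~ cases)          # simple linear regression
sqrt(summary(fit)$r.squared)      # Multiple R (here the relationship is positive)
summary(fit)$r.squared            # R Square (coefficient of determination)
summary(fit)$adj.r.squared        # Adjusted R Square
summary(fit)$sigma                # Standard Error of the regression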
ANOVA
ANOVA stands for Analysis of Variance. It gives information about the levels of
variability within your regression model.
The analysis of variance, or ANOVA, is among the most popular methods for
analyzing how an outcome variable differs between groups, for example, in
observational studies or in experiments with different conditions.
But how do we conduct the ANOVA when there are missing data? In this post, I
show how to deal with missing data in between- and within-subject designs using
multiple imputation (MI) in R.
In the one-factorial ANOVA, the goal is to investigate whether two or more groups
differ with respect to some outcome variable y. The statistical model can be written
as
y_ij = μ_j + e_ij,
where y_ij denotes the value of y for person i in group j, and μ_j is the mean in
group j. The (omnibus) null hypothesis of the ANOVA states that all groups have
identical population means. For three groups, this would mean that
μ_1 = μ_2 = μ_3.
This hypothesis is tested by looking at whether the differences between groups are
larger than what could be expected from the differences within groups. If this is the
case, then we reject the null, and the group means are said to be “significantly”
different from one another.
In the following, we will look at how this hypothesis can be tested when the
outcome variable contains missing data. Let’s illustrate this with an example.
Example 1: between-subjects ANOVA
You can download the data from this post if you want to reproduce the results
(CSV, Rdata). Here are the first few rows.
group: the grouping variable
y: the outcome variable (with 20.7% missing data)
x: an additional covariate
Cases with lower values of x had a higher chance of missing data in y. Because x is also
positively correlated with y, this means that smaller y values are missing more often than
larger ones.
Listwise deletion
Let's see what happens if we run the ANOVA only with those cases that have y observed
(i.e., listwise deletion). This is the standard setting in most statistical software. In R, this
is done by fitting the model with the lm() function as follows.
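The exact call is not reproduced above; a minimal sketch, assuming the data are in a data frame named dat with columns y and group (as in the rest of this example), would be:
# fit the ANOVA model to the complete cases only (lm() drops rows with missing y by default)
fit1 <- lm(y ~ 1 + group, data = dat)
summary(fit1)   # the F-statistic at the bottom is the omnibus ANOVA test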
In this example, the F-test at the bottom of the output indicates that the group
means are not significantly different from one another, F(2, 116) = 1.361 (p = 0.26).
In addition, the effect size (R² = 0.023) is quite a bit smaller than the one used to
generate the data.
In fact, this result is a direct consequence of how the missing data were simulated.
Fortunately, there are statistical methods that can account for the missing data and
help us obtain more trustworthy results.
Multiple imputation
One of the most effective ways of dealing with missing data is multiple imputation
(MI). Using MI, we can create multiple plausible replacements of the missing data,
given what we have observed and a statistical model (the imputation model).
In the ANOVA, using MI has the additional benefit that it allows taking covariates
into account that are relevant for the missing data but not for the analysis. In this
example, x is a direct cause of missing data in y. Therefore, we must take x into
account when making inferences about y in the ANOVA.
Running MI consists of three steps. First, the missing data are imputed multiple
times. Second, the imputed data sets are analyzed separately. Third, the parameter
estimates and hypothesis tests are pooled to form a final set of estimates and
inferences.
Specifying an imputation model is very simple here. With the following command,
we generate 100 imputations for y on the basis of a regression model with both
group and x as predictors and a normal error term.
# load the mice package and run MI
library(mice)
imp <- mice(data = dat, method = "norm", m = 100)
The imputed data sets can then be saved as a list, containing 100 copies of the
original data, in which the missing data have been replaced by different
imputations.
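The command for this step is not shown above; one plausible way, assuming the mitml package (which also provides the testEstimates() function used below), is:
# convert the imputations into a list of 100 completed data sets
library(mitml)
implist <- mids2mitml.list(imp)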
Finally, we fit the ANOVA model to each of the imputed data sets and pool the
results. The analysis part is done with the with() command, which applies the same
linear model, lm(), to each data set. The pooling is then done with the
testEstimates() function.
# fit the ANOVA model
fit2 <- with(implist, lm(y ~ 1 + group))
# pool the parameter estimates
testEstimates(fit2)
#
# Call:
#
# testEstimates(model = fit2)
#
# Final parameter estimates and inferences obtained from 100 imputed data sets.
#
# Estimate Std.Error t.value df P(>|t|) RIV FMI
# (Intercept) 0.027 0.144 0.190 3178.456 0.850 0.214 0.177
# groupB 0.207 0.198 1.044 5853.312 0.297 0.149 0.130
# groupC -0.333 0.208 -1.600 2213.214 0.110 0.268 0.212
#
# Unadjusted hypothesis test as appropriate in larger samples.
Notice that we did not need to use x in the ANOVA itself. Rather, it was enough to
include x in the imputation model, after which the analyses proceeded as usual.
We have now estimated the regression coefficients in the ANOVA model (i.e., the
differences between group means), but we have yet to decide whether the means
are all equal or not. To this end, we use a pooled version of the F-test above,
which consists of a comparison of the full model (the ANOVA model) with a
reduced model that does not contain the coefficients we wish to test.
In this case, we wish to test the coefficients pertaining to the differences between
groups, so the reduced model does not contain group as a predictor.
# fit the reduced ANOVA model (without 'group')
fit2.reduced <- with(implist, lm(y ~ 1))
The full and the reduced model can then be compared with the pooled version of
the F-test (i.e., the Wald test), which is known in the literature as D1.
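The pooling command itself is not reproduced above; with the mitml package, the comparison would typically be carried out as follows:
# pooled Wald test (D1) comparing the full model with the reduced model
testModels(fit2, fit2.reduced, method = "D1")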
In contrast with listwise deletion, the pooled F-test under MI indicates that the
groups are significantly different from one another.
This is because MI makes use of all the observed data, including the covariate x,
and uses this information to generate replacements for the missing y values that
take their relation with x into account. To see this, it is worth comparing the
observed and the imputed data.
The difference is not extreme, but it is easy to see that the imputed data tend to
have more mass at the lower end of the distribution of y (especially in groups A
and C). This is because smaller y values, through their relation with x, are missing
more often, which is accounted for by MI. Conversely, listwise deletion placed the
group means closer together than they should be, and this affected the results of
the ANOVA.
EXPERIMENT – 3
AIM – To calculate the alpha and beta values in a regression model
Introduction/Theory
A regression model is a statistical method used to analyze the relationship between
a dependent variable and one or more independent variables. The purpose is to find
a mathematical equation that can be used to predict the value of the dependent
variable based on the values of the independent variables.
STEPS
1) Enter the data into Excel.
2) Select the data and enter the SLOPE() function to find the beta, as shown in the
figure below.
3) You will get the slope (beta).
4) Use the INTERCEPT() function to find the alpha, as shown below.
Alternative Method
5) Using the Regression tool in the Data Analysis ToolPak, select the input and
output ranges (and a new worksheet, if you want) and click OK.
6) You will get the following results.
The alpha is the Intercept coefficient and the beta is the X Variable 1 coefficient.
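The same values can be cross-checked outside Excel; below is a minimal sketch in R (with made-up x and y values) using the standard least-squares formulas that SLOPE() and INTERCEPT() implement:
# illustrative data (made up)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2)
beta  <- cov(x, y) / var(x)        # slope, equivalent to Excel's SLOPE()
alpha <- mean(y) - beta * mean(x)  # intercept, equivalent to Excel's INTERCEPT()
coef(lm(y ~ x))                    # cross-check with lm(): (Intercept) = alpha, x = beta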
EXPERIMENT – 4
Introduction/ Theory
TSS, or the total sum of squares, gives the total variation in Y. We can see that it
is very similar to the variance of Y: while the variance is the average of the
squared differences between the actual values and their mean, TSS is the total of
those squared differences.
Now that we know the total variation in the target variable, how do we
determine the proportion of this variation explained by our model? For that we
turn to RSS, the residual sum of squares, which measures the variation in Y left
unexplained by the model.
3. Calculate R-Squared
Now, if TSS gives us the total variation in Y, and RSS gives us the variation in Y
not explained by X, then TSS-RSS gives us the variation in Y that is explained
by our model! We can simply divide this value by TSS to get the proportion of
variation in Y that is explained by the model. And this is our R-squared statistic!
R-squared = (TSS-RSS)/TSS
= Explained variation/ Total variation
= 1 – Unexplained variation/ Total variation
So R-squared gives the degree of variability in the target variable that is explained
by the model or the independent variables. If this value is 0.7, then it means that the
independent variables explain 70% of the variation in the target variable.
R-squared value always lies between 0 and 1. A higher R-squared value indicates a
higher amount of variability being explained by our model and vice-versa.
If we had a really low RSS value, it would mean that the regression line was very
close to the actual points. This means the independent variables explain the majority
of variation in the target variable. In such a case, we would have a really high R-
squared value.
On the contrary, if we had a really high RSS value, it would mean that the regression
line was far away from the actual points. Thus, independent variables fail to explain
the majority of variation in the target variable. This would give us a really low R-
squared value.
So, this explains why the R-squared value gives us the variation in the target variable
given by the variation in independent variables.
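As a quick numerical illustration of the formula above, here is a small sketch in R (with made-up x and y values):
# illustrative data (made up)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2)
fit <- lm(y ~ x)
tss <- sum((y - mean(y))^2)     # total variation in y
rss <- sum(residuals(fit)^2)    # variation left unexplained by the model
(tss - rss) / tss               # R-squared = explained / total variation
summary(fit)$r.squared          # matches the built-in value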
Adjusted R-squared modifies R-squared to account for the number of predictors in the model:
Adjusted R-squared = 1 - [(1 - R-squared)(n - 1) / (n - k - 1)],
where n is the number of observations and k is the number of independent variables.
So, if R-squared does not increase significantly on the addition of a new independent
variable, then the value of Adjusted R-squared will actually decrease.
On the other hand, if on adding the new independent variable we see a significant
increase in R-squared value, then the Adjusted R-squared value will also increase.
We can see the difference between R-squared and Adjusted R-squared values if we
add a random independent variable to our model.
As you can see, adding a random independent variable did not help in explaining the
variation in the target variable. Our R-squared value remains the same, giving us a
false indication that this variable might be helpful in predicting the output.
However, the Adjusted R-squared value decreased, which indicates that this new
variable is not actually capturing the trend in the target variable.
Clearly, it is better to use Adjusted R-squared when there are multiple variables in
the regression model. This would allow us to compare models with differing
numbers of independent variables.
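This behaviour can be reproduced with a small simulated sketch in R (made-up data; the pure-noise predictor leaves R-squared essentially unchanged while Adjusted R-squared falls):
# simulate an outcome that depends only on x1, then add a pure-noise predictor x2
set.seed(1)
n  <- 50
x1 <- rnorm(n)
y  <- 2 + 3 * x1 + rnorm(n)
x2 <- rnorm(n)                      # random variable unrelated to y
fit_a <- lm(y ~ x1)
fit_b <- lm(y ~ x1 + x2)
summary(fit_a)$r.squared;     summary(fit_b)$r.squared      # nearly identical
summary(fit_a)$adj.r.squared; summary(fit_b)$adj.r.squared  # typically lower for fit_b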
Step 1: Create the Data
For this example, we’ll create a dataset that contains the following variables for 12
different students:
Exam Score
Hours Spent Studying
Current Grade
Step 2: Fit the Regression Model
Next, we’ll fit a multiple linear regression model using Exam Score as the response
variable and Study Hours and Current Grade as the predictor variables.
To fit this model, click the Data tab along the top ribbon and then click Data
Analysis:
If you don’t see this option available, you need to first load the Data Analysis
ToolPak.
In the window that pops up, select Regression. In the new window that appears,
fill in the following information:
Once you click OK, the output of the regression model will appear:
Step 3: Interpret the Adjusted R-Squared
The adjusted R-squared of the regression model is the number next to Adjusted R
Square:
The adjusted R-squared for this model turns out to be 0.946019.
This value is extremely high, which indicates that the predictor variables Study
Hours and Current Grade do a good job of predicting Exam Score.
Calculation of R Square
Suppose we have the below values for x and y, and we want to add the R-squared
value to the regression.
R-squared is relevant in various fields, such as the stock market and mutual funds,
because it can present the correlation between two variables and explain how much
of the movement of one variable is explained by the movement of another variable.
EXPERIMENT – 5
3. All the Independent Variables in the Equation are Uncorrelated with the
Error Term
In case there is a correlation between an independent variable and the error
term, it becomes easy to predict the error term. This violates the principle that
the error term represents unpredictable random error. Therefore, none of the
independent variables should be correlated with the error term.
4. Observations of the Error Term should also have No Relation with each other
The rule is such that one observation of the error term should not allow us to
predict the next observation.
Aim – To use any statistical tool and justify the linearity assumptions of regression
analysis
Introduction/ Theory
The linearity test is one of the assumption tests in linear regression using the
ordinary least squares (OLS) method. The objective of the linearity test is to
determine whether the distribution of the data of the dependent variable and the
independent variable forms a linear pattern or not.
The linearity assumption must be fulfilled because the regression used is linear
regression. In the linearity assumption test in linear regression, you test the
distribution of the data between the dependent variable and the independent
variable.
STEPS
The data we use for this exercise can be seen in the table below.
In STATA, you will find several icons. Select the table icon with a pencil drawing
(the Data Editor). In the next step, input all the data conveyed above: data for the
rice consumption variable (Y) goes in the first column, and data for the income (X1)
and population (X2) variables go in the second and third columns.
To test linearity in linear regression, I will use a scatter plot graph. In creating a
scatter plot graph between rice consumption (Y) and income (X1), you type in the
command in STATA as follows:
Next, you can press enter, and the scatter plot results of the linearity test between
rice consumption (Y) and income (X1) can be seen below:
In creating a scatter plot graph between rice consumption (Y) and population (X2),
type in the command in STATA as follows:
You can press enter, and the scatter plot results of the linearity test between rice
consumption (Y), and population (X2) can be seen below:
Results
Based on the scatter plot graph for the rice consumption variable with the income
variable, we can see that the data distribution forms a linear trend line. The line
runs from the bottom left to the top right (a positive linear relationship).
The same thing also happens for the scatter plot graph for the rice consumption
variable with the population variable. We can see that the data distribution forms a
positive linear trend.
Based on the results of the linearity test using a scatter plot, we can conclude that
the regression model has fulfilled the linearity assumption. Therefore, it is correct
that we choose to use linear regression.
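The STATA scatter commands themselves are not reproduced here; an equivalent check can be sketched in R with made-up stand-in values (income, population and consumption are placeholders for X1, X2 and Y):
# made-up example values standing in for the exercise data
income      <- c(4, 5, 6, 7, 8, 9, 10, 11)        # X1
population  <- c(20, 22, 25, 27, 30, 33, 35, 38)  # X2
consumption <- c(10, 12, 13, 15, 17, 18, 20, 22)  # Y (rice consumption)
# a roughly straight, upward-sloping cloud of points indicates a positive linear pattern
plot(income, consumption, main = "Rice consumption (Y) vs income (X1)")
plot(population, consumption, main = "Rice consumption (Y) vs population (X2)")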
EXPERIMENT – 7
Introduction/Theory
STEPS
You would observe that you now have the Data Analysis option under the Data tab
in the Microsoft Excel window.
EQUATION 1: INFR is regressed on the other explanatory variables (UNEMP, FDI, EXR).
To examine the degree of multicollinearity between INFR and the other explanatory
variables, INFR would come first in the Microsoft Excel file, followed by the other
explanatory variables, as below.
EQUATION 2: UNEMP is regressed on the other explanatory variables; UNEMP would come
first in the Microsoft Excel file, followed by the other explanatory variables.
EQUATION 3: FDI is regressed on the other explanatory variables; FDI would come first
in the Microsoft Excel file, followed by the other explanatory variables.
EQUATION 4: EXR is regressed on the other explanatory variables. To examine the degree
of multicollinearity between EXR and the other explanatory variables, EXR would come
first in the Microsoft Excel file, followed by the other explanatory variables.
EQUATION 1:
Put the mouse cursor in the Input Y Range and highlight the Y variable, which is the
INFR column, and do the same for the X variables, which are UNEMP, EXR and FDI.
FOR Y
Click OK and the regression result shows:
The same procedure would be followed for other variables.
EQUATION 2
EQUATION 3
EQUATION 4
EQUATION  VARIABLE  R2
1         INFR      0.363227848
2         UNEMP     0.930691032
3         FDI       0.208836755
4         EXR       0.937053465
9) Compute 1 - R2 for each equation.
EQUATION  VARIABLE  R2            1 - R2
1         INFR      0.363227848   0.636772
2         UNEMP     0.930691032   0.069309
3         FDI       0.208836755   0.791163
4         EXR       0.937053465   0.062947
STEP TEN: Compute the Variance Inflation Factor for each variable and interpret.
VIF = 1 / (1 - R2)
EQUATION  VARIABLE  R2            1 - R2     VIF
1         INFR      0.363227848   0.636772   1.57042
2         UNEMP     0.930691032   0.069309   14.42815
3         FDI       0.208836755   0.791163   1.263962
4         EXR       0.937053465   0.062947   15.8865
Decision:
EQUATION 1 (INFR): R2 = 0.363227848, 1 - R2 = 0.636772, VIF = 1.57042. VIF < 5:
there is little or no evidence of multicollinearity of INFR with the other explanatory
variables.
EQUATION 2 (UNEMP): R2 = 0.930691032, 1 - R2 = 0.069309, VIF = 14.42815. VIF > 10:
there is evidence of high multicollinearity of UNEMP with the other explanatory
variables.
EQUATION 3 (FDI): R2 = 0.208836755, 1 - R2 = 0.791163, VIF = 1.263962. VIF < 5:
there is little or no evidence of multicollinearity of FDI with the other explanatory
variables.
EQUATION 4 (EXR): R2 = 0.937053465, 1 - R2 = 0.062947, VIF = 15.8865. VIF > 10:
there is evidence of high multicollinearity of EXR with the other explanatory
variables.
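The same auxiliary-regression procedure can be sketched in R; the experiment's actual series are not reproduced here, so the data frame below is a made-up stand-in with the same column names (INFR, UNEMP, FDI, EXR):
# hypothetical stand-in data
set.seed(1)
dat_vif <- data.frame(UNEMP = rnorm(30), FDI = rnorm(30), EXR = rnorm(30))
dat_vif$INFR <- 0.5 * dat_vif$UNEMP + rnorm(30)
# auxiliary regression for INFR: regress it on the other explanatory variables
r2_infr  <- summary(lm(INFR ~ UNEMP + FDI + EXR, data = dat_vif))$r.squared
vif_infr <- 1 / (1 - r2_infr)   # VIF = 1 / (1 - R2); values above 10 signal high multicollinearity
vif_infr
# repeat with UNEMP, FDI and EXR as the dependent variable to fill in the table above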
EXPERIMENT – 8
For example, if it's rainy today, the data suggests that it's more likely to rain
tomorrow than if it's clear today. When it comes to investing, a stock might have a
strong positive autocorrelation of returns, suggesting that if it's "up" today, it's
more likely to be up tomorrow, too.
Naturally, autocorrelation can be a useful tool for traders, particularly technical
analysts.
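As a rough illustration (simulated returns, not real market data), the lag structure of a return series can be inspected in R with the acf() function:
# simulate a mildly positively autocorrelated return series (AR(1) with coefficient 0.4)
set.seed(1)
returns <- as.numeric(arima.sim(model = list(ar = 0.4), n = 250))
acf(returns)                                  # sample autocorrelation at each lag
cor(returns[-1], returns[-length(returns)])   # lag-1 autocorrelation directly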
STEPS
Introduction/Theory
Heteroscedasticity makes it much more likely for a regression model to declare that
a term in the model is statistically significant when in fact it is not.
Once you fit a regression line to a set of data, you can then create a scatterplot that
shows the fitted values of the model vs. the residuals of those fitted values.
STEPS
Notice how the residuals become much more spread out as the fitted values get
larger. This “cone” shape is a telltale sign of heteroscedasticity.
For example, if we are using population size (independent variable) to predict the
number of flower shops in a city (dependent variable), we may instead try to use
population size to predict the log of the number of flower shops in a city.
Using the log of the dependent variable, rather than the original dependent
variable, often causes heteroscedasticity to go away.
In most cases, this reduces the variability that naturally occurs among larger
populations, since we’re then measuring the number of flower shops per person
rather than the raw number of flower shops.
Another fix is to use weighted regression, which assigns a weight to each data point
based on the variance of its fitted value. Essentially, this gives small weights to data
points that have higher variances, which shrinks their squared residuals. When the
proper weights are used, this can eliminate the problem of heteroscedasticity.
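A compact sketch in R (simulated data) of the diagnostic plot and the log-transform fix described above:
# simulate data whose error spread grows with the predictor (heteroscedasticity)
set.seed(1)
pop   <- runif(100, 1, 100)                           # population size
shops <- 2 + 0.5 * pop + rnorm(100, sd = 0.08 * pop)  # error variance grows with pop
fit <- lm(shops ~ pop)
plot(fitted(fit), residuals(fit))          # "cone" shape: residuals fan out as fitted values grow
# fix: model the log of the dependent variable instead
fit_log <- lm(log(shops) ~ pop)
plot(fitted(fit_log), residuals(fit_log))  # the spread is usually much more even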
Conclusion
However, by using a fitted value vs. residual plot, it can be fairly easy to spot
heteroscedasticity.