
EMF QUESTIONS AND ANSWERS

These assumptions are important because they are what make the Ordinary Least Squares (OLS) estimator unbiased and give it the lowest variance among all linear unbiased estimators (the Gauss-Markov result).
Depending on which assumption is violated, the OLS estimator may become biased, or it may remain unbiased but lose its minimum-variance property, leading to less precise and less reliable estimates.

The assumptions of linearity and of independence between the regressors and the error term (zero conditional mean) are needed for OLS to be unbiased. The assumption of normally distributed errors is needed for the standard statistical inference methods used in regression analysis (e.g., hypothesis testing, confidence intervals) to be exactly valid. The assumption of equal error variances (homoskedasticity) is needed for the OLS estimator to have the minimum variance among all linear unbiased estimators.
Interpretation of Beta 1 in an SLR

Interpretation of Beta 1 in an MLR


Interpretation of Beta 1 if it is a dummy in an SLR.

Coefficient of Determination -> R-squared


Interpretation of Beta 1 if it is a dummy in an MLR.
What can we do to check for potential issues in a regression?

Here are some tests and diagnostic tools that you can use to identify potential issues in a
regression analysis:

1. Examining the residuals:


Plot the residuals against the predicted values to check for patterns that may indicate a violation
of the assumptions of the model.
Use the Breusch-Pagan test or the White test to formally test for heteroscedasticity.

2. Checking the fit of the model:


Use the coefficient of determination (R-squared) and adjusted R-squared to measure the
goodness of fit of the model.
Use the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) to
compare the fit of different models.

3. Examining the correlation between the independent variables:


Calculate the variance inflation factor (VIF) for each independent variable to check for
multicollinearity.

4. Testing for outliers:


Use the studentized residuals to identify potential outliers.
Use the Cook's distance measure to identify observations that have a disproportionate influence
on the estimates of the regression coefficients.

5. Checking for influential observations:


Use DFBETAS (the change in each estimated coefficient when an observation is deleted, scaled by the coefficient's standard error) or other influence measures (hat matrix, leverage, etc.) to identify observations that may be influential.

These are just a few examples of the tests and diagnostic tools that are available for identifying
potential issues in a regression analysis. It is important to carefully consider the assumptions of
the model and the characteristics of the data when choosing the appropriate tests and
diagnostics to use.
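As a rough illustration of several of these diagnostics together, here is a minimal Python sketch using statsmodels on simulated data (all variable names and numbers are made up for the example; only the library calls are standard):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence, variance_inflation_factor

# Simulated data, purely for illustration
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(scale=0.8, size=n)   # mildly correlated with x1
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
res = sm.OLS(y, X).fit()

# 1. Residuals vs fitted values (these would normally be plotted to look for patterns)
resid, fitted = res.resid, res.fittedvalues

# 3. Variance inflation factors for the two regressors (multicollinearity check)
vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]
print("VIFs:", vifs)

# 4.-5. Outliers and influential observations
infl = OLSInfluence(res)
stud_resid = infl.resid_studentized_external     # studentized residuals
cooks_d, _ = infl.cooks_distance                 # Cook's distance for each observation
print("Largest |studentized residual|:", np.abs(stud_resid).max())
print("Largest Cook's distance:", cooks_d.max())
```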
What is the Shapiro-Wilk test and what can we use it for?

The Shapiro-Wilk test is a statistical test used to assess whether a sample of data comes from a population with a normal distribution. The null hypothesis is that the sample is drawn from a normally distributed population.

To use the Shapiro-Wilk test, you need to:


1. Collect a sample of data.
2. Calculate the mean and standard deviation of the sample.
3. Use the Shapiro-Wilk test to test the hypothesis that the sample comes from a population
with a normal distribution.

The Shapiro-Wilk test returns a test statistic and a p-value. If the p-value is less than a
predetermined significance level (usually 0.05), you can reject the null hypothesis that the sample
comes from a normally distributed population. If the p-value is greater than the significance level,
you cannot reject the null hypothesis.

The Shapiro-Wilk test is sensitive to deviations from normality and is generally regarded as one of the more powerful normality tests, particularly for small to moderate sample sizes. Other tests, such as the D'Agostino-Pearson omnibus test or the Anderson-Darling test, can also be used and may pick up different kinds of departures from normality. Keep in mind that with very small samples any normality test has limited power, while with very large samples even trivial departures from normality can produce small p-values. In regression analysis, the test is typically applied to the residuals to check the normality assumption.

How do we analyze a Shapiro-Wilk W test in practice?
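In practice, you look at the reported W statistic and, more importantly, its p-value, and compare the p-value with your chosen significance level, exactly as described above. A minimal sketch with SciPy's shapiro function on simulated data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=170, scale=10, size=100)   # simulated data for illustration

w_stat, p_value = stats.shapiro(sample)
print(f"W = {w_stat:.4f}, p-value = {p_value:.4f}")

# Decision at the 5% significance level
if p_value < 0.05:
    print("Reject H0: evidence against normality.")
else:
    print("Do not reject H0: no evidence against normality.")
```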


What is the p-value and how can we interpret it?

In a regression analysis, the p-value is the smallest significance level at which we can reject the null hypothesis. It is the probability of observing a test statistic as extreme as, or more extreme than, the one actually obtained, assuming the null hypothesis (H0) is true. The p-value for a given predictor variable is used to test the null hypothesis that the true population coefficient for that variable is equal to zero. If the p-value is less than a predetermined significance level (usually 0.05), you can reject the null hypothesis and conclude that there is a statistically significant relationship between the predictor variable and the response variable.

For example, if the p-value for a given predictor variable is 0.03, you can reject the null hypothesis
that the true population coefficient is zero and conclude that there is a statistically significant
relationship between the predictor variable and the response variable.

More Examples:

Here is a complete example of how you might interpret the results of a regression analysis when
the p-value is 0.117:
Since the p-value is greater than the predetermined significance level of 0.05, you do not have
sufficient statistical evidence to reject the null hypothesis that the true population coefficient for
predictor X is equal to zero. This means that you cannot conclude that there is a statistically
significant relationship between predictor X and response Y.

On the other hand, if the p-value is 0.000, you can reject the null hypothesis that the true
population coefficient for predictor X is equal to zero. This means that you can conclude that
there is a statistically significant relationship between predictor X and response Y. It is important
to carefully interpret the results of the regression analysis and consider the implications of the
findings in the context of your research question or hypothesis.

When the reported p-value is 0.000, the probability of observing a test statistic at least as extreme as the one observed, given that the null hypothesis is true, is extremely low: the software has simply rounded it to zero at three decimal places, so in practice it is reported as "less than 0.001". This suggests that the null hypothesis is very unlikely to have generated the observed result and that the result is statistically significant at any conventional significance level.
If you are using a 10% significance level (alpha = 0.10) and the p-value of a test is 0.117, you
would not reject the null hypothesis.
The p-value is the probability of observing a test statistic at least as extreme as the one observed,
given that the null hypothesis is true. If the p-value is less than the predetermined significance
level (alpha), you can reject the null hypothesis and conclude that the observed result is
statistically significant. If the p-value is greater than the significance level, you cannot reject the
null hypothesis.
For example, if the p-value is 0.117 and the significance level is 0.10, you cannot reject the null
hypothesis because the p-value is greater than the significance level. This means that you do not
have sufficient statistical evidence to conclude that the observed result is statistically significant.
It is important to carefully consider the p-value and the significance level when interpreting the
results of a statistical test. The significance level represents the threshold for determining
statistical significance, and the p-value is the probability of observing a test statistic at least as
extreme as the one observed, given that the null hypothesis is true.

Interpretation of the R-Squared

Here is an example of how you might interpret the R-squared values of 0.07 and 0.68:

The R-squared value is a measure of the proportion of the variance in the response variable that
is explained by the predictor variable. An R-squared value of 0.07 means that 7% of the variance
in the response variable is explained by the predictor variable. This suggests that the predictor
variable has a relatively weak relationship with the response variable.

On the other hand, if you obtain an R-squared value of 0.68, this means that 68% of the variance
in the response variable is explained by the predictor variable. This suggests that the predictor
variable has a relatively strong relationship with the response variable.

It is important to note that the R-squared value is just one measure of the fit of the model. Other
measures, such as the adjusted R-squared, may also be used to assess the fit of the model. In
addition, the R-squared value should be interpreted in the context of the research question or
hypothesis being tested and the characteristics of the data.
R-Squared vs. Adjusted R-Squared

The coefficient of determination (R-squared) is a measure of the proportion of the variance in the response variable that is explained by the predictor variables in a regression model. It is defined as the ratio of the sum of squares explained by the model to the total sum of squares.

The R-squared value can range from 0 to 1, where a value of 0 indicates that the model does not
explain any of the variance in the response variable and a value of 1 indicates that the model
explains all the variance in the response variable.
A high R-squared value indicates a good fit of the model to the data.

The adjusted R-squared is a modified version of the R-squared that adjusts for the number of
predictor variables in the model. It is defined as:

Adjusted R-squared = 1 - (1 - R-squared) * (n - 1) / (n - k - 1)

where n is the sample size and k is the number of predictor variables in the model.

The adjusted R-squared adjusts for the fact that R-squared tends to increase as the number of
predictor variables increases, even if the additional variables do not improve the fit of the model.
As a result, the adjusted R-squared is a better measure of the fit of the model when comparing
models with different numbers of predictor variables.
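As a quick numerical sketch of this formula (a minimal Python snippet; the R-squared, n, and k values below are arbitrary):

```python
def adjusted_r_squared(r2: float, n: int, k: int) -> float:
    """Adjusted R-squared = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Example: R-squared of 0.68 with 100 observations and 3 predictors
print(adjusted_r_squared(0.68, n=100, k=3))   # about 0.670
```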

How can we construct a confidence interval?

For example, suppose you are estimating the mean height of a population of adults with
95% confidence, and you collect a sample of 100 adults. The sample mean height is 170
cm and the standard error is 2 cm. The critical value for a 95% confidence interval is 1.96.
The margin of error is 1.96 * 2 = 3.92 cm. The confidence interval is 170 cm +/- 3.92 cm,
or 166.08 cm to 173.92 cm.
This is a general process for building a confidence interval. The specific steps and
calculations may vary depending on the type of data and the population parameter of
interest. It is important to carefully consider the assumptions and requirements for
constructing a confidence interval and to use the appropriate methods and formulas.

CI 95% = estimated Beta1 +/- critical value x standard error of Beta1
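A minimal sketch of this calculation for a regression slope, assuming the coefficient estimate, its standard error, and the residual degrees of freedom are already known (the numbers below are made up):

```python
from scipy import stats

beta1_hat = 0.85     # hypothetical coefficient estimate
se_beta1 = 0.30      # hypothetical standard error
df_resid = 97        # hypothetical n - k - 1

t_crit = stats.t.ppf(0.975, df_resid)          # two-sided 95% critical value
margin = t_crit * se_beta1
print(f"95% CI: [{beta1_hat - margin:.3f}, {beta1_hat + margin:.3f}]")
```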


Tests

RESET Test:

Definition: The RESET test (Regression Equation Specification Error Test) is a statistical test used to assess the functional form of a regression model. It tests the null hypothesis that the model is correctly specified (i.e., that the functional form of the model is appropriate for the data) against the alternative hypothesis that the model is misspecified.

Test-statistic formula: See paper example.

Interpretation: If the RESET statistic is statistically significant, you can reject the null hypothesis that the model is correctly specified and conclude that the model is misspecified. If the RESET statistic is not statistically significant, you cannot reject the null hypothesis, and there is no evidence of misspecification in the functional form.
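One common way to carry out a RESET-type check by hand is to re-estimate the model with powers of the fitted values added and F-test their joint significance. A rough sketch with statsmodels on simulated data (variable names and coefficients are illustrative only):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)
y = 1 + 2 * x + 0.5 * x**2 + rng.normal(size=n)   # true relationship is nonlinear in x

# Restricted model: linear in x only
X = sm.add_constant(x)
res_restricted = sm.OLS(y, X).fit()

# Augmented model: add powers of the fitted values
yhat = res_restricted.fittedvalues
X_aug = sm.add_constant(np.column_stack([x, yhat**2, yhat**3]))
res_aug = sm.OLS(y, X_aug).fit()

# F-test that the coefficients on yhat^2 and yhat^3 are jointly zero
f_stat, p_value, df_diff = res_aug.compare_f_test(res_restricted)
print(f"RESET-type F = {f_stat:.3f}, p-value = {p_value:.4f}")
```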

F-test:

Definition: The F-test is a statistical test used to compare the fit of two nested models. It tests the null hypothesis that the more complex model (i.e., the model with more parameters) does not fit the data significantly better than the simpler (restricted) model. This test can be used for testing the overall significance of the model or the joint significance of 2 or more variables.

F-statistic formula: For joint significance, F = [(SSR_restricted - SSR_unrestricted) / q] / [SSR_unrestricted / (n - k - 1)], where q is the number of restrictions, n the sample size, and k the number of regressors in the unrestricted model. For overall significance, F = [R-squared / k] / [(1 - R-squared) / (n - k - 1)].

Interpretation: If the F-test statistic is statistically significant, you can reject the null
hypothesis that the more complex model does not fit the data significantly better than
the simpler model and conclude that the more complex model is preferred. If the F-test
statistic is not statistically significant, you cannot reject the null hypothesis and you should
consider the simpler model to be preferred.
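A minimal sketch of a joint-significance F-test using statsmodels' formula interface on simulated data (the names x1, x2, x3 and the true coefficients are made up):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 200
df = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "x3": rng.normal(size=n),
})
df["y"] = 1 + 2 * df["x1"] + 0.3 * df["x2"] + rng.normal(size=n)

res = smf.ols("y ~ x1 + x2 + x3", data=df).fit()

# Joint significance of x2 and x3 (H0: both coefficients are zero)
print(res.f_test("x2 = 0, x3 = 0"))

# The overall-significance F statistic of the regression is reported automatically
print("Overall F:", res.fvalue, "p-value:", res.f_pvalue)
```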
T-test for a single coefficient:

Definition: The t-test for a single coefficient is a statistical test used to determine whether
the value of a regression coefficient is significantly different from zero. It tests the null
hypothesis that the regression coefficient is equal to zero against the alternative
hypothesis that the coefficient is not equal to zero.

T-statistic formula: t = (Beta estimated – Beta under H0) / Standard Error of Beta

Interpretation: If the t-test for a single coefficient statistic is statistically significant, you
can reject the null hypothesis that the regression coefficient is equal to zero and conclude
that the coefficient is significantly different from zero. If the t-test for a single coefficient
statistic is not statistically significant, you cannot reject the null hypothesis and you should
conclude that the coefficient is not significantly different from zero.
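A tiny numerical sketch of this formula, with made-up values for the estimate, the hypothesized value, and the standard error:

```python
from scipy import stats

beta_hat = 0.85      # hypothetical coefficient estimate
beta_h0 = 0.0        # value of the coefficient under H0
se_beta = 0.30       # hypothetical standard error
df_resid = 97        # hypothetical residual degrees of freedom

t_stat = (beta_hat - beta_h0) / se_beta
p_value = 2 * stats.t.sf(abs(t_stat), df_resid)   # two-sided p-value
print(f"t = {t_stat:.2f}, p-value = {p_value:.4f}")
```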

White Test:

Definition: The White test is a statistical test used to assess heteroskedasticity in a regression model (the Breusch-Pagan test, the complete White test, and the simplified White test are closely related tests used for the same purpose). It tests the null hypothesis that the variance of the residuals is constant across all observations against the alternative hypothesis that the variance of the residuals is not constant.

Test-statistic formula: For Breusch-Pagan, we regress the squared residuals (u-hat squared) on the original regressors and test the overall significance of that auxiliary regression (i.e., test all the slope coefficients jointly). For the complete White test, the auxiliary regression also includes the squares and cross-products of the regressors, and we again test overall significance. Lastly, for the simplified White test, we regress the squared residuals on the fitted values y-hat and y-hat squared and test their joint significance. In each case we can use the LM statistic: LM = n x R-squared of the auxiliary regression.

Interpretation: If the White test statistic is statistically significant, you can reject the null hypothesis that the variance of the residuals is constant and conclude that the variance is not constant (i.e., there is heteroskedasticity in the model). If the White test statistic is not statistically significant, you cannot reject the null hypothesis and you should consider the variance of the residuals to be constant.
Note: When we find heteroskedasticity, we can do the following:
- First, we have to say that there is statistical evidence of heteroskedasticity in the residuals. The OLS betas are no longer the most efficient estimators. If the remaining assumptions are verified, they are still unbiased, but not efficient, and the usual covariance matrix (and hence the usual standard errors) is not valid.
- Use log-transformed variables, which may reduce the disparities in variance, or use robust (heteroskedasticity-consistent) standard errors.
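A rough sketch of the Breusch-Pagan and White tests using statsmodels' built-in het_breuschpagan and het_white functions on simulated data (the data are deliberately constructed to be heteroskedastic):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan, het_white

rng = np.random.default_rng(4)
n = 300
x1 = rng.uniform(1, 5, size=n)
x2 = rng.normal(size=n)
# Error standard deviation grows with x1, so heteroskedasticity is present by construction
y = 1 + 2 * x1 + x2 + rng.normal(scale=x1, size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
res = sm.OLS(y, X).fit()

lm_bp, lm_p_bp, f_bp, f_p_bp = het_breuschpagan(res.resid, X)
print(f"Breusch-Pagan LM = {lm_bp:.2f}, p-value = {lm_p_bp:.4f}")

lm_w, lm_p_w, f_w, f_p_w = het_white(res.resid, X)
print(f"White LM = {lm_w:.2f}, p-value = {lm_p_w:.4f}")

# If heteroskedasticity is found, robust standard errors are one simple remedy
res_robust = sm.OLS(y, X).fit(cov_type="HC1")
print("Robust standard errors:", res_robust.bse)
```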

Test to an interaction between variables:

Definition: The test of an interaction between variables is a statistical test used to determine whether the effect of one predictor variable on the response variable depends on the value of another predictor variable. It tests the null hypothesis that there is no interaction between the predictor variables against the alternative hypothesis that there is an interaction. A closely related test is a test of a linear restriction on the coefficients, for example H0: B1 = B2, which can be rewritten as B1 - B2 = 0.

Test-statistic formula: The test of an interaction typically involves testing the significance of a product term (i.e., the product of two predictor variables) in a regression model. For a restriction such as B1 = B2, we usually use a change of variable: define theta = B1 - B2, substitute it into the model, and then do a t-test on theta.
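A small sketch of the B1 = B2 case with statsmodels on simulated data: the built-in linear-restriction t-test and the change-of-variable (theta = B1 - B2) approach give the same answer (all names and numbers are made up):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 200
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 1 + 1.5 * df["x1"] + 1.5 * df["x2"] + rng.normal(size=n)  # B1 = B2 holds

res = smf.ols("y ~ x1 + x2", data=df).fit()

# H0: B1 - B2 = 0, tested directly as a linear restriction
print(res.t_test("x1 = x2"))

# Equivalent reparameterization: y = b0 + theta*x1 + B2*(x1 + x2) + u, then t-test on theta
df["x1_plus_x2"] = df["x1"] + df["x2"]
res_theta = smf.ols("y ~ x1 + x1_plus_x2", data=df).fit()
print("theta t-stat:", res_theta.tvalues["x1"], "p-value:", res_theta.pvalues["x1"])
```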

Chow Test:

Definition: The Chow test is a statistical test used to compare the fit of a model to two or
more different subsamples of data. It tests the null hypothesis that the coefficients of the
model are equal across all subsamples against the alternative hypothesis that the
coefficients are not equal. We use Special Chow Test when samples are different.

T-statistic formula: See on paper.

Interpretation: If the Chow test statistic is statistically significant, you can reject the null
hypothesis that the coefficients of the model are equal across all subsamples and
conclude that the coefficients are not equal. If the Chow test statistic is not statistically
significant, you cannot reject the null hypothesis and you should consider the coefficients
to be equal across all subsamples.
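A rough sketch of the standard Chow test computed by hand on two simulated subsamples (the split and the coefficients are made up for illustration):

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(6)
n1, n2 = 120, 130
x1 = rng.normal(size=n1)
x2 = rng.normal(size=n2)
y1 = 1 + 2 * x1 + rng.normal(size=n1)          # subsample 1
y2 = 3 + 0.5 * x2 + rng.normal(size=n2)        # subsample 2: different coefficients

def ssr(y, x):
    """Sum of squared residuals from an OLS fit with an intercept."""
    return sm.OLS(y, sm.add_constant(x)).fit().ssr

ssr_pooled = ssr(np.concatenate([y1, y2]), np.concatenate([x1, x2]))
ssr_1, ssr_2 = ssr(y1, x1), ssr(y2, x2)

k_plus_1 = 2                                   # intercept + one slope
n = n1 + n2
f_stat = ((ssr_pooled - (ssr_1 + ssr_2)) / k_plus_1) / ((ssr_1 + ssr_2) / (n - 2 * k_plus_1))
p_value = stats.f.sf(f_stat, k_plus_1, n - 2 * k_plus_1)
print(f"Chow F = {f_stat:.2f}, p-value = {p_value:.4f}")
```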
What is the base group of a regression?
In a regression analysis, the base group is the group that serves as the reference point
for comparison with the other groups in the model. In other words, it is the group
against which the effects of the other groups are measured.

If we have dummies in the model, the base group is usually the group for which the
dummy is = 0. Another way to determine it might be to evaluate the group with the
greater proportion within the sample.

If we change units of measurement in a regression, for example for Market Cap, how will the coefficient and the T-statistic change?
If we re-run a regression that uses MktCap measured in thousands of dollars, but this time measure it in billions of dollars, each value of the variable becomes 1,000,000 times smaller (since 1 billion dollars is 1,000,000 thousand dollars), so the coefficient B3 on MktCap becomes 1,000,000 times larger in order to describe the same underlying effect. For example, a coefficient of 2 per thousand dollars corresponds to a coefficient of 2,000,000 per billion dollars.
Moreover, the T-statistic will not change when you transform the scale of an independent variable. Since it is a measure of the significance of the coefficient estimate, it is not affected by the scaling of the independent variable.

List some potential threats to the identification of causal effects in econometrics.
There are several potential threats to the identification of causal effects in econometrics.
Remember, zero conditional mean is crucial for causality (MLR.4). Some of the most
common include:

1. Omitted variable bias: This occurs when an important variable that affects the outcome of interest is not included in the model. This can lead to biased coefficient estimates and incorrect conclusions about the causal effect of the variables that are included.
2. Reverse causality: This occurs when the direction of causality is reversed, so that the supposed independent variable is actually affected by the dependent variable. This can lead to biased coefficient estimates and incorrect conclusions about the causal effect.
3. Endogeneity: This occurs when an independent variable is correlated with an omitted variable that affects the dependent variable. This can lead to biased coefficient estimates and incorrect conclusions about the causal effect.
4. Selection bias: This occurs when the sample used to estimate the model is not representative of the population of interest, leading to biased coefficient estimates and incorrect conclusions about the causal effect.
5. Measurement error: This occurs when the variables in the model are measured with error, leading to biased coefficient estimates and incorrect conclusions about the causal effect.

What assumptions must be satisfied for the causal effect to be measured accurately?
To interpret the coefficient on an independent variable as measuring the causal effect of that
variable on the dependent variable, certain assumptions must be satisfied. These assumptions
include:
1. The independent variable must be exogenous, meaning that it is not affected by the
other variables in the model or by omitted variables that affect the dependent variable.
2. There must be no reverse causality, meaning that the dependent variable does not
affect the independent variable.
3. There must be no omitted variable bias, meaning that all variables that affect the
dependent variable and are correlated with the independent variable must be included in the
model.
4. The sample used to estimate the model must be representative of the population of
interest and must not be subject to selection bias.
5. The variables in the model must be measured accurately and without error.

If these assumptions are satisfied, then the coefficient on the independent variable can be
interpreted as measuring the causal effect of that variable on the dependent variable.
The key assumption for causality is MLR.4, zero conditional mean.

Figure 1 - Practical Example


What is the consequence of estimating a model without a relevant variable?
How can we calculate the bias? PRACTICAL EXAMPLE
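The worked example itself is in the figure, but as a rough sketch, the standard omitted-variable-bias formula says that the slope from the short regression converges to B1 + B2*delta1, where delta1 is the slope from regressing the omitted variable on the included one. A small simulation (all numbers are made up):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 50_000
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)             # relevant variable, correlated with x1
y = 1 + 2.0 * x1 + 1.5 * x2 + rng.normal(size=n)

# Short regression that omits x2
short = sm.OLS(y, sm.add_constant(x1)).fit()

# Auxiliary regression of the omitted variable on the included one
delta1 = sm.OLS(x2, sm.add_constant(x1)).fit().params[1]

print("Slope from short regression:", short.params[1])     # close to 2.0 + 1.5*0.6 = 2.9
print("True B1 + B2*delta1:        ", 2.0 + 1.5 * delta1)   # omitted-variable-bias formula
```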

We create an instrument variable to account for endogeneity. What are the requirements for it to be valid? PRACTICAL EXAMPLE
Define instrument variables and what is needed for them to be valid.

Instrumental variables (IVs) are used in econometrics to identify the causal effect of an
independent variable on a dependent variable when the assumption of exogeneity (that
the independent variable is not affected by omitted variables that affect the dependent
variable) is not satisfied.
IVs are variables that are correlated with the independent variable of interest and are
believed to affect the dependent variable only through their effect on the independent
variable.

IVs are used in a two-stage least squares (2SLS) regression, in which the IV is used to
predict the independent variable in the first stage, and the predicted values of the
independent variable are used as an explanatory variable in the second stage, along with
any other relevant variables. The coefficient estimate for the independent variable in the
second stage is then interpreted as the causal effect of the independent variable on the
dependent variable.

For an IV to be valid, it must meet the following requirements:


1. Relevance: the IV must be correlated with the endogenous independent variable, X.
2. Exogeneity: the IV must not be correlated with the error term of the structural (second stage) equation; it should affect Y only through its effect on X.
3. In particular, the IV must not be correlated with any omitted variables that affect the dependent variable.

If these requirements are satisfied, then the coefficient estimate for the independent
variable in the second stage equation can be interpreted as a valid estimate of the causal
effect of the independent variable on the dependent variable. However, it is important to
note that IVs are not always available and may not always meet the necessary
assumptions, in which case alternative methods may be needed to identify the causal
effect.
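A minimal sketch of the two stages using ordinary OLS on simulated data (note that the standard errors from this manual second stage are not the correct 2SLS standard errors; dedicated IV routines adjust for that). The variables and coefficients below are made up:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 5_000
z = rng.normal(size=n)                          # instrument
u = rng.normal(size=n)                          # unobserved factor
x = 0.8 * z + 0.7 * u + rng.normal(size=n)      # endogenous regressor (correlated with u)
y = 1 + 2.0 * x + 1.5 * u + rng.normal(size=n)  # structural equation; true effect of x is 2

# Naive OLS is biased because x is correlated with the error term (which contains u)
print("OLS slope: ", sm.OLS(y, sm.add_constant(x)).fit().params[1])

# Stage 1: regress the endogenous variable on the instrument
x_hat = sm.OLS(x, sm.add_constant(z)).fit().fittedvalues

# Stage 2: regress y on the first-stage fitted values
print("2SLS slope:", sm.OLS(y, sm.add_constant(x_hat)).fit().params[1])
```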

Example on Instrument Variables

Here is an example of a regression with real world variables and an instrument variable to
account for endogeneity:

Example: A researcher is interested in studying the relationship between education and income. They gather data on a sample of individuals and run a simple regression of
income on education. However, they are concerned that there may be omitted variable
bias, as education and income may be correlated with other factors (e.g., intelligence,
ambition) that are not included in the model. To address this issue, the researcher decides
to use an instrument variable to control for endogeneity.
The researcher decides to use years of schooling as the instrument variable, since this is
likely to be correlated with education but not directly with income. They run a two-stage
least squares (2SLS) regression, with income as the dependent variable, education as the
main independent variable, and years of schooling as the instrument.

Questions:
1. Why is the researcher using an instrument variable in this regression?
The researcher is using an instrument variable to control for endogeneity in the
relationship between education and income. Without controlling for endogeneity, the
results of the regression may be biased due to omitted variable bias.

2. What is the instrument variable in this regression and how is it related to the
main independent variable?
The instrument variable in this regression is years of schooling. It is likely to be correlated
with the main independent variable, education, but not directly with the dependent
variable, income.

3. What type of regression is being used in this example and how does it differ
from a simple linear regression?
The researcher is using a two-stage least squares (2SLS) regression in this example. It
differs from a simple linear regression in that it includes an instrument variable to control
for endogeneity. In a 2SLS regression, the instrument variable is used to predict the main
independent variable in the first stage, and then the predicted values of the main
independent variable are used as an explanatory variable in the second stage to predict
the dependent variable. This allows the researcher to control for omitted variable bias and
estimate the true causal relationship between the main independent variables and the
dependent variable.
What does endogeneity refer to and why do we use instrument variables?

Endogeneity refers to a situation in which an explanatory variable in a regression model is correlated with the error term. This can cause biased and inconsistent estimates of the effect of the explanatory variable on the dependent variable.

Instrument variables are used to control for endogeneity in regression analysis. An instrument
variable is a variable that is correlated with the explanatory variable of interest but is not directly
related to the dependent variable. By using the instrument variable to predict the explanatory
variable, the researcher can control for omitted variable bias and estimate the true causal
relationship between the explanatory variable and the dependent variable.

For example, suppose a researcher is interested in studying the relationship between education
and income. They gather data on a sample of individuals and run a simple regression of income
on education. However, they are concerned that there may be omitted variable bias, as
education and income may be correlated with other factors (e.g., intelligence, ambition) that are
not included in the model. To address this issue, the researcher decides to use an instrument
variable to control for endogeneity. They might choose to use years of schooling as the
instrument variable, since this is likely to be correlated with education but not directly with
income. By using a two-stage least squares (2SLS) regression and including years of schooling as
the instrument variable, the researcher can control for omitted variable bias and estimate the
true causal relationship between education and income.
True or False Questions

This equation represents the orthogonality condition in instrumental variables (IV) regression.
It states that the residuals (ûi) are orthogonal to, i.e., uncorrelated with, the fitted values (ŷi). In other words, the equation says that when each residual is multiplied by the deviation of the corresponding fitted value from its mean and these products are summed over all observations, the result is zero.

The orthogonality condition is important in IV regression because it ensures that the estimates of the parameters in the model are consistent. It allows the researcher to estimate the true causal relationship between the explanatory variable and the dependent variable, even in the presence of omitted variable bias.

Basically, this looks similar to the sum of squared residuals, but here each residual is multiplied not by the difference between the observed and fitted values but by the difference between the fitted value and the mean. The OLS first-order conditions guarantee that the residuals are orthogonal to every regressor (and hence to the fitted values) and that they sum to zero when an intercept is included, so this sum is always exactly zero.
Consider the following population model that satisfies MLR.1 - MLR.4:

yi = β0 + β1 Di + β2 xi + εi

where Di is a dummy variable. Omitting D from the regression causes a bias in the intercept (β0) only.

True or False

This statement is false. Omitting a variable from a linear regression model can cause a
bias in the estimates of all of the parameters, not just the intercept.

In particular, if an omitted variable affects the dependent variable and is correlated with the included regressors, the estimates of the coefficients for those regressors will be biased. The omitted variable acts as a confounder: it is correlated with both the dependent variable and at least one of the included independent variables.

In the case of the population model provided, if the dummy variable Di is omitted from the regression, its effect β1Di is absorbed into the error term. The estimate of the intercept (β0) will generally be biased, and the estimate of the coefficient on xi (β2) will also be biased whenever Di is correlated with xi. The bias is therefore not confined to the intercept.

When a dummy variable is included in a regression model, it allows the two groups it defines to differ, serving as a control for the group effect. If the dummy variable is omitted from the model, that group difference is pushed into the error term and is no longer controlled for. This can lead to biased estimates of all of the parameters in the model, not just the intercept.
The statement "One must report the heteroskedasticity-adjusted standard errors every time that
the assumption of normal errors is rejected in the data" is false because it is not always necessary
to use the heteroskedasticity-adjusted standard errors when the assumption of normal errors is
rejected.

Heteroskedasticity-adjusted standard errors are used to correct for the bias that can occur in the
estimates of the standard errors of the coefficients in a linear regression model when the errors
are heteroskedastic (have unequal variances). However, there are other methods that can be
used to account for heteroskedasticity, such as using a different error distribution or using
weighted least squares. In some cases, it may be more appropriate to use one of these other
methods, depending on the specific context and goals of the analysis.

To "report" the heteroskedasticity-adjusted standard errors means to include them in the results
of the analysis, either by calculating them directly or by using software that provides them as
part of the output. This typically involves presenting the estimates of the coefficients and their
corresponding standard errors, as well as any relevant statistical tests or confidence intervals.

The statement "Correlation between the regressors and the residuals (û) causes the OLS
estimators to be biased" is false because correlation between the regressors and the residuals
does not necessarily cause the OLS estimators to be biased.

The OLS estimators are a set of parameter values that are calculated to minimize the sum of the
squared residuals for a given linear regression model. The OLS estimators are unbiased if the
assumptions of the linear regression model are satisfied, regardless of whether there is
correlation between the regressors and the residuals.

However, there are other factors that can cause the OLS estimators to be biased, such as omitted
variable bias (omitting a variable that is correlated with the dependent variable from the model)
or multicollinearity (high correlation between the independent variables). These factors can
affect the accuracy of the OLS estimators, but they are not related to the correlation between
the regressors and the residuals.
EXAMPLE EXAM
(a) The coefficient beta1 (β1) represents the estimated change in the mutual fund's
performance (RET) for a one unit change in the average SAT score of the manager's
undergraduate institution (SAT), ceteris paribus.
Beta2 (β2) represents the estimated difference in the mutual fund's performance
(RET) if the manager has an MBA degree or not.

(b) The population intercept for a manager of an equity and fixed income fund with
an MBA degree would be β0 + β2, since the EQTY variable is equal to 0 and the
MBA variable is equal to 1.
The population intercept for a manager of an equity only fund without an MBA
degree would be β0 + β5, since the EQTY variable is equal to 1 and the MBA
variable is equal to 0.

(c) Including the return of the S&P500 in the model as a control for market
movements could potentially be a good idea, as it may help to control for any
external factors that may be influencing the mutual fund's performance. However,
it's also possible that the S&P500 return is correlated with other factors in the
model, which could lead to problems with multicollinearity. It's important to
carefully consider the potential benefits and drawbacks of including this variable
in the model.

(d) According to the model, the difference in expected returns between the fund
managed by Bob and the one managed by Jack would be β4 * (Bob's tenure - Jack's
tenure). In this case, the difference in expected returns would be:

β4 * (5 - 1) = 4 * β4.
(b) The variables that have a statistically significant effect on performance at the 95%
confidence level are SAT and EQTY. The coefficient for SAT represents the average
estimated change in the mutual fund's performance (RET) for a one unit change in the
average SAT score of the manager's undergraduate institution (SAT).
The coefficient for EQTY represents the estimated average difference in the mutual fund's
performance (RET) between equity only funds and both equity and fixed income funds.

(c) The p-value for SAT is 0.032, which means that, if the true coefficient on SAT were zero, there would be only a 3.2% probability of observing a test statistic at least as extreme as the one obtained. This is a relatively low p-value, below the usual 5% significance level, so we reject the null hypothesis and conclude that the average SAT score of the manager's undergraduate institution has a statistically significant effect on the mutual fund's performance (RET).

(d)
(i) If it is true that fund managers without an MBA invest in riskier stocks, then the
coefficient for the MBA variable would be biased. This is because the coefficient estimates
the effect of the MBA dummy variable on the mutual fund's performance, and if fund
managers without an MBA are investing in riskier stocks, then this would confound the
relationship between the MBA dummy variable and the mutual fund's performance.

(ii) If the funds' market beta (a measure of the exposure to systematic risk in the fund's
portfolio) was included as a control variable, it would likely have a moderating effect on
the coefficient of the MBA variable. This is because the market beta would help to control
for the effect of systematic risk on the mutual fund's performance, which would allow for
a more accurate estimate of the effect of the MBA dummy variable on the mutual fund's
performance.
R-squared is a measure of the proportion of the variance in the dependent variable (RET
in this case) that is explained by the independent variables in the regression model. A high
R-squared value indicates that a large proportion of the variance in the dependent
variable is explained by the independent variables, while a low R-squared value indicates
that a small proportion of the variance in the dependent variable is explained by the
independent variables.

It is possible for a regression model to have a high R-squared value and still be considered
"useless" if the model does not accurately capture the underlying relationships in the data.
For example, if the model includes variables that are not related to the dependent variable
or if the model is overly simplified and does not adequately represent the complexity of
the data, then the model may have a high R-squared value but still be considered
"useless" because it is not an accurate representation of the data.

On the other hand, it is also possible for a regression model to have a low R-squared value
and still be considered useful if the model accurately captures the underlying relationships
in the data. For example, if the model includes only a few variables that are strongly
related to the dependent variable, then the model may have a low R-squared value but
still be considered useful because it accurately represents the relationships in the data.
Therefore, it is important to consider both the R-squared value and the underlying
relationships in the data when evaluating the usefulness of a regression model.
Based on the result of the Breusch-Pagan test, there is insufficient evidence to conclude
that heteroskedasticity is present in the data. The p-value of the test (0.1708) is greater
than the significance level (usually set at 0.05), which indicates that there is not enough
evidence to reject the null hypothesis of homoskedasticity.

(a) Based on the result of the Breusch-Pagan test and the information provided, it is not
possible to conclude whether the homoskedasticity assumption is valid or not.

(b) The result of the Breusch-Pagan test does not conflict with the evidence from the plot
because the test does not provide sufficient evidence to conclude that
heteroskedasticity is present in the data.

(c) If the assumption of constant variance is violated in a regression model, the OLS coefficient estimators remain unbiased (provided the other assumptions hold) but are no longer efficient, and the usual standard errors and test statistics are invalid, which affects the interpretation and reliability of the inference drawn from the empirical analysis. However, based on the result of the Breusch-Pagan test, there is not enough evidence to conclude that the assumption of constant variance has been violated in this case.

(d) If heteroskedasticity is present in the data, one way to account for it is to use a different
estimation method, such as weighted least squares, which allows for different
variances of the errors across the fitted values. Another option is to transform the
dependent variable or the independent variables in order to stabilize the variance of
the errors. Another option is to use heteroskedasticity-consistent standard errors,
which are adjusted to account for the presence of heteroskedasticity. However, based
on the result of the Breusch-Pagan test, there is not enough evidence to conclude that
heteroskedasticity is present in this case, so these methods may not be necessary.

True or False

A p-value = 0.23 means that the null hypothesis is rejected with a probability of 23%.
True or False

False. A p-value is the probability of obtaining a test statistic at least as extreme as the
one observed, given that the null hypothesis is true. Therefore, a p-value greater than the
chosen significance level (usually set at 0.05) indicates that there is not enough evidence
to reject the null hypothesis. In other words, a p-value of 0.23 means that there is a 23%
probability of obtaining the observed test statistic, or a more extreme test statistic, given
that the null hypothesis is true. This means that there is insufficient evidence to reject the
null hypothesis at the chosen significance level.

True. The expression Σ_{i=1}^{n} (yi − ȳ)² is the total sum of squares (SST) for a given model, where yi is the observed value of the dependent variable for the i-th observation, ȳ is the mean of the observed values of the dependent variable, and n is the number of observations.
The expression Σ_{i=1}^{n} (yi − ŷi)² is the sum of squared residuals (SSR), where ŷi is the predicted value of the dependent variable for the i-th observation.
Since SST = SSE + SSR, where SSE = Σ_{i=1}^{n} (ŷi − ȳ)² is the explained sum of squares and is never negative, the total sum of squares can never be smaller than the sum of squared residuals. Therefore, it is always true that the total variation in the dependent variable is never smaller than the residual variation left unexplained by the model.

You are correct that the relationship between the coefficients in the two models is given by γ1 =
β1 × 12 and the relationship between the standard errors of the coefficients in the two models
is given by se(γ1) = se(β1) × 12. This is because the two models are estimating the same
relationship between weight and the number of chocolate chip cookies eaten in a week but using
different units of measurement. In the first model, the independent variable is measured in the
number of cookies eaten per week, while in the second model, the independent variable is
measured in the number of boxes of cookies eaten per week. Since a box of chocolate chip
cookies contains 12 cookies, the relationship between the number of boxes of cookies eaten per
week and the number of cookies eaten per week is given by boxi = cookiesi / 12. This means that
the coefficients and standard errors in the two models are related by the factor of 12.
What is the consequence of omitting a variable in a MLR?

In a multiple linear regression (MLR) model, omitting a variable can have several consequences,
including:
1. Biased estimates: If the omitted variable is related to the outcome and the predictor
variables in the model, then the estimates of the coefficients for the predictor variables
may be biased.

2. Decreased model fit: Omitting a relevant variable may also result in a decrease in the
overall fit of the model. This can be seen in a decrease in the R-squared value, which is a
measure of how well the model fits the data.

3. Incorrect inferences: If a variable is omitted from the model, it may be difficult to accurately
interpret the results of the model. For example, if an omitted variable is correlated with
both the outcome and the predictor variables, it may be difficult to accurately interpret the
relationships between the predictor variables and the outcome.

4. Inconsistent results: Omitting a variable may also lead to inconsistent results, especially if
the variable is related to the outcome in different ways at different levels of the predictor
variables.

It is important to carefully consider which variables to include in an MLR model, as omitting a relevant
variable can have significant consequences on the results and interpretation of the model.
