
FEDERAL STATE AUTONOMOUS EDUCATIONAL INSTITUTION

FOR HIGHER PROFESSIONAL EDUCATION


NATIONAL RESEARCH UNIVERSITY HIGHER SCHOOL OF ECONOMICS

Faculty: Banking Institute

Homework

Ivanov Vladimir
Shcheglova Mariia
Zhakina Anastasiia
Gnedovskaya Elizaveta
Tazabaeva Kamila

Moscow
2024
Table of contents
Introduction
Part 1. Classical regression analysis
Part 1a
Part 1b
Part 2. Violations of other assumptions
Conclusion
Appendix
Introduction

The economic problem to be addressed is to identify the key factors influencing housing prices in
Moscow and the Moscow region.

The goal of the research is to investigate the relationship between housing prices and various factors
such as apartment type, proximity to metro stations, location within different regions, number of
rooms, total area, living area, kitchen area, floor level, number of floors, and renovation status in
both Moscow and the Moscow region.

Stated hypotheses:

1. What are the main factors that contribute to the change in the housing prices in Moscow
and the Moscow Oblast region?

2. Is there a possible connection between housing prices and distance to the closest metro
station?

3. Is there a significant difference between housing prices in Moscow and the Moscow
oblast region?

The data were collected in November 2023 and consist of housing prices in Moscow and the Moscow
Oblast region. The number of observations is 22 676.

Description of the variables:

Variable     Description
price        Price of the apartment (in millions of rubles); our dependent variable.
type         Type of apartment. A binary variable equal to 1 if the housing is secondary and 0 if it is new construction.
mins         The time in minutes required to walk from the apartment to the nearest metro station.
moscow       The region where the apartment is located. A binary variable equal to 1 if the region is Moscow and 0 if it is the Moscow region.
rooms        The total number of rooms in the apartment, including bedrooms, living rooms, etc.
area         The total area of the apartment in square meters.
livingarea   The living area of the apartment in square meters, i.e., the area usable for living.
kitchen      The area of the kitchen in square meters.
floor        The floor on which the apartment is located.
cosmetic     Binary variable equal to 1 if the apartment has cosmetic renovation.
euro         Binary variable equal to 1 if the apartment has euro-style renovation.
designer     Binary variable equal to 1 if the apartment has designer renovation.

The collected data are cross-sectional: they were collected at a single point in time (November 2023),
and several objects (a number of flats) were observed.
d) Provide plots and graphs to illustrate the variables. Use bar charts and box plots to illustrate
categorical variables, scatter plots to illustrate the relationships between the variables.

Figure 1. Box plots for price, area and floor variables

The median apartment price is 12.5 million rubles, and the average is 15.2 million rubles. Half of
the values lie between 10 million and 18 million rubles. The maximum value excluding outliers is
below 30 million rubles, while the minimum value is below 1 million rubles.

The median and the average apartment area are both about 38-39 square meters. Half of the values
lie between 35 and 42 square meters. The maximum value excluding outliers is 54 square meters,
while the minimum value is 22 square meters.
The average floor of the apartments for sale is about the 9th. At the same time, half of the values
lie between the 4th and the 13th floor.

Most of the observations (12117) relate to Moscow.


Most of the observations (13367) relate to 1-room apartments.

There is a positive dependence of the price of an apartment on its area. However, it seems that the
dependence is non-linear, and after a certain area (around 40 square meters), the price begins to
grow faster as the area grows. Perhaps there is no fundamental difference between apartments of 25
and 35 square meters for buyers, unlike apartments of 50 and 60 square meters.
It is noticeable that the average price of an apartment in Moscow is much higher than in the
Moscow region. At the same time, the minimum price in the Moscow region exceeds the minimum
price in Moscow.
Part 1. Classical regression analysis
Part 1a

a) Choose a model and state the assumptions about the nature of explanatory variables and
disturbances. The model should include categorical and continuous explanatory variables as
discussed before.

The basic model (1.1) is presented below. We have not included all the variables in it, but only
those whose coefficient signs and effects we can explain with common sense.

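$$\mathit{price}_i = \beta_0 + \beta_1\,\mathit{mins}_i + \beta_2\,\mathit{moscow}_i + \beta_3\,\mathit{rooms}_i + \beta_4\,\mathit{type}_i + \varepsilon_i \tag{1.1}$$

where:

$\beta_0$ is the intercept term;

$\beta_1, \beta_2, \beta_3$ and $\beta_4$ are the coefficients associated with each explanatory variable;

$\varepsilon_i$ is the error term.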

We believe that

1) the longer it takes to get to the metro, the cheaper the apartment ($\beta_1 < 0$), because
proximity to the metro station usually increases the attractiveness of housing due to
convenience and quick access to transport;

2) apartments in the Moscow region are cheaper ($\beta_2 > 0$), because we believe that the
infrastructure and access to services in the Moscow region may be less developed;

3) the more rooms, the more expensive the apartment, because more rooms mean a larger area
($\beta_3 > 0$); however, if we analyzed the price per square meter, the result would be the opposite;

4) secondary housing is cheaper ($\beta_4 < 0$), because such housing often shows wear and tear and
requires repair or modernization.

We estimate the model by ordinary least squares (OLS) and therefore assume that the Gauss-Markov
assumptions are fulfilled, namely independence (one observation does not influence another),
homoscedasticity (the variance of the errors is constant for all values of the explanatory variables) and
exogeneity (the explanatory variables are not correlated with the errors). In addition, we assume that
the errors are normally distributed with a mean of 0.

The results of the evaluation of the (1.1) model are presented in Table 1.
Table 1. Base model

b) Propose a linear hypothesis to test using t-test.

Despite the fact that R gives us the p-value automatically (see Table 1), we can manually check the
significance of the coefficients using the t-test. Let's consider testing the significance of the variable
related to the proximity to the metro station.

$H_0: \beta_1 = 0$

$H_1: \beta_1 \neq 0$

To test this hypothesis, we can conduct a t-test for the coefficient:

$$t = \frac{\hat{\beta}_i}{SE(\hat{\beta}_i)} = \frac{1.17}{0.08} \approx 14.63,$$

where:
$\hat{\beta}_i$ is the estimated coefficient for variable $i$;
$SE(\hat{\beta}_i)$ is the standard error of the coefficient estimate.

Under the null hypothesis, $t$ follows a t-distribution with $n - k - 1$ degrees of freedom,
where
$n$ is the number of observations and
$k$ is the number of explanatory variables (excluding the intercept).

We can use the t-test statistic to obtain the p-value associated with the test. If the p-value is less than
a chosen significance level (e.g., 0.05), we reject the null hypothesis and conclude that the variable
mins is significant in explaining the variation in the price of the apartment.

We perform a similar procedure for the other variables.
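The manual calculation described above can be sketched in R as follows. This is only an illustration on simulated data: the data frame `sim` and the coefficients used to generate it are hypothetical and are not the housing data set. The t-statistic is the estimate divided by its standard error, and the two-sided p-value comes from the t-distribution with n − k − 1 degrees of freedom.

```r
# Illustrative only: simulated data, not the housing data set
set.seed(1)
n <- 500
sim <- data.frame(mins  = runif(n, 1, 30),
                  rooms = sample(1:4, n, replace = TRUE))
sim$price <- 10 - 0.2 * sim$mins + 3 * sim$rooms + rnorm(n, sd = 2)

fit <- lm(price ~ mins + rooms, data = sim)

b_hat  <- coef(fit)["mins"]                      # estimated coefficient
se     <- sqrt(diag(vcov(fit)))["mins"]          # its standard error
t_stat <- unname(b_hat / se)                     # t = estimate / SE
df     <- df.residual(fit)                       # n - k - 1 (here k = 2)
p_val  <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)  # two-sided p-value

c(t = t_stat, df = df, p = p_val)
```

The resulting values should coincide with the `t value` and `Pr(>|t|)` columns that `summary(fit)` reports for `mins`.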

c) Propose an alternative model which can be considered as an extended version of a base model.
Conduct F-test to choose between models.

In the alternative version (1.2) of the model, we decided to include all variables from the dataset.

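$$\begin{aligned}\mathit{price}_i = {} & \beta_0 + \beta_1\,\mathit{mins}_i + \beta_2\,\mathit{moscow}_i + \beta_3\,\mathit{rooms}_i + \beta_4\,\mathit{type}_i \\ & + \beta_5\,\mathit{area}_i + \beta_6\,\mathit{livingarea}_i + \beta_7\,\mathit{floor}_i + \beta_8\,\mathit{euro}_i + \beta_9\,\mathit{designer}_i + \varepsilon_i\end{aligned} \tag{1.2}$$

The results of the evaluation of the (1.2) model are presented in Table 2.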

Table 2. Alternative model estimation result


We have removed the insignificant variables, namely kitchen and cosmetic. To understand
which of the two models is more appropriate, we conduct an F-test. Let us denote:

$RSS_1$ as the sum of squared residuals from the reduced model (base model);

$RSS_2$ as the sum of squared residuals from the full model (alternative model);

$k_1$ as the number of explanatory variables in the reduced model (excluding the intercept);

$k_2$ as the number of explanatory variables in the full model (excluding the intercept).

$H_0: \beta_5 = \beta_6 = \beta_7 = \beta_8 = \beta_9 = 0$

$$F = \frac{(RSS_1 - RSS_2)/(k_2 - k_1)}{RSS_2/(n - k_2 - 1)} = 4764.3$$

Calculated F-statistic is greater than the critical value => we reject the null hypothesis and conclude
that the full model (alternative model) is more appropriate than the reduced model (base model).

The results of the ANOVA test from R are presented in Table 3.

Table 3. ANOVA test result

Part 1b

d) Propose an alternative model assuming misspecification of the first model in (a). Describe in
what way the model can be misspecified. What are the consequences of misspecification types on
the estimates of the model?

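To address this point, we first identified the best model (1.4) in g) and then deliberately «ruined»
it. The incorrect model looks like this:

$$\begin{aligned}\mathit{price}_i = {} & \beta_0 + \beta_1\,\mathit{type}_i + \beta_2\,\mathit{moscow}_i + \beta_3\,\mathit{rooms}_i + \beta_4\,\mathit{area}_i \\ & + \beta_5 \ln(\mathit{floor}_i) + \beta_6\,\mathit{euro}_i + \beta_7\,\mathit{designer}_i + \beta_8\,\mathit{cosmetic}_i + \varepsilon_i\end{aligned} \tag{1.3}$$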


Table 4. Incorrect model estimate results

Based on the results (see Table 4), the model may suffer from the following problems:

1) Irrelevant variables are included (moscow, cosmetic, kitchen);

2) One or more relevant variables are omitted, i.e. variables with significant coefficients
are not included in the model (livingarea);

3) Low adjusted $R^2$;

4) Misspecified functional form (the logarithm of price is not taken);

5) A potentially missed structural break.

e) Test for the correct specification using

i. Various sets of explanatory variables;


ii. Linear vs. Non-linear form of the model;
iii. Ramsey test that allows testing for possible omitted variables that are not in the available set;
iv. Chow test.
In each step describe the hypothesis to be tested. Describe the violated assumptions and
consequences for the OLS estimates. Name the test employed for testing. Write down a conclusion
whether the hypothesis is rejected or accepted and what this means from an economic point of
view.

i) We can try different combinations of variables to see which set provides the best fit for the data.
We build alternative models by adding potentially relevant variables to, or removing potentially
irrelevant variables from, the incorrect model. We compute AIC/BIC and adjusted $R^2$ for
each model (rather than $R^2$, because adjusted $R^2$ accounts for the trade-off between goodness-of-fit
and the number of regressors employed in the model).

Add. variables                          AIC        BIC        Adj. R2
moscow                                  235476.3   235548.4   0.6507664
kitchen                                 235476.2   235548.2   0.6507685
cosmetic                                235476.2   235548.2   0.6507686
livingarea                              235444.3   235516.3   0.6512715
moscow + kitchen                        235478.2   235558.2   0.6507532
moscow + cosmetic                       235478.2   235558.2   0.6507532
moscow + livingarea                     235446.3   235526.3   0.6512561
kitchen + cosmetic                      235478.0   235558.1   0.6507552
kitchen + livingarea                    235445.9   235526.0   0.6512613
cosmetic + livingarea                   235446.1   235526.1   0.6512590
moscow + kitchen + cosmetic             235480.0   235568.1   0.6507397
moscow + kitchen + livingarea           235447.9   235536.0   0.6512462
moscow + cosmetic + livingarea          235448.1   235536.1   0.6512434
kitchen + cosmetic + livingarea         235447.7   235535.7   0.6512497
moscow + kitchen + cosmetic + livingarea 235449.7  235545.7   0.6512343
Table 5. AIC/BIC and adjusted R2 for incorrect model specifications

According to Table 5, the model that adds only the livingarea variable is the best-fitting one among
the tested combinations of additional variables on all three criteria. Adjusted $R^2$ increases only
marginally, to about 0.65, compared to the base incorrect model.
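For reference, the three criteria reported in Table 5 follow the usual textbook definitions. In the sketch below the symbols are introduced only for illustration: $\hat{L}$ is the maximized likelihood of the model, $p$ is the number of estimated parameters, $n$ is the number of observations and $k$ is the number of regressors excluding the intercept.

```latex
\begin{aligned}
AIC &= -2\ln\hat{L} + 2p, \\
BIC &= -2\ln\hat{L} + p\ln n, \\
\bar{R}^2 &= 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}.
\end{aligned}
```

Lower AIC and BIC values and a higher adjusted $R^2$ indicate a better balance between fit and model size; BIC penalizes additional regressors more heavily than AIC whenever $\ln n > 2$.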

ii) + iii) We can test for non-linear relationships between the dependent and independent variables
with 1) the Box-Cox test and 2) the Ramsey test, which helps to detect misspecification of the
functional form of the model. To detect omitted variables, we can use 1) the «short» and «long»
regression test and 2) the same Ramsey test.
The Box-Cox test suggests a log transformation of price for the model, since λ is approximately equal to 0.

Figure 2. BoxCox test for incorrect model
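The idea behind the Box-Cox procedure can be sketched by hand in base R. The example below uses simulated data (the variables `x` and `y` and the helper `boxcox_loglik` are hypothetical and not part of our script): for each candidate λ the response is transformed, the regression is refitted, and the profile log-likelihood is recorded; the λ with the highest value is the suggested transformation, and a value near 0 points to the logarithm.

```r
# Illustrative only: simulated data in which log(y) is linear in x
set.seed(2)
n <- 1000
x <- runif(n, 20, 100)
y <- exp(1 + 0.03 * x + rnorm(n, sd = 0.3))

boxcox_loglik <- function(lambda, y, x) {
  # Box-Cox transform of the response (log for lambda = 0)
  z <- if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
  rss <- sum(residuals(lm(z ~ x))^2)
  # Profile log-likelihood, including the Jacobian term (lambda - 1) * sum(log(y))
  -length(y) / 2 * log(rss / length(y)) + (lambda - 1) * sum(log(y))
}

lambdas    <- seq(-2, 2, by = 0.05)
loglik     <- sapply(lambdas, boxcox_loglik, y = y, x = x)
lambda_hat <- lambdas[which.max(loglik)]          # lambda with the highest likelihood
lambda_hat
```

Plotting `loglik` against `lambdas` gives a curve of the same kind as the one shown in Figure 2.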

The Ramsey test is based on the idea that, under the null hypothesis, nonlinear functions of $\hat{y}_i$
should not help to explain $y_i$ (in an auxiliary regression). Since the p-value of the Ramsey test is
2.2e-16, H0 («the model is correctly specified») is rejected, meaning that the model is misspecified.
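The auxiliary regression behind the Ramsey test can be sketched in base R as follows. The data are simulated and the names `x`, `y`, `fit` and `aux` are hypothetical; the sketch adds the second and third powers of the fitted values to a deliberately misspecified linear model and tests them jointly with an F-test.

```r
# Illustrative only: simulated data with a quadratic relationship
set.seed(3)
n <- 500
x <- runif(n, 20, 100)
y <- 5 + 0.002 * x^2 + rnorm(n)

fit   <- lm(y ~ x)                                # misspecified: linear term only
y_hat <- fitted(fit)
aux   <- lm(y ~ x + I(y_hat^2) + I(y_hat^3))      # auxiliary regression

reset   <- anova(fit, aux)                        # F-test of H0: powers of y_hat add nothing
reset_p <- reset[["Pr(>F)"]][2]
reset_p
```

A small p-value, as in this simulated case, means that powers of the fitted values still explain the response, so the functional form is rejected.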

In the «long» regression, we add the potentially omitted variable livingarea to the incorrect
model and conduct an F-test. H0: the additional variables included in the long regression are not
statistically significant. The result is presented below: livingarea is significant, since the p-value
equals 7.02e-14.

Table 6. Short/Long regression test result

iv) The Chow test is used to test for structural breaks in the data, which would indicate that different
groups or segments within the data follow different relationships. The null hypothesis is that there is
no structural break. We want to understand whether pricing really differs between Moscow and the
Moscow region. Recall that the general specification is

$$y_i = X_i'\beta + g_i X_i'\gamma + \varepsilon_i$$

where:

$g_i$ is the group indicator.

$H_0: \gamma = 0$

Since the p-value is effectively 0, we reject the null hypothesis and conclude that there is evidence of
a structural break: the relationship differs between the observations in Moscow and those in the Moscow region.
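The specification with the group indicator can be sketched in base R. The data below are simulated, and `g`, `x`, `y`, `pooled` and `unpooled` are hypothetical names: the restricted model imposes γ = 0, the unrestricted model lets both the intercept and the slope differ by group, and the F-test compares the two.

```r
# Illustrative only: simulated data with a different intercept and slope per group
set.seed(4)
n <- 600
g <- rbinom(n, 1, 0.5)                            # group indicator (0/1)
x <- runif(n, 20, 100)
y <- 2 + 0.10 * x + g * (3 + 0.05 * x) + rnorm(n)

pooled   <- lm(y ~ x)                             # restricted model: gamma = 0
unpooled <- lm(y ~ x + g + g:x)                   # unrestricted: coefficients differ by group

chow   <- anova(pooled, unpooled)                 # F-test of H0: gamma = 0
chow_F <- chow[["F"]][2]
chow_p <- chow[["Pr(>F)"]][2]
c(F = chow_F, p = chow_p)
```

The same F-statistic is obtained from the classical Chow formula that compares the pooled residual sum of squares with the sum of the two group-wise residual sums of squares.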
f) Conduct tests or procedures not listed above if you find them relevant;

We have conducted the Breusch-Pagan test for heteroscedasticity and checked for multicollinearity
using VIF.

No multicollinearity was detected, since all VIF values are below 10.

Variable   type   moscow   area   ln(floor)   euro   designer   cosmetic   kitchen
VIF        2.4    1.6      2.4    1.1         1.8    1.7        3.5        2.2
Table 7. VIF test result
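The VIF values can be reproduced by hand: for each regressor, VIF = 1 / (1 − R²), where R² comes from regressing that variable on all the other regressors. The sketch below uses simulated data; the data frame `X` and its columns are hypothetical and only borrow the names of our variables.

```r
# Illustrative only: simulated regressors, livingarea is correlated with area by construction
set.seed(5)
n <- 500
area       <- runif(n, 20, 100)
livingarea <- 0.6 * area + rnorm(n, sd = 5)
floor      <- sample(1:20, n, replace = TRUE)
X <- data.frame(area, livingarea, floor)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing x_j on the other regressors
vif_manual <- sapply(names(X), function(v) {
  f  <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})
round(vif_manual, 2)
```

In this simulated example the two correlated area variables receive much higher VIF values than `floor`, which is unrelated to them.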

For the Breusch-Pagan test:

$H_0: Var(\varepsilon_i) = \sigma_i^2 = \sigma^2$

$H_1: Var(\varepsilon_i) = \sigma_i^2 = f(\alpha_0 + \alpha_1 z_{1i} + \dots + \alpha_r z_{ri})$

Since the p-value equals 2.2e-16, we reject the null hypothesis; therefore heteroscedasticity is
present.
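The logic of the test can be sketched by hand in base R. The example uses simulated data (`x`, `y`, `fit` and `aux` are hypothetical names) and the studentized form of the statistic, LM = n · R², where R² comes from regressing the squared OLS residuals on the variables z (here simply the regressor itself).

```r
# Illustrative only: simulated data in which the error spread grows with x
set.seed(6)
n <- 500
x <- runif(n, 20, 100)
y <- 1 + 0.1 * x + rnorm(n, sd = 0.05 * x)

fit <- lm(y ~ x)
e2  <- residuals(fit)^2
aux <- lm(e2 ~ x)                                 # squared residuals on z (here z = x)

bp_stat <- n * summary(aux)$r.squared             # LM statistic = n * R^2
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)  # df = number of z variables
c(BP = bp_stat, p = bp_p)
```

Under H0 the statistic is asymptotically chi-squared distributed, so a small p-value leads to rejecting homoscedasticity.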

g) Based on Part 1b steps conclude what is the final model you choose.
Write down an equation with estimated parameters that respects the conclusion. Make an
interpretation of the significant coefficients.

Based on the analysis of various model specifications, the final chosen model is best_model_5. Here is
the rationale:

1) Model Selection Process:

• Initially, several model specifications were considered, starting with best_model_1, which was our
alternative model (1.2). In this model, we included the available data and removed all
variables that were insignificant according to the t-test, namely cosmetic and kitchen (note,
however, that the final model includes them).
• The Box-Cox transformation was applied to investigate the need for transforming the
dependent variable; it indicated that a logarithmic transformation is necessary.
Figure 3. BoxCox test for best_model_1

• Subsequent models were built by iteratively adding or modifying variables to improve
model fit and address potential issues.

Table 8. Models’ comparison (results)


Models’ Comparison:

• best_model_5 showed the highest adjusted $R^2$ (0.8962), indicating the best fit among the considered
models.
• Diagnostic plots, such as component-residual (CR) plots, confirmed the adequacy of the
final model. CR plots are useful for detecting non-linearity in the relationship between each
predictor variable and the dependent variable: they plot the partial residuals (the component of
the predictor plus the residuals) against the predictor itself.

Figure 4. Component-residual (CR) plots for best_model_5

• Additionally, we have calculated AIC and BIC. best_model_5 has the lowest AIC value
(17107.87), indicating the best fit among the five models.

Model          AIC         BIC
best_model_1   235330.60   235418.65
best_model_2   24507.93    24611.99
best_model_3   17760.22    17880.29
best_model_4   17761.32    17873.38
best_model_5   17107.87    17219.94
Table 9. AIC/BIC criteria for models
Final Model Specifications:

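$$\begin{aligned}\ln(\mathit{price}_i) = {} & \beta_0 + \beta_1\,\mathit{type}_i + \beta_2\,\mathit{mins}_i + \beta_3\,\mathit{moscow}_i + \beta_4\,\mathit{rooms}_i + \beta_5\,\mathit{area}_i \\ & + \beta_6\,\mathit{area}_i^2 + \beta_7 \ln(\mathit{livingarea}_i) + \beta_8\,\mathit{floor}_i + \beta_9\,\mathit{euro}_i + \beta_{10}\,\mathit{designer}_i \\ & + \beta_{11}\,\mathit{cosmetic}_i + \beta_{12}\,\mathit{kitchen}_i + \varepsilon_i\end{aligned} \tag{1.4}$$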

Part 2. Violations of other assumptions

Use the model chosen in Part 1 e) and further develop it.

a) Describe why multicollinearity can be a problem in general and, in particular, in the final
model you have chosen in Part 1b. Test your model for multicollinearity. If there is a problem of
multicollinearity, try to solve it.

Multicollinearity is a situation in which a regression model contains two or more predictor variables
that are highly correlated with each other. Multicollinearity may be a problem in several
respects:

• Multicollinearity creates difficulties in interpreting the individual coefficients of correlated
variables, because the effect of one variable on the outcome variable cannot be
separated from the effect of the other correlated variables;
• Multicollinearity can reduce the predictive power of the model. When predictor variables
are highly correlated, it becomes more difficult for the model to accurately estimate the true
relationship between the predictor variables and the outcome variable;
• Multicollinearity can cause inflated standard errors, which leads to wider confidence
intervals and difficulties in determining the significance of variables;
• Multicollinearity makes it difficult to select significant variables for the model. Variables
that are correlated with each other may falsely appear to be less significant.

In the case of the chosen model, multicollinearity could cause problems in determining the
significance of certain variables. For example, the effect of the variable rooms could be
overestimated, leading to incorrect conclusions.

The test has demonstrated that there is no multicollinearity in the final model selected in
Part 1b, with one exception: the area variable. In the final model this variable also enters in
squared form (see (1.4)), and the two terms are correlated by construction.

b) Describe what are the consequences of heteroscedasticity of residuals for OLS estimates. Test
for heteroscedasticity and make a conclusion. If there is heteroscedasticity describe approaches to
deal with it. Run an estimation that provides better than OLS estimates. Provide comments on the
resulting estimates of the parameters and standard errors.

Heteroscedasticity is a situation where the error variance in a regression model is not constant
across all levels of the independent variables. Heteroscedasticity can create problems for OLS:
• Inefficient coefficient estimates: in the presence of heteroscedasticity, the OLS estimator is still
unbiased, but it is no longer efficient;
• Incorrect inference: heteroscedasticity can lead to incorrect inferences about the
statistical significance of the coefficients, because the t-statistics and p-values may be unreliable;
• Possible model misspecification: heteroscedasticity may signal that the functional form of the
model does not accurately capture the relationship between the independent and dependent
variables, which leads to incorrect interpretations of these relationships and inaccurate predictions.

To detect heteroscedasticity in the model (1.4), the Breusch-Pagan test was used, which
demonstrated the presence of heteroscedasticity. There are several approaches to deal with
heteroscedasticity in regression analysis:
• Transformation of variables. The dependent or independent variables can be transformed
using mathematical functions (logarithms or square roots). This makes it possible to stabilize
the error variance and make the relationship between variables linear;
• Robust standard errors. These adjust the standard errors of the coefficient estimates to
correct for heteroscedasticity. This method does not require data transformation or a change in
the estimation method and is robust to violations of the homoscedasticity assumption.

In the case of our model, we used robust standard errors to solve the problem of heteroscedasticity
(see Table 10).
Table 10. Best model with robust SE estimation results
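How robust standard errors differ from the usual OLS ones can be sketched in base R with the White (HC0) sandwich formula, (X'X)⁻¹ X' diag(e²) X (X'X)⁻¹. The data are simulated and all object names are hypothetical; the sketch does not claim to reproduce the exact variant used for Table 10.

```r
# Illustrative only: simulated data with heteroscedastic errors
set.seed(7)
n <- 500
x <- runif(n, 20, 100)
y <- 1 + 0.1 * x + rnorm(n, sd = 0.05 * x)

fit <- lm(y ~ x)
X   <- model.matrix(fit)                          # design matrix with intercept
e   <- residuals(fit)

bread    <- solve(crossprod(X))                   # (X'X)^{-1}
meat     <- crossprod(X * e)                      # X' diag(e^2) X
vcov_hc0 <- bread %*% meat %*% bread              # White (HC0) covariance matrix

se_ols    <- sqrt(diag(vcov(fit)))                # conventional standard errors
se_robust <- sqrt(diag(vcov_hc0))                 # heteroscedasticity-robust standard errors
cbind(estimate = coef(fit), se_ols, se_robust)
```

The coefficient estimates are unchanged; only the standard errors, and hence the t-statistics and p-values, are corrected.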

c) Discuss whether there is a problem of endogeneity in your model. Explain, what variables might
be endogenous. What instruments would you use if they were available?

Exogeneity of the regressors is the second assumption of the Gauss-Markov theorem, and its
violation, endogeneity, is one of the core problems in data analysis. Endogeneity is a situation in
which variables that should be exogenous are in fact endogenous, for example because of reverse
causality. Regressors and disturbances in the model should be independent or at least
uncorrelated; since independence is a very strong assumption, they are usually only required to be
uncorrelated. When this assumption is violated, for instance because of measurement error, the
estimated slopes are biased and inconsistent. We regard the endogeneity problem as most common
with time-series data. Since we consider cross-sectional data, we consider the presence of endogeneity
in the variables rather unlikely, although it cannot be ruled out.

To detect endogeneity in the model (1.4), the Hausman test is used, which shows
whether the regressors are exogenous or endogenous. It is worth mentioning that the Hausman
test can only be performed after the model has been estimated with instrumental variables.

To solve the problem of endogenous variables, the instrumental variable approach can be used. An
instrumental variable captures the exogenous variation of the endogenous variable: it filters the
endogenous variable, keeping only its exogenous part. The instrument must be correlated with the
endogenous regressor and uncorrelated with the error term, which usually requires additional data
or an additional variable; in other words, the instrument itself has to be exogenous and valid.
Additionally, the number of instruments should be at least as large as the number of endogenous
variables.
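The filtering idea can be sketched as two-stage least squares in base R. Since no instruments are available in our data set, the example is purely simulated: `z` (the instrument), `u` (an unobserved factor) and all other names are hypothetical.

```r
# Illustrative only: simulated data with an endogenous regressor x
set.seed(8)
n <- 2000
z <- rnorm(n)                                     # instrument: relevant and exogenous
u <- rnorm(n)                                     # unobserved factor causing endogeneity
x <- 0.8 * z + u + rnorm(n)                       # endogenous regressor (correlated with u)
y <- 1 + 0.5 * x + u + rnorm(n)                   # true slope is 0.5

ols <- lm(y ~ x)                                  # biased: x is correlated with the error

stage1 <- lm(x ~ z)                               # first stage: exogenous part of x
x_hat  <- fitted(stage1)
stage2 <- lm(y ~ x_hat)                           # second stage: regress y on the fitted x

c(ols = unname(coef(ols)["x"]), iv = unname(coef(stage2)["x_hat"]))
# Note: the standard errors printed by summary(stage2) are not valid 2SLS standard errors
```

In this simulation the OLS slope is pushed away from the true value by the unobserved factor, while the two-stage estimate stays close to it.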

d) Write the best fitted model after all the steps of the analysis. Provide comments on the
interpretation of the coefficients and goodness-of-fit. Are the stated in Introduction hypotheses
confirmed by empirical evidence?

According to the results of the analysis, the best fitted model is the following:

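$$\begin{aligned}\ln(\mathit{price}_i) = {} & 1.2 + 0.35\,\mathit{type}_i - 0.01\,\mathit{mins}_i - 0.38\,\mathit{moscow}_i + 0.05\,\mathit{rooms}_i + 0.01\,\mathit{area}_i \\ & - 0.00001\,\mathit{area}_i^2 + 0.18 \ln(\mathit{livingarea}_i) + 0.003\,\mathit{floor}_i - 0.09\,\mathit{euro}_i + 0.23\,\mathit{designer}_i \\ & - 0.26\,\mathit{cosmetic}_i + 0.01\,\mathit{kitchen}_i\end{aligned}$$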

Interpretation of the most significant coefficients:

• When type changes from 0 to 1 (other variables held constant), the price increases
by a factor of exp(0.35) ≈ 1.42;
• For each additional minute in mins (other variables held constant), the price
decreases by a factor of exp(-0.01) ≈ 0.99;
• When moscow changes from 0 to 1 (other variables held constant), the price
decreases by a factor of exp(-0.38) ≈ 0.68.
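The factors quoted above follow directly from exponentiating the coefficients of the log-linear model. A short R sketch of the arithmetic, using the three coefficients from the fitted equation (the vector `b` is introduced here only for illustration):

```r
# Coefficients taken from the fitted equation above
b <- c(type = 0.35, mins = -0.01, moscow = -0.38)

factor_change  <- exp(b)                          # multiplicative effect on price
percent_change <- 100 * (exp(b) - 1)              # the same effect in percent

round(cbind(factor_change, percent_change), 2)
```

The percentage form is often easier to read: a factor of 1.42 corresponds to a price that is about 42% higher, and a factor of 0.68 to a price that is about 32% lower.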
Goodness-of-fit is assessed with the coefficient of determination ($R^2$), which measures
the share of variability in the dependent variable explained by the model. An $R^2$ of 0.8962 indicates
that approximately 89.62% of the variability in the dependent variable ($\ln(\mathit{price}_i)$) is explained by
the regressors included in the model above. This suggests that the model provides a good
fit to the data.
The three main hypotheses stated in the introduction are confirmed by the
empirical evidence:
1. After the Ordinary Least Squares (OLS) estimation and the use of the t-test to
obtain the p-values, it was confirmed that the main factor affecting
housing prices in Moscow and the Moscow region is the area in square meters;
2. There is both theoretical and empirical support for a connection between housing prices
and the distance to the closest metro station. After the OLS estimation, the exogenous
distance variable (mins) proved to have a significant impact on the dependent
variable (price);
3. It is confirmed that housing prices differ between Moscow and the Moscow Oblast,
as shown by the moscow variable in the model (1.4).
Conclusion
Provide a conclusion and a discussion to the analysis performed. Namely:
a) State the problem and the hypotheses you have investigated
The problem stated in our research is to find the connection between housing prices and
the variables described above. In our research, three hypotheses were put forward:
• The main factor that contributes to the change in housing prices in Moscow and the
Moscow Oblast region is the area in square meters;
• There is a connection between housing prices and distance to the closest metro station;
• There is a significant difference between housing prices in Moscow and the Moscow Oblast
region.
b) Briefly describe the features of the data you used;
The data that were analysed contain 22 676 observations. They include information about prices,
area, year of construction, number of rooms, floor and not only the region but also the district of the
apartment. Prices vary from 1 000 000 to 501 460 000 RUB, and areas range from 8 to 305
square metres. Most of the apartments are concentrated in SAO (Northern Administrative Okrug) and
YuAO (Southern Administrative Okrug), so binary variables for these two districts were created in
our analysis.
c) Make salient the features of the model you have built;
Thus, the final model (1.4) is a multiple regression model which includes 11 variables. Moreover,
the model has sufficiently high explanatory power: about 89.62% of the
variability of the dependent variable (price) is explained by the model. In addition, this model does
not suffer from multicollinearity. The model did exhibit heteroscedasticity, which was addressed by
applying robust standard errors.
d) Describe the results you have found. Do they support or contradict the stated hypothesis?
In our work, we formulated three main hypotheses, which we tested by building models. In
the first two hypotheses, we assumed that the main factors that influence
housing prices in Moscow and the Moscow region are the area (square meters) and the distance
to the closest metro station. To assess the impact of the apartment's location on its value, the
variable mins was introduced, which indicates the time in minutes
needed to walk from the apartment to the nearest metro station. The variables area and living area
were also introduced to assess the impact of square meters on the cost of an apartment.
As a result of building the model, it was found that the mins variable has a significant impact on
the cost of apartments, and it is therefore included in the final model (1.4). Moreover, the variables
area and living area have a significant impact on the dependent variable and were therefore also
included in the model. Thus, hypotheses 1 and 2 were confirmed. In hypothesis 3, we assumed that
there is a significant price gap between apartments in Moscow and the Moscow region. This
hypothesis has also been confirmed.
e) What could be an application (theoretical or practical) of your investigation?
Based on our research, companies that work in the real estate sector, such as realtors
or investment companies, can analyze and forecast apartment prices based on different
parameters. For example, when investing in a new building project, an investment company can
estimate possible profits knowing the location of the building, the area of the apartments, etc., whereas
realtors would know the range of prices at which an apartment can be sold depending on the factors
stated above.
Appendix

# Load the data (cleaned beforehand)
library(readxl)
data <- read_excel("C:/Users/Маша/Downloads/data.xlsx")
View(data)

# Load libraries
library(dplyr)
library(Benchmarking)
library(fixest)
library(ggplot2)
library(sandwich)
library(lmtest)
library(stargazer)
library(corrplot)
library(car)
library(tidyr)
library(ggthemes)
library(viridis)

# Summary of the variables
summary(data)

#### Item 4 of the Introduction (plots) ####

# Bar chart (example for mins)
ggplot(data, aes(x = mins)) +
  geom_bar(fill = "steelblue") +
  theme_minimal() +
  ggtitle("Distribution by minutes to metro") +
  xlab("Minutes to Metro") +
  ylab("Frequency")

# Scatter plot (example for mins)
ggplot(data, aes(x = mins, y = price)) +
  geom_point(alpha = 0.6, color = "steelblue") +
  theme_minimal() +
  ggtitle("Scatter plot (Price vs. Minutes to Metro)") +
  xlab("Minutes to Metro") +
  ylab("Price")

# Box plot (example for rooms)
ggplot(data, aes(x = factor(rooms), y = price)) +
  geom_boxplot(color = "steelblue") +
  theme_minimal() +
  labs(title = "Price Distribution by Number of Rooms",
       x = "Number of Rooms",
       y = "Price") +
  scale_y_continuous(limits = c(0, 750))

#### Part 1a ####

# Build the base model (main variables only)
base_model <- lm(price ~ mins + moscow + rooms + type, data = data)
summary(base_model)

stargazer(base_model,
title = "Base Model",
out = "mod_summary.html",
type = "html",
digits = 4) %>%
cat()

# t-test (check whether the variable moscow is really insignificant);
# this is a two-sample t-test of mean price by region
t_test <- t.test(price ~ moscow, data = data)
t_test  # p-value < 0.05 => we reject H0 => there is a significant
        # difference in housing prices between Moscow and the Moscow Oblast region

# Alternative model: extended version (the insignificant variables
# kitchen and cosmetic were removed)
alt_model <- lm(formula = price ~ mins + moscow + rooms + type + area +
                  livingarea + floor + euro + designer, data = data)
summary(alt_model)

stargazer(alt_model,
title = "Alternative Model",
out = "mod_summary.html",
type = "html",
digits = 4) %>%
cat()

# F-test to compare models
anova_result <- anova(base_model, alt_model)
anova_result  # p-value is below the significance level, so H0 is rejected
              # => the alternative model is better

stargazer(anova_result,
title = "F-test for model comparison",
out = "mod_summary.html",
type = "html",
digits = 4) %>%
cat()

# Alternative way to run the F-test (by hand)
residuals_base <- residuals(base_model)
residuals_alt <- residuals(alt_model)
RSS_base <- sum(residuals_base^2)
RSS_alt <- sum(residuals_alt^2)

n <- length(residuals_base)
p_base <- length(coef(base_model)) - 1  # number of regressors, intercept excluded
p_alt <- length(coef(alt_model)) - 1

RSS_diff <- RSS_base - RSS_alt
df_diff <- (p_alt - p_base)

F_statistic <- (RSS_diff / df_diff) / (RSS_alt / (n - p_alt - 1))

p_value <- pf(F_statistic, df_diff, n - p_alt - 1, lower.tail = FALSE)
p_value  # equals 0 => coefficients of the additional variables in the
         # alternative model are not all zero

#### Part 1b ####

# Item d (a deliberately incorrect specification)
wrong_model <- lm(formula = price ~ type + moscow + area +
                    log(floor) + euro + designer + cosmetic + kitchen,
                  data = data)
summary(wrong_model)  # R2 = 0.645

stargazer(wrong_model,
title = "Incorrect Model",
out = "mod_summary.html",
type = "html",
digits = 4) %>%
cat()

# Item e (i)

# Vector of the insignificant variables of wrong_model
additional_variables <- c("moscow", "kitchen", "cosmetic", "livingarea")

# Table of selection criteria for the different variations of our model
results <- data.frame(Model = character(), AIC = numeric(), BIC = numeric(),
                      Adjusted_R_squared = numeric())

# Loop over every combination of the additional variables
base_terms <- "price ~ type + rooms + area + log(floor) + euro + designer"
for (i in seq_along(additional_variables)) {
  combinations <- combn(additional_variables, i)
  for (j in seq_len(ncol(combinations))) {
    added <- paste(combinations[, j], collapse = " + ")
    # Build the formula on a single line so it parses as one expression
    current_model <- lm(as.formula(paste(base_terms, added, sep = " + ")),
                        data = data)
    results <- rbind(results,
                     data.frame(Model = added,
                                AIC = AIC(current_model),
                                BIC = BIC(current_model),
                                Adjusted_R_squared =
                                  summary(current_model)$adj.r.squared))
  }
}

# Results
View(results)

# Which model is best?
best_model_aic <- results[which.min(results$AIC), ]
best_model_bic <- results[which.min(results$BIC), ]
best_model_adj_r_squared <- results[which.max(results$Adjusted_R_squared), ]

print("Best Model based on AIC:")
print(best_model_aic)
print("Best Model based on BIC:")
print(best_model_bic)
print("Best Model based on Adjusted R-squared:")
print(best_model_adj_r_squared) # the best model is the one that adds only the cosmetic variable

wrong_model_i <- lm(formula = price ~ type + moscow + area +
                      log(floor) + euro + designer + cosmetic + kitchen,
                    data = data)
summary(wrong_model_i)

# Item e (ii + iii)

boxCox(wrong_model) # Box-Cox transformation

reset_test <- resettest(wrong_model, power = 2:3) # Ramsey RESET test
reset_test # misspecification (the p-value is small)

long_model <- lm(price ~ type + moscow + area +
                   log(floor) + euro + designer + cosmetic + kitchen +
                   livingarea, data = data)
test_result <- anova(wrong_model, long_model)
test_result # the long regression (with livingarea) is better
stargazer(test_result,
title = "Short/Long regression test",
out = "mod_summary.html",
type = "html",
digits = 4) %>%
cat()

# Item e (iv)
# Chow test
#install.packages("strucchange")
library(strucchange)
chow_test <- sctest(wrong_model, type = "Chow", breakpoints = data$moscow)
summary(chow_test)
test_statistic <- chow_test$statistic
p_value <- chow_test$p.value
p_value # equals 0 => there is evidence of a structural break
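
# Illustrative cross-check, not the original method: when the break is defined
# by a known dummy (such as moscow), a Chow test can be sketched in base R as
# the F-test of a pooled regression against a regression in which every
# coefficient is interacted with the dummy. The simulated data and all chow_*
# names below are hypothetical and exist only for this example.
set.seed(2)
chow_n <- 300
chow_df <- data.frame(g = rep(c(0, 1), each = chow_n / 2), x = rnorm(chow_n))
# Intercept and slope differ between the groups, so a break exists by construction
chow_df$y <- ifelse(chow_df$g == 1, 3 + 2 * chow_df$x, 1 + 0.5 * chow_df$x) +
  rnorm(chow_n)

chow_pooled <- lm(y ~ x, data = chow_df)     # same coefficients in both groups
chow_split <- lm(y ~ x * g, data = chow_df)  # group-specific intercept and slope
chow_anova <- anova(chow_pooled, chow_split)
chow_anova

# The same statistic from the classic Chow formula with group-wise RSS
chow_rss_p <- sum(residuals(chow_pooled)^2)
chow_rss_0 <- sum(residuals(lm(y ~ x, data = chow_df, subset = g == 0))^2)
chow_rss_1 <- sum(residuals(lm(y ~ x, data = chow_df, subset = g == 1))^2)
chow_k <- 2 # parameters per group: intercept and slope
chow_F <- ((chow_rss_p - chow_rss_0 - chow_rss_1) / chow_k) /
  ((chow_rss_0 + chow_rss_1) / (chow_n - 2 * chow_k))
all.equal(chow_F, chow_anova$F[2]) # should be TRUE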

# Item f

# Breusch-Pagan test (heteroskedasticity)
bp_test <- bptest(wrong_model)
bp_test

#Collinearity
library(car)
vif_values <- vif(wrong_model)
View(vif_values)

# Interpretations
interpretations <- c(
  ifelse(bp_test$p.value < 0.05,
         "Reject null hypothesis: heteroscedasticity present",
         "Fail to reject null hypothesis: no evidence of heteroscedasticity"),
  ifelse(any(vif_values > 10),
         "Potential multicollinearity detected",
         "No multicollinearity detected")
)

test_names <- c("Breusch-Pagan Test for Heteroscedasticity", "Collinearity")
# p-values for the table; VIF is not a test, so it has no p-value (NA)
p_values <- c(unname(bp_test$p.value), NA)
test_results <- data.frame(Test = test_names, P_Value = p_values,
                           Interpretation = interpretations)
print(test_results)
View(test_results)

# Item g
# To identify the misspecification, we first need to find the correct specification.
# Start from the alternative model: it is better, but a log transformation may be needed.

best_model_1 <- lm(formula = price ~ type + mins + moscow + rooms + area +
                     livingarea + floor + euro + designer, data = data)
summary(best_model_1) # R2 = 0.65

boxCox(best_model_1, lambda = seq(-2, 2)) # the dependent variable should be log-transformed

# Try another specification
best_model_2 <- lm(formula = log(price) ~ type + mins + moscow + rooms +
                     area + livingarea +
                     floor + euro + designer + cosmetic + kitchen, data = data)

summary(best_model_2) # this specification is better, R2 = 0.86

# Look at the residuals
crPlots(best_model_2) # area and livingarea clearly call for a squared term

best_model_3 <- lm(formula = log(price) ~ type + mins + moscow + rooms +
                     area + I(area^2) + livingarea + I(livingarea^2) +
                     floor + euro + designer + cosmetic + kitchen, data = data)
summary(best_model_3) # this specification is better, R2 = 0.893, but now
                      # there is a problem with livingarea

best_model_4 <- lm(formula = log(price) ~ type + mins + moscow + rooms +
                     area + I(area^2) + livingarea +
                     floor + euro + designer + cosmetic + kitchen, data = data)
summary(best_model_4) # R2 = 0.893, not much of an improvement; the squared
                      # livingarea term is clearly redundant

best_model_5 <- lm(formula = log(price) ~ type + mins + moscow + rooms +
                     area + I(area^2) + log(livingarea) +
                     floor + euro + designer + cosmetic + kitchen, data = data)
summary(best_model_5) # this specification is better, R2 = 0.8962, after taking
                      # the logarithm of livingarea
crPlots(best_model_5) # everything looks fine

stargazer(best_model_5,
title = "Best Model",
out = "mod_summary.html",
type = "html",
digits = 4) %>%
cat()

# So the 5th specification is the best

# Compute AIC/BIC
AIC_val <- AIC(best_model_1, best_model_2, best_model_3, best_model_4,
               best_model_5)
BIC_val <- BIC(best_model_1, best_model_2, best_model_3, best_model_4,
               best_model_5)
View(AIC_val)
View(BIC_val)

# Build a comparison table
models <- list(best_model_1, best_model_2, best_model_3, best_model_4,
               best_model_5)
stargazer(models, title = "Regression Model Summaries", type = "html")

#### Part 2 ####

# Item a) multicollinearity
vif_values <- vif(best_model_5)
vif_values # only area is a concern, and not a critical one

# Item b) heteroskedasticity
bptest(best_model_5) # heteroskedasticity is present

# Robust standard errors
library(sandwich)
best_model_5_robust <- lm(formula = log(price) ~ type + mins + moscow +
                            rooms + area + I(area^2) + log(livingarea) +
                            floor + euro + designer + cosmetic + kitchen,
                          data = data)
# Use the same HC1 estimator for the standard errors and for the coefficient test
robust_se <- sqrt(diag(vcovHC(best_model_5_robust, type = "HC1")))
coef_robust <- coeftest(best_model_5,
                        vcov = vcovHC(best_model_5, type = "HC1")) # coefficients

standard_errors_1 <- summary(best_model_5)$coefficients[, "Std. Error"]
standard_errors_2 <- robust_se

standard_errors_1
standard_errors_2 # the standard errors are smaller in the second (robust) case
View(coef_robust)
View(coeftest(best_model_5))
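
# Illustrative sketch, not part of the original analysis: the HC1 covariance
# matrix can be reproduced by hand in base R as
#   (X'X)^-1 X' diag(e^2) X (X'X)^-1 * n / (n - k).
# The simulated data and all hc_* names are hypothetical and exist only for
# this example.
set.seed(3)
hc_n <- 250
hc_df <- data.frame(x = runif(hc_n, 1, 5))
hc_df$y <- 1 + 2 * hc_df$x + rnorm(hc_n, sd = hc_df$x) # error variance grows with x
hc_fit <- lm(y ~ x, data = hc_df)

hc_X <- model.matrix(hc_fit)
hc_e <- residuals(hc_fit)
hc_k <- ncol(hc_X)
hc_bread <- solve(crossprod(hc_X))  # (X'X)^-1
hc_meat <- crossprod(hc_X * hc_e)   # X' diag(e^2) X (row i of X scaled by e_i)
hc_vcov <- hc_bread %*% hc_meat %*% hc_bread * hc_n / (hc_n - hc_k) # HC1 scaling

hc_se_robust <- sqrt(diag(hc_vcov))
hc_se_classic <- sqrt(diag(vcov(hc_fit)))
cbind(classic = hc_se_classic, HC1 = hc_se_robust)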

# Item c) endogeneity

#install.packages("Formula")
#install.packages("plm")
library(Formula)
library(plm)

# OLS model
model_mnk <- lm(price ~ type + mins + moscow + rooms + area + I(area^2) +
                  log(livingarea) + floor + euro + designer + cosmetic + kitchen,
                data = data)

model_panel <- plm(formula = price ~ type + mins + moscow + rooms + area +
                     I(area^2) + log(livingarea) + floor + euro + designer +
                     cosmetic + kitchen,
                   data = data,
                   model = "within")
# Hausman test
hausman_test <- phtest(model_mnk, model_panel)
print(hausman_test)
