To cite this article: M. Carmen Morillo Balsera, Sandra Martínez-Cuevas, Iñigo Molina Sánchez,
César García-Aranda & M. Estibaliz Martinez Izquierdo (2018): Artificial neural networks and
geostatistical models for housing valuations in urban residential areas, Geografisk Tidsskrift-Danish
Journal of Geography, DOI: 10.1080/00167223.2018.1498364
NOTE
Abbreviations: ANN: Artificial Neural Networks; OK: ordinary Kriging; MLP: multi-layer perceptron
CONTACT M. Carmen Morillo Balsera mariadelcarmen.morillo@upm.es ETSI Topografía, Geodesia y Cartografía, Universidad Politécnica de Madrid,
Campus Sur de la UPM, Km 7.5 de la Autovía de Valencia, Madrid, Spain
© 2018 The Royal Danish Geographical Society
2 M. C. MORILLO BALSERA ET AL.
well as the fact that it is easy to use and has many applications (Pérez, 2017). ANNs belong to the field of Artificial Intelligence (AI), understood as a set of algorithms whose purpose is to imitate human reasoning through deductive logic or the manipulation of symbols (Martín & Sanz, 2006).
Artificial Intelligence began to be applied in real estate valuation in the early nineties. Since then, numerous applications have emerged and new models continue to be created. Real estate valuation using AI has been carried out by Kontrimas and Verikas (2011) in Lithuania, Daşkiran (2015) in Turkey and Tabales, Carmona, and Caridad (2017) in Córdoba (Spain), among others.

metropolitan area of its capital (Figure 1(a)). Pozuelo de Alarcon has an area of 43.2 km² and a population of 84,989 inhabitants, according to data supplied in 2016 by the National Institute of Statistics. The municipality has 12,298 registered plots according to the online cadaster (SEC, 2016, initials in Spanish), and 9,033 single-family homes (the SEC is a telematic point of contact for the cadaster which allows users to make queries and download open data). As indicated in Figure 1(b), most single-family housing units are located outside the town centre. Most isolated or semi-detached units are in gated communities located on the outskirts of the municipality. Most row and closed block units are located in or near the town centre.
Figure 1. (a) Map indicating the location of Pozuelo de Alarcon within the Community of Madrid; (b) detailed map of the
distribution of single-family homes in Pozuelo de Alarcon.
GEOGRAFISK TIDSSKRIFT-DANISH JOURNAL OF GEOGRAPHY 3
Figure 2. Rendering of projected UTM data (ETRS89, zone 30) with the 100 m delimitation.
Table 1. Population frequencies for the qualitative variable single-family housing.
Variables   Categories                      Frequencies   Percentage
Class       Single-family housing           9033
Modality    (1) Isolated or semi-detached   5641          62.5%
            (2) Row or closed block         3391          37.5%

● Modality: Urban single-family residential housing units. The modality can be one of two categories: isolated or semi-detached, vs. row or closed block. Table 1 shows the population frequencies for the qualitative variable single-family housing.
● Year of Construction: Year in which the construction of the housing unit was completed and the first occupancy license obtained.
● Category: Refers to the quality of the building’s construction, on a scale of 1 to 9. One (1) represents the best possible quality for a housing unit, while nine (9) is the worst. This attribute has been defined according to the guidelines established by the General Directorate of the Cadaster, which groups housing units into 9 categories (1 to 9) to estimate the cost of construction in the cadaster.

An exploratory analysis has been performed using the available numerical variables, such as unit price, age and surface area. In this scenario, quantitative and qualitative analyses have been performed, aimed at determining the statistical distribution of the different variables. Also, the distribution of the data around the mean is very heterogeneous, that is, the data are not clustered around the mean. Regarding the variable “Surface area”, a mean value of 296.5 m² is observed, with 50% of single-family housing areas less than or equal to 214 m². Finally, in the case of the variable “Age (year of construction)”, the mean value corresponds to the year 1982, and the data are left-skewed and leptokurtic. It is uniformly distributed. In general, it is concluded that none of the variables follows a normal distribution.
The plots in the study area were represented using the “Projected Coordinate System: ETRS89_UTM_zone_30N” (Figure 2). To avoid border errors, when applying the
Ordinary Kriging method, the boundary of housing developments was set at 100 m inside the limit. Furthermore, the typology of each of the variables was also considered when building the models.

2.3 Methods
The methodology presented in this study has two distinct parts. The first involves using geostatistical procedures to study fluctuations in housing prices, while the second consists of adapting an MLP model to estimate unit prices for housing units based on the input variables and to identify which variables are most important in determining these prices.

2.3.1 Geostatistical analysis of fluctuations in housing prices
The basic tool used in geostatistics is the semivariogram (or simply variogram). The semivariogram is a second-order moment that makes it possible to analyse the spatial continuity of a given variable, Z. If the variable is the unit price of housing units, the variogram can be used to express the spatial dependence of prices at locations separated by different distances (Chica-Olmo, Cano, & Chica-Olmo, 2007).
The Ordinary Kriging (OK) method is a spatial interpolation method that estimates the values of the variable at locations that were not sampled. This estimation process is based on determining the covariances (variograms) for data at the observation points, which means it considers the spatial correlation between the data. According to Krige (1951), the method is a multiple spatial regression based on the following algorithm (Londoño & Valdés, 2012).

$$\hat{Z}(S_0) = \sum_{i=1}^{n} \lambda_i Z(S_i), \qquad i = 1, \ldots, n$$

where:
$\hat{Z}(S_0)$: is the datum to be estimated.
$\lambda_i$: are the weights.
$Z(S_i)$: are the data based on which the estimate is to be made.
This algorithm predicts the value at a point, given the values at the nearest points. After selecting the most appropriate theoretical variogram to fit the available data represented in the experimental semivariogram, it is then necessary to validate the fit because, if the fit is defective, the results obtained from subsequent Kriging would be less than optimal. The procedure most commonly used to validate the fit is called cross-validation.

2.3.2 Artificial Neural Networks
Traditional Hedonic Models are not exempt from limitations; among the problems they may present is multicollinearity between the predetermined variables of the model, caused by the intrinsic characteristics of the variables. Moreover, as far as the analysis of random perturbations is concerned, estimating with cross-sectional data implies the possible presence of heteroscedasticity (Caridad Y Ocerín, Nuñez-Tabales, & Ceular Villamandos, 2008).
An alternative to the econometric methods of Hedonic modelling is the Artificial Neural Network. A Network of Artificial Neurons is a system of interconnected neurons (nodes) that collaborate with each other to produce an output stimulus. It is inspired by the way the nervous system works. The second method applied in this study is based on Artificial Neural Networks, specifically the Multilayer Perceptron algorithm (MLP). Its disadvantage relative to traditional Hedonic Methods is that its predictive function is valid only for a specific domain, since the network has been trained for that domain.
It is important to remark that we do not use linear regression (LR). The main reason is that the Root Mean Square Error (RMSE) of the MLP (RMSE = 26.4 €/m²) was lower than that of LR (RMSE = 60.9 €/m²). The MLP is the most commonly used structure, according to Freeman and Skapura (1993), Haykin (1999), García-Rubio (2004), Peterson and Flanagan (2009), and Park and Bae (2015), as it provides the best results in this type of analysis (Figure 3).
Every neural model requires a propagation rule (activation function) that combines the outputs from each neuron with the corresponding weights established by the connection pattern, thus specifying the relative value of the inputs received from each neuron (Caridad Y Ocerín & Ceular Villamandos, 2001).
The multilayer perceptron is defined as a function of the input and output variables of a given network. This relationship is determined by propagating the input variable values forward. To this end, each neuron in the network processes the information received by its inputs and produces a response or activation that is propagated, through the corresponding connections, to the neurons in the following layer.
The activation function “relates” the weighted sum of units of a layer with the values of units in the hidden layer (in our study, the activation function of the hidden layer is the hyperbolic tangent) and in the output layer (the activation function of the output layer is the identity). There is a node called “BIAS”, which is the component that represents the difference between the values of the dependent variable to be predicted and the model to be created.
For the theoretical development of the operation of a Multilayer Perceptron Neural Network in the simplest case, see Caridad Y Ocerín et al. (2008, p. 32).
The model was validated using cross-validation. The model’s efficiency is assessed on the validation (verification) data subset.
We calculated the predicted prices with the MLP and obtained results very similar to the original values (only in 0.45% of the cases was the difference higher than 70 €/m²). Also, we did not obtain a significant difference between the spatial interpolation (OK) values calculated with the prices predicted by the MLP and those calculated with the original prices, so it was decided to study the geographical environment with the original prices.
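The comparison between predicted and original unit prices described above can be illustrated with a short sketch. It is not the authors' procedure: the function name `compare_prices`, its `threshold` parameter and the sample prices are assumptions introduced only for illustration.

```python
import math

def compare_prices(observed, predicted, threshold=70.0):
    """Return (RMSE, share of cases whose absolute difference exceeds threshold).

    Prices and threshold are assumed to share the same unit (e.g. EUR per m2).
    """
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [p - o for o, p in zip(observed, predicted)]
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    share_over = sum(abs(d) > threshold for d in diffs) / len(diffs)
    return rmse, share_over

# Hypothetical unit prices (illustrative values, not data from the study).
observed = [600.0, 720.0, 810.0, 950.0]
predicted = [610.0, 700.0, 890.0, 940.0]
rmse, share_over = compare_prices(observed, predicted)
```

With real data, `share_over` would correspond to the proportion of housing units whose predicted price departs from the original price by more than the chosen threshold.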
influence, i.e. there is no anisotropy. The values defining this variogram (Figure 4) are:

● Nugget: related to the amount of short-range variability in the data. In this case, it is the random proportion of the unit price variable, which represents 12.20% of the total variability of this variable.
● Partial sill: represents the part of the spatial dependence compared to the total variance. In this case, 87.8% of the total variability of unit prices is explained by the spatial autocorrelation component.
● Range: the distance after which data are no longer correlated. In this case, it is equal to 188.3 m, meaning that there is spatial price interdependence up to an average distance of 188.3 m between dwellings.

Figure 5 shows the result of interpolation using OK and Figure 6 shows the Prediction Standard Error Map, where the variability of the predicted values is analysed. It is observed that in the higher values, the variability is greater since it moves away from the sampling points.
To validate the model, a cross-validation was performed. Figure 5 shows how the house price zones are distributed in Pozuelo de Alarcon (Madrid). The area with the highest prices, from 725 €/m² to 1300 €/m², corresponds to a private residential area with isolated or detached houses with large plots. This settlement is of recent construction and the housing is of high quality. It is located next to a business park, adjacent to a sports area, and has direct access to the main communication routes. These residential areas are characterized by large backyards and gardens with free open spaces. Prices between 560 €/m² and 725 €/m² also correspond to isolated single-family housing, but with smaller parcels. The population density is higher, and they do not have many services, although this category has very good connections with the main road infrastructure. The lowest prices, <560 €/m², are in the urban area of Pozuelo de Alarcon. In these areas the buildings are older and the predominant typology is the single-family dwelling in a closed block. The estimated values should be as close as possible to the observed values. To assess this, the mean square error is calculated and, to evaluate the variability of the prediction, the standardized mean square error, which must approach 1, is applied.
In this study, the mean of the errors is close to zero, the mean square error is approximately 98 €/m² and the mean square of the standardized errors is close to 1. These conditions indicate that the fit of the variogram is appropriate.

3.2 Artificial Neural Networks (MLP)
In the design of the neural network architecture, several combinations were considered: number of hidden layers, number of neurons per layer, activation functions and training algorithm.
The MLP model applied in this study has the following characteristics. The input layer, which represents the independent or predictor variables, is formed by the following neurons: “modality” (with two categories, two nodes or neurons are needed), “age”, “X coordinates”, “Y coordinates”, “area” and “quality” (with nine categories, nine nodes or neurons are needed). The hidden layer, which corresponds to the weights assigned to each of the input variables to build the model, is adapted according to a Hyperbolic Tangent function (activation function). In the output layer, which corresponds to the response or dependent variable “unit price”, the activation function is the identity. The learning of the network is performed by the mathematical method “Gradient descent”.
As an error function we chose the sum of squares of the errors. The aim is to build a model that has the best predictive capacity, where the variable to be predicted is the housing unit price. To guarantee the generalizing
The details of the selected MLP are summarized in Table 2 and Figure 7.

Table 2. Selected MLP features.
MLP Features
Architecture                          15:15–7:1
Neurons in the input layer            15
Neurons in the hidden layer           7
Neurons in the output layer           1
Number of weights                     120 (112 + 8)
Activation function (hidden layer)    Hyperbolic Tangent
Activation function (output layer)    Identity function
Error function                        Sum of squares of the errors

The input layer generates the hidden layer and the hidden layer generates the output layer, hence the need for synaptic weights. The synaptic weights used to move from the input layer to the hidden layer indicate the degree of participation of each of the variables when generating the hidden layer.
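The architecture in Table 2 can be illustrated with a minimal forward-pass sketch. It assumes one bias weight per receiving neuron, which reproduces the 112 + 8 weight count; the function names, the random placeholder weights, the ordering of the inputs and the sample covariates are hypothetical and are not taken from the study, and no training is performed.

```python
import math
import random

# Layer sizes taken from Table 2 (architecture 15:15-7:1).
N_INPUT, N_HIDDEN, N_OUTPUT = 15, 7, 1

def init_weights(n_in, n_out, rng):
    """One row per receiving neuron; the last entry of each row is the bias weight."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
            for _ in range(n_out)]

def layer(inputs, weights, activation):
    """Weighted sum of the inputs plus bias, passed through the activation."""
    extended = list(inputs) + [1.0]  # constant 1.0 feeds the bias weight
    return [activation(sum(w * v for w, v in zip(row, extended)))
            for row in weights]

def forward(x, w_hidden, w_output):
    hidden = layer(x, w_hidden, math.tanh)        # hidden layer: hyperbolic tangent
    return layer(hidden, w_output, lambda s: s)   # output layer: identity

def one_hot(index, size):
    """Encode a 1-based category index as `size` input nodes."""
    return [1.0 if k == index else 0.0 for k in range(1, size + 1)]

def encode(modality, age, x_coord, y_coord, area, quality):
    """2 modality nodes + 4 numeric inputs + 9 quality nodes = 15 inputs."""
    return (one_hot(modality, 2) + [age, x_coord, y_coord, area]
            + one_hot(quality, 9))

def count_weights(*matrices):
    return sum(len(row) for matrix in matrices for row in matrix)

rng = random.Random(0)  # arbitrary seed; weights are untrained placeholders
w_hidden = init_weights(N_INPUT, N_HIDDEN, rng)
w_output = init_weights(N_HIDDEN, N_OUTPUT, rng)

# Hypothetical, already rescaled covariates for one housing unit.
x = encode(modality=1, age=0.4, x_coord=0.2, y_coord=0.7, area=0.3, quality=4)
prediction = forward(x, w_hidden, w_output)[0]
n_weights = count_weights(w_hidden, w_output)  # 7 * (15 + 1) + 1 * (7 + 1)
```

Under this assumption the hidden layer holds 7 × 16 = 112 weights and the output layer 8, giving the 120 weights reported in Table 2.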
Next, to identify the most relevant variable, a sampling process with different value combinations was carried out. The results show that the most important predictor variable is “housing quality” with 43%, followed by “age” with 35%, “area” with 18%, “modality” with 3% and, finally, the geographical coordinates with 1%. In this model, the root mean square error (RMSE) is approximately 26.41 €/m².
4. Discussion
It might be argued that the decision to treat the model as isotropic with respect to the global trend of housing unit prices is, in principle, unjustified and debatable. It is therefore important to clarify this point, because anisotropy in the model would affect spatial continuity and modify the proposed model. The justification for the isotropic model is provided below.
Figure 8 shows the evolution of prices in the study area according to the geographical coordinates. The slight anisotropy observed in the global trends of unit prices justifies this decision, as the experimental omnidirectional variogram obtained is in fact independent of direction.
The slight anisotropy in Figure 8 does not significantly improve the OK model. When the values obtained with anisotropy are compared with those obtained with the OK model without anisotropy (Table 3), it can be concluded that the RMSE is practically identical in the two models. In conclusion, the decision to treat the model as isotropic in the global trends of housing unit prices is fully justified for this research.
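The idea of an omnidirectional experimental variogram, which pools point pairs by separation distance alone and ignores direction, can be sketched as follows. This is a minimal illustration under stated assumptions, not the software or settings used in the study: the function name, the lag width, the number of lags and the sample points are hypothetical.

```python
import math

def omnidirectional_semivariogram(coords, values, lag_width, n_lags):
    """Experimental semivariogram that bins point pairs by distance only.

    For each non-empty lag bin returns (mean distance, semivariance, pair count),
    where semivariance = sum of squared differences / (2 * pair count).
    """
    sq_sums = [0.0] * n_lags
    dist_sums = [0.0] * n_lags
    counts = [0] * n_lags
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            h = math.hypot(coords[i][0] - coords[j][0],
                           coords[i][1] - coords[j][1])
            k = int(h // lag_width)  # index of the distance bin
            if k < n_lags:           # pairs beyond the last lag are ignored
                sq_sums[k] += (values[i] - values[j]) ** 2
                dist_sums[k] += h
                counts[k] += 1
    return [(dist_sums[k] / counts[k], sq_sums[k] / (2 * counts[k]), counts[k])
            for k in range(n_lags) if counts[k] > 0]

# Hypothetical UTM coordinates (m) and unit prices (illustrative values only).
coords = [(0.0, 0.0), (100.0, 0.0), (200.0, 0.0), (300.0, 0.0)]
values = [600.0, 620.0, 660.0, 700.0]
gamma = omnidirectional_semivariogram(coords, values, lag_width=150.0, n_lags=2)
```

A directional variogram would additionally filter pairs by the angle of the separation vector; if the directional curves are practically indistinguishable, the omnidirectional curve is an adequate summary.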
Figure 7. Perceptron multilayer neural network graph.
capacity of the network, the set of observations in the sample has been randomly divided into two subsets:

● The training subset (to create the model), with a total of 6282 records (accounting for about 70% of the sample).
● The test subset (to evaluate the model), which contains 2750 records.

5. Conclusions
The most important conclusion is that both the Ordinary Kriging models and the Neural Network models, applied to the prediction of dwelling unit prices in the municipality of Pozuelo de Alarcon (Madrid), are required to understand housing price valuations.
The classic Neural Network method (Multilayer Perceptron) has provided, in this investigation:
● Higher precision in the prediction of unit prices. In this case, precision reached 98%.
● Identification of the variables that most influence the housing price. In this application, these variables are quality of construction and age.
● A model that is useful for price predictions in future developments of residential areas.

Ordinary Kriging models have complemented this study mainly with:

● Visual information on unit prices, highlighting the groups of housing with higher and lower prices (Figure 5).
● The autocorrelation in housing prices up to a maximum distance of about 188.3 m, obtained from the variographic analysis. At greater distances, Tobler’s geographical law, which states “everything is related to everything else, but near things are more related than distant things” (Tobler, 1970), is no longer fulfilled. Also, 87.8% of the total variability of unit prices is explained by the spatial autocorrelation component.

This type of study is very useful for the development of new urban plans, since it makes it possible to identify urban variables (urban morphology, equipment, infrastructure, etc.) that may be related to price. This is particularly interesting when obtaining a spatial distribution of dwelling prices in new urban growth areas. Further research should be carried out to identify which of the urban variables are statistically significant in relation to price. Finally, with regard to housing unit prices, the use of both techniques is recommended to improve the hedonic models since, as has been verified in this research, together they provide an improved and more efficient model for this type of analysis.

Disclosure statement
No potential conflict of interest was reported by the authors.

ORCID
M. Carmen Morillo Balsera http://orcid.org/0000-0002-0788-8394