
See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/373903122

Prediction of House Prices in Lagos-Nigeria Using Machine Learning Models

Article in European Journal of Theoretical and Applied Sciences · September 2023


DOI: 10.59324/ejtas.2023.1(5).22


All content following this page was uploaded by Okwuchukwu Ejike Chukwuogo on 15 September 2023.



Prediction of House Prices in Lagos-Nigeria
Using Machine Learning Models
Mmesoma Peace Nwankwo
Department of Statistics, Faculty of Physical Sciences, Nnamdi Azikiwe University, Awka, Nigeria
Ndukaku Macdonald Onyeizu
Department of Computer Science, Faculty of Physical Sciences, Nnamdi Azikiwe University, Awka, Nigeria
Emmanuel Chibuogu Asogwa
Department of Computer Science, Faculty of Physical Sciences, Nnamdi Azikiwe University, Awka, Nigeria
Chukwuogo Okwuchukwu Ejike
Department of Computer Science, Faculty of Physical Sciences, Nnamdi Azikiwe University, Awka, Nigeria
Okechukwu J. Obulezi 
Department of Statistics, Faculty of Physical Sciences, Nnamdi Azikiwe University, Awka, Nigeria

Suggested Citation
Nwankwo, M.P., Onyeizu, N.M., Asogwa, E.C., Ejike, C.O. & Obulezi, O.J. (2023). Prediction of House Prices in Lagos-Nigeria Using Machine Learning Models. European Journal of Theoretical and Applied Sciences, 1(5), 313-326. DOI: 10.59324/ejtas.2023.1(5).22

Abstract:
This paper considers the relationship between the price of houses and the following features: the number of bedrooms, parking space, and house type. In this study, a machine learning approach was used to develop models that predict house prices in Lagos. Several techniques were used: a train-test split to divide the data into a training set for building the model and a test set for assessing its accuracy; the mean absolute error performance metric to set the baseline for the model; the Variance Inflation Factor (VIF) to help remove multicollinearity between features; and Streamlit interactive dashboards to communicate with the model. Correlation and regression methods were used to examine the relationships and build the model. There is a strong positive correlation between the number of bedrooms and the number of toilets, and likewise between the number of bedrooms and the number of bathrooms. There is also a moderate positive correlation between the number of bedrooms and price. The model shows that the number of bedrooms, parking spaces, and house type play an important role in determining the price of houses.

Keywords: Train-test split, Variance Inflation Factor (VIF), Correlation, Ridge regression, Machine learning.

This work is licensed under a Creative Commons Attribution 4.0 International License. The license permits unrestricted use, distribution, and reproduction in any medium, on the condition that users give exact credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if they made any changes.

Introduction
Different factors affect the price of houses across Nigeria's states, towns, and localities, depending on choice. A factor that raises prices in one state might lower them in another. Lagos, known as the commercial capital of Nigeria, is known for its expensive house prices; according to forbes.com, it is ranked the 55th most expensive city in the world to live in. Ajah, a suburb in Lagos State, is considered one of the best places to start a family. Unlike other areas such as Lekki, Ikoyi, and Victoria Island, Ajah is a mix of both upper-class and middle-class citizens.

Machine learning (ML) is the subset of artificial intelligence (AI) that focuses on building systems that learn or improve performance based on the data they consume. There are three basic types of machine learning: supervised, unsupervised, and reinforcement learning. Supervised learning is effective for a variety of business purposes, including sales forecasting, inventory optimization, and fraud detection. Some examples of use cases include:
• Predicting real estate prices
• Classifying whether bank transactions are fraudulent or not
• Finding disease risk factors
• Determining whether loan applicants are low-risk or high-risk
• Predicting the failure of industrial equipment's mechanical parts

Regression is a supervised learning technique that aims to find the relationships between the dependent and independent variables. Ridge regression is a method of estimating the coefficients of multiple regression models in scenarios where the independent variables are highly correlated (Hilt & Seegrist, 1977). It has been used in many fields, including econometrics, chemistry, and engineering (Gruber & Schucany, 2020). Uzoma & Jeremiah (2016) developed outlier detection and optimal variable selection techniques in regression analysis, and other papers by the authors include (Anabike et al., 2023; Innocent et al., 2023; Abuh, Onyeagu, & Obulezi, 2023a; Abuh, Onyeagu, & Obulezi, 2023b; Obulezi et al., 2022; Onyekwere & Obulezi, 2022; Onyekwere et al., 2022).

The ridge estimator is given as

$$\hat{\beta}_R = (X^{T} X + \lambda I)^{-1} X^{T} y \qquad (1)$$

where y is the regressand, X is the design matrix, I is the identity matrix, and λ ≥ 0 is the constant that shifts the diagonal of the moment matrix.

Literature review
This section summarizes relevant work on house price prediction using machine learning models. House price prediction can be divided into two categories (Zulkifley et al., 2020): the first focuses on house characteristics, and the second focuses on the model used in house price prediction. Many researchers have produced house price prediction models, including Temur, Akgün, & Temur (2019), Jafari & Akhavian (2019), Gao et al. (2022), The Danh Phan (2018), and Yu et al. (2016).

Park & Bae (2015) implemented machine learning algorithms for housing price prediction. They analyzed housing data for townhouses in Fairfax County and compared the classification accuracy of various algorithms. To help real estate agents, they then developed an improved prediction model to support decisions based on house price assessment.

Jafari & Akhavian (2019) stated that the square footage of a housing unit is the most important variable in predicting the price of a house, followed by the number of bathrooms and the number of bedrooms.

Raga Madhuri, Anuradha, & Vani Pujitha (2019) compared diverse regression techniques, such as gradient boosting, AdaBoost regression, ridge, elastic net, multiple linear regression, and LASSO, to find the best-performing one. The performance measures used were Mean Square Error (MSE) and Root Mean Square Error (RMSE).

The rest of this paper is organized as follows. Section 3 presents the material and method; here, the data is cleaned for subsequent use. Section 4 analyzes the data, Section 5 discusses the results, and Section 6 concludes the paper.


www.ejtas.com EJTAS 2023 | Volume 1 | Number 5


Material and Method
The dataset used for this analysis is the Nigerian housing dataset retrieved from Kaggle, containing 25 unique states, 189 unique towns, and 24326 rows with 8 columns. After cleaning the data, Ajah-Lagos was chosen as the town of interest.

Model
In the course of the analysis, several Python libraries were employed to model and analyze the data. Pandas was used to read, clean, and manipulate the data. Scikit-learn (sklearn) is a Python library that provides tools for machine learning algorithms for regression, classification, and clustering problems. Ridge regression was used for prediction to help reduce overfitting and multicollinearity (Akinwande, Dikko & Gulumbe, 2015). Multicollinearity arises when the independent variables are correlated with one another; VIF was employed to help reduce it. The computational formula for VIF is given as

$$\mathrm{VIF} = \frac{1}{1 - R^{2}} \qquad (2)$$

where R² is the coefficient of determination obtained by regressing one independent variable on the others.

Correlation analysis is primarily concerned with finding out whether a relationship exists between variables and then determining the magnitude and direction of that relationship.

$$\rho = \frac{n \sum xy - \sum x \sum y}{\sqrt{\left(n \sum x^{2} - (\sum x)^{2}\right)\left(n \sum y^{2} - (\sum y)^{2}\right)}} \qquad (3)$$

Matplotlib and Plotly Express were used for data visualization.

Data cleaning
Figure 1 shows the prices of houses in different states in Nigeria. Lagos State has the most expensive houses, followed by Abuja. Across towns, Ikoyi houses appear more expensive than those in any other town, and the gap between Ikoyi and every other town is very large. Houses in Ifako-Ijaiye would not be expected to be in the league of houses in Maitama District, so their position could be a result of outliers.

A low standard deviation shows that the data points tend to be close to their mean, and vice versa. Here, however, the standard deviation is high, so the data points are far from the mean. This indicates that outliers contributed to it.

Figure 1. 20 Most Expensive Houses in Nigeria [N100m]



Figure 2. 20 Most Expensive Houses in Nigeria by Town

An outlier is a data point that is significantly different from the rest of the data. A boxplot helps identify outliers, which lie outside the whiskers. Figure 4 shows that some data points are far away from the rest; one house is listed at approximately 1.8 trillion naira.
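The whisker rule behind a boxplot can be sketched in a few lines. This is a minimal illustration, assuming the common convention that whiskers extend 1.5 times the interquartile range beyond the quartiles (the paper does not state which convention its plot uses); the function name `whisker_outliers` and the toy prices are hypothetical.

```python
import numpy as np

def whisker_outliers(values, k=1.5):
    """Return the values lying outside the boxplot whiskers.

    Assumes whiskers at Q1 - k*IQR and Q3 + k*IQR (k = 1.5 is a common default).
    """
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Toy prices in naira: seven ordinary listings and one extreme listing (illustrative only).
toy_prices = [30e6, 35e6, 40e6, 45e6, 50e6, 55e6, 60e6, 1.8e12]
outliers = whisker_outliers(toy_prices)
print(outliers)
```

On this toy data only the extreme listing falls outside the whiskers, which is the kind of point the boxplot in Figure 4 makes visible.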

Figure 4. Boxplot of House Prices

Figure 3. Descriptive Statistics for Price

It was noticed that a house in Ikoyi, Lagos was listed at N1.8 trillion and a house in Ifako-Ijaiye at N55 billion. Whether these figures are true is unknown, but they distorted the dataset and needed to be dropped. Quantiles were used to trim the dataset to the data points between the 10th and 90th percentiles, which reduced it to 19774 rows with the 8 columns.
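The quantile trimming described above might look like the following in pandas. This is a sketch under assumptions: the column name `price`, the helper name `trim_by_quantiles`, and the toy values are illustrative and not taken from the authors' notebook.

```python
import pandas as pd

def trim_by_quantiles(df, column="price", low=0.10, high=0.90):
    """Keep rows whose `column` value lies between the low and high quantiles (inclusive)."""
    lower, upper = df[column].quantile([low, high])
    return df[df[column].between(lower, upper)]

# Toy data: 18 ordinary prices (N20m to N105m) plus two extreme listings (illustrative only).
prices = [p * 1e6 for p in range(20, 110, 5)] + [55e9, 1.8e12]
toy = pd.DataFrame({"price": prices})

trimmed = trim_by_quantiles(toy)
print(len(toy), "rows before trimming,", len(trimmed), "rows after")
```

Trimming at fixed percentiles removes the cheapest listings as well as the extreme ones, so the cut-offs are a modelling choice rather than a neutral cleaning step.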



Figure 6. Number of Data Points in Each State

The value_counts function in pandas was used to count the number of data points in each state; Lagos contains the highest number. The resulting series was then trimmed to keep only the states with more than 100 data points. This resulted in Figure 7.
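A minimal sketch of this filtering step, assuming a column named `state` (the helper name `keep_frequent_states` and the toy counts are hypothetical):

```python
import pandas as pd

def keep_frequent_states(df, column="state", min_count=100):
    """Keep only rows whose state appears more than `min_count` times."""
    counts = df[column].value_counts()
    frequent = counts[counts > min_count].index
    return df[df[column].isin(frequent)]

# Toy data: two well-represented states and one sparse state (illustrative only).
toy = pd.DataFrame({"state": ["Lagos"] * 150 + ["Abuja"] * 120 + ["Ogun"] * 30})

filtered = keep_frequent_states(toy)
print(filtered["state"].value_counts())
```

Dropping sparsely represented states keeps per-state averages from being driven by a handful of listings.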

Figure 5. Outliers

Figure 7. Stabilizing the Data with Records Greater than 100

After cleaning the dataset, Figure 8 shows the mean price of an apartment by state, with Lagos still leading the chart.
Comparing Figure 9 with Figure 2, Ikoyi still leads the chart but is no longer far ahead of Maitama District.
Ajah ranks 24th on the chart, with Ikoyi leading, followed by Lagos Island and Victoria Island (VI). This is plausible because these towns are known for houses where elites and celebrities live.



Figure 8. Average Price of Houses by State

Figure 9. 10 Expensive Houses by Town (with cleaned dataset)



Figure 10. Towns with Expensive Houses in Lagos

Results

(a) counts of building in Ajah Lagos (b) chart showing price of building in Ajah
Figure 11. Counts and Prices of Different Building in Ajah



The diagram below shows the relationship between price and number of bedrooms. The correlation coefficient is 0.48, which shows that they are moderately correlated.

Figure 12. Correlation Between Bedroom and Price

Figure 13. Correlation Matrix Between Different Variables

From this plot and the correlation matrix, we can see that some variables are highly correlated, which can affect our analysis. Bedrooms, bathrooms, and toilets are highly correlated; this is expected because, as the number of bedrooms increases, the number of bathrooms and toilets tends to increase. We can use VIF to test for multicollinearity (see Etaga et al. [19]).
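The correlation coefficient in equation (3) can be computed directly from its sums. The sketch below implements that formula and checks it against NumPy's built-in `np.corrcoef`; the function name `pearson_r` and the toy bedroom and price values are illustrative assumptions, not the paper's data.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation computed term by term from equation (3)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    numerator = n * np.sum(x * y) - np.sum(x) * np.sum(y)
    denominator = np.sqrt(
        (n * np.sum(x**2) - np.sum(x) ** 2) * (n * np.sum(y**2) - np.sum(y) ** 2)
    )
    return numerator / denominator

# Toy data: number of bedrooms and price in millions of naira (illustrative only).
bedrooms = [1, 2, 3, 3, 4, 5]
price_millions = [20, 35, 30, 50, 45, 80]

r = pearson_r(bedrooms, price_millions)
r_numpy = np.corrcoef(bedrooms, price_millions)[0, 1]
print(round(r, 4), round(r_numpy, 4))
```

Both routes give the same value, so in practice a correlation matrix such as the one in Figure 13 can be produced with library calls rather than by hand.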

Figure 14. Heat Map Showing the Correlation Between Variables



(a) Initial VIF (b) VIF after dropping toilets

(c) VIF after dropping total rooms (d) VIF after dropping bathrooms
Figure 15. VIF Before and After
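The VIF screening summarised in Figure 15 follows equation (2): each feature is regressed on the remaining features, and its VIF is 1/(1 − R²). The sketch below does this with plain NumPy least squares; it is not the authors' code, and the function name `vif` and the synthetic features are assumptions made for illustration.

```python
import numpy as np

def vif(X):
    """VIF for each column of X: regress it on the other columns (plus an intercept)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    scores = []
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coef
        r2 = 1.0 - (residuals @ residuals) / np.sum((target - target.mean()) ** 2)
        scores.append(1.0 / (1.0 - r2))
    return np.array(scores)

# Synthetic features: bathrooms is nearly a copy of bedrooms, parking is unrelated.
rng = np.random.default_rng(0)
bedrooms = rng.integers(1, 7, size=200).astype(float)
bathrooms = bedrooms + rng.normal(0, 0.3, size=200)
parking = rng.integers(0, 5, size=200).astype(float)

scores = vif(np.column_stack([bedrooms, bathrooms, parking]))
print(dict(zip(["bedrooms", "bathrooms", "parking"], scores.round(2))))
```

In this synthetic case the two near-duplicate features receive large VIFs while the unrelated one stays close to 1, which mirrors the drop-and-recompute procedure shown in the four panels.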

As predicted from the correlation matrix, the Variance Inflation Factor (VIF) results show that the VIFs of 5 of the independent variables exceeded 10, which indicates a very strong presence of multicollinearity. Dropping some of these columns reduced the VIF values. A boxplot of the distribution of Ajah house prices was plotted; it revealed outliers, which were trimmed using quantiles. The research model is specified thus: price = f(number of bedrooms, number of parking spaces, house type).

Figure 16. Boxplot of Ajah House Price

After cleaning the data, the model was built. Building a model for prediction involves three stages:
1. Splitting the data: This involves splitting the data into features and target. In Python, the predictors (independent variables) are called features, while the dependent variable is referred to as the target. The predictors involved in this analysis are bedrooms, parking space, and house type. The dependent variable is price. The data are further divided by a train-test split.
(a) Train-test split: A train-test split is when you split your data into a training set and a testing set. The training set is used for training the model, and the testing set is used to test it. This allows you to train your models on the training set and then test their accuracy on the unseen testing set. The data was split into two sets: 80% for training and 20% for testing, giving (1292, 3) for training and (323, 3) for testing. This ensures that both sets are representative of the entire dataset and gives a good way to measure the accuracy of the models.
(b) Fitting the model: This involves making a pipeline. A pipeline can have many transformers but only one regressor, which comes at the end of the pipeline; the pipeline is then fitted with our data.

Figure 17. Fitting a Model

The pipeline has a:
• Transformer: A transformer takes a dataset as input and creates an augmented dataset as output. In this analysis, OneHotEncoder is used to encode categorical data (house types) as numerical data.
• Estimator: An estimator in machine learning is an algorithm that fits on the input dataset to generate a model, which is a transformer. Regression is an estimator that trains on a dataset with labels and features and produces a ridge regression model.
2. Baselines: A baseline model is essentially a simple model that acts as a reference in a machine learning project and is used for comparison purposes. For regression problems, the common rule is to create baseline models that predict the mean or median of the training data output. In the model, the mean-apt-price, which is the mean of y train, is 58429682.66, and the baseline MAE, which is the MAE of y-pred-baseline and y train, is 12644596.9. The baseline model shows that if we predicted that the price of every apartment is N58429682.66, our predictions would be off by an average of N12644596.9.
Mean absolute error (MAE) is a performance metric in machine learning. Absolute error refers to the magnitude of the difference between the prediction of an observation and the true value of that observation. MAE is given as:

$$\mathrm{MAE} = \frac{\sum |y - \hat{y}|}{n} \qquad (4)$$

The training MAE is 8986866.53. This shows that the training MAE beats the baseline MAE by N3657730, so the model is useful for predicting house prices. The next step is to use it on a set of data that it has not seen before, which is our test set. The test MAE is 8525486.28, which also beats our baseline MAE.
3. Evaluating and communicating results: This is the ability to communicate results to the stakeholders.

Figure 18 shows the predicted value of price ŷ against the y test price.

The regression equation is given as:

Price = 33722523.23 + 4352474.19*bedrooms + 111280.33*parking-space + 12215882.35*DD + 673130.34*SDD + 4962884.93*BOF - 3873523.72*TD - 3729909.35*DB - 7236944.02*SDB - 3011520.53*TB

where
33722523.23 : intercept
DD : Detached duplex
TD : Terraced duplex
SDD : Semi-detached duplex
BOF : Block of flat
SDB : Semi-detached bungalow
DB : Detached bungalow
TB : Terraced bungalow
and the remaining values are the coefficients. Each of these coefficients affects price either positively or negatively.
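The three stages described above (splitting, baseline, and a pipeline with a one-hot encoder followed by ridge regression) can be sketched with scikit-learn as follows. This is a hedged illustration, not the authors' pipeline from Figure 17: the data are synthetic, the column names and the `alpha=1.0` setting are assumptions, and scikit-learn's own `OneHotEncoder` is used because the paper does not say which implementation it relied on.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the Ajah listings (all values are illustrative).
rng = np.random.default_rng(42)
n = 500
type_effect = {"Detached duplex": 12e6, "Terraced duplex": -4e6, "Block of flat": 5e6}
df = pd.DataFrame({
    "bedrooms": rng.integers(1, 7, size=n),
    "parking_space": rng.integers(0, 5, size=n),
    "house_type": rng.choice(list(type_effect), size=n),
})
df["price"] = (
    30e6
    + 4e6 * df["bedrooms"]
    + 0.1e6 * df["parking_space"]
    + df["house_type"].map(type_effect)
    + rng.normal(0, 3e6, size=n)
)

# Stage 1: split into features and target, then into 80% train and 20% test.
X = df[["bedrooms", "parking_space", "house_type"]]
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Stage 2: baseline that always predicts the mean training price.
baseline_pred = [y_train.mean()] * len(y_train)
baseline_mae = mean_absolute_error(y_train, baseline_pred)

# Pipeline: one-hot encode the house type, pass numeric columns through, end with Ridge.
model = make_pipeline(
    make_column_transformer(
        (OneHotEncoder(handle_unknown="ignore"), ["house_type"]),
        remainder="passthrough",
    ),
    Ridge(alpha=1.0),
)
model.fit(X_train, y_train)

# Stage 3: compare training and test MAE against the baseline.
train_mae = mean_absolute_error(y_train, model.predict(X_train))
test_mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"baseline {baseline_mae:,.0f} | train {train_mae:,.0f} | test {test_mae:,.0f}")
```

Keeping the encoder inside the pipeline means the same encoding is applied at fit and predict time, which is what makes the test-set comparison against the baseline meaningful.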

Figure 18. Out Sampling

Figure 19. Feature Importance



Feature importance refers to a technique that calculates a score for each input feature of a given model; the scores represent the "importance" of each feature. A higher score means that the feature has a larger effect on the model that is being used to predict a certain variable. This shows that when the house is a detached duplex the price increases by N12.5 million, but when it is a semidetached duplex the price decreases by N7.5 million. It also shows that parking space adds little to the price of an apartment in Ajah, and that the number of bedrooms increases the house price by N2.5 million.
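To make the role of the coefficients concrete, the regression equation reported in the Results can be evaluated directly for a given house. The sketch below simply plugs the published intercept and coefficients into that equation; the function name `predict_price` and the example house are hypothetical, and the dictionary keys are the abbreviations defined with the equation.

```python
# Intercept and coefficients copied from the regression equation in the Results.
INTERCEPT = 33722523.23
BEDROOM_COEF = 4352474.19
PARKING_COEF = 111280.33
HOUSE_TYPE_COEF = {
    "DD": 12215882.35,   # Detached duplex
    "SDD": 673130.34,    # Semi-detached duplex
    "BOF": 4962884.93,   # Block of flat
    "TD": -3873523.72,   # Terraced duplex
    "DB": -3729909.35,   # Detached bungalow
    "SDB": -7236944.02,  # Semi-detached bungalow
    "TB": -3011520.53,   # Terraced bungalow
}

def predict_price(bedrooms, parking_space, house_type):
    """Evaluate the reported linear equation for one house (house_type is an abbreviation)."""
    return (
        INTERCEPT
        + BEDROOM_COEF * bedrooms
        + PARKING_COEF * parking_space
        + HOUSE_TYPE_COEF[house_type]
    )

# Hypothetical example: a 4-bedroom detached duplex with 2 parking spaces.
example = predict_price(bedrooms=4, parking_space=2, house_type="DD")
print(f"N{example:,.2f}")
```

Because the equation is linear, changing only the house type shifts the prediction by the difference between the two type coefficients, regardless of the number of bedrooms or parking spaces.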

Figure 20. Predicted Price of Different House Types

Discussion
The outcome of this study was reached using a regression model with three stages: data splitting, baseline modeling for comparison, and result evaluation, as seen in Figures 18 and 19 above. The price of the house increases by N12.5 million when it is a detached duplex, but decreases by N1 million when it is a semidetached duplex.

Conclusion
In this article, machine learning models have been deployed to predict house prices in Lagos, Nigeria. Ikoyi, Lagos State has the most expensive houses in Nigeria, followed by Maitama District, Abuja. The feature importance shows that house type plays an important role in predicting the price of a building. When the house is a detached duplex the price increases by N12.5 million, but when it is a semidetached duplex the price decreases by a million. It also shows that parking space adds little to the price of an apartment in Ajah, and that the number of bedrooms increases the house price by 2.5 million.

References
Abuh, M., Onyeagu, S. & Obulezi, O.J. (2023a). Exponentiated Power Lindley-Logarithmic Distribution and its Applications. Asian Research Journal of Mathematics, 19(8), 47-60. https://doi.org/10.9734/ARJOM/2023/v19i8686

Abuh, M., Onyeagu, S.I. & Obulezi, O.J. (2023b). Comparative Study Based on Simulation of Some Methods of Classical Estimation of the Parameters of Exponentiated Lindley – Logarithmic Distribution (May 22, 2023). Asian Journal of Probability and Statistics, 22(4), 14-30.

Akinwande, M.O., Dikko, H.G. & Gulumbe, S.U. (2015). Identifying the Limitation of Stepwise Selection for Variable Selection in Regression Analysis. American Journal of Theoretical and Applied Statistics, 4, 414-419.

Anabike, I.C., Igbokwe, C.P., Onyekwere, C.K. & Obulezi, O.J. (2023). Inference on the Parameters of Zubair-Exponential Distribution with Application to Survival Times of Guinea Pigs. Journal of Advances in Mathematics and Computer Science, 38(7), 12–35. https://doi.org/10.9734/jamcs/2023/v38i71769

Gao, G., Bao, Z., Cao, J., Qin, A.K. & Sellis, T. (2022). Location-Centered House Price Prediction: A Multi-Task Learning Approach. ACM Trans. Intell. Syst. Technol, 13(2), 32. https://doi.org/10.1145/3501806

Gruber, M. & Schucany, W.R. (2020). Improving Efficiency by Shrinkage (Statistics: A Series of Textbooks and Monographs) 1st Edition. USA: Routledge.

Hilt, D.E. & Seegrist, D.W. (1977). Ridge: a computer program for calculating ridge regression estimates. Research Note NE-236. Upper Darby, PA: U.S. Department of Agriculture, Forest Service, Northeastern Forest Experiment Station.

Innocent, C.F., Frederick, O.A., Udofia, E.M., Obulezi, O.J. & Igbokwe, C.P. (2023). Estimation of the Parameters of the Power Size Biased Chris-Jerry Distribution. International Journal of Innovative Science and Research Technology, 8(5), 423–436. https://doi.org/10.5281/zenodo.7947382

Jafari, A. & Akhavian, R. (2019). Driving forces for the US residential housing price: a predictive analysis. Built Environment Project and Asset Management, 9(4), 515–529. https://doi.org/10.1108/BEPAM-07-2018-0100

Obulezi, O.J., Anabike, I.C., Oyo, O.G. & Harrison, E.O. (2023). Marshall-Olkin Chris-Jerry Distribution and its Applications. International Journal of Innovative Science and Research Technology, 8(5), 522–533. https://doi.org/10.5281/zenodo.7949632

Onyekwere, C.K. & Obulezi, O.J. (2022). Chris-Jerry Distribution and Its Applications. Asian Journal of Probability and Statistics, 20(1), 16–30. https://doi.org/10.9734/AJPAS/2022/v20i130480

Onyekwere, C.K., Obulezi, O.J., Udofia, E.M. & Anabike, I.C. (2022). Modification of Shanker Distribution using Quadratic Rank Transmutation Map. Journal of Xidian University, 16(8), 179–198. https://doi.org/10.37896/jxu16.8/020

Park, B. & Bae, J.K. (2015). Using machine learning algorithms for housing price prediction: The case of Fairfax County, Virginia housing data. Expert Systems with Applications, 42(6), 2928–2934. https://doi.org/10.1016/j.eswa.2014.11.040

Raga Madhuri, C.H., Anuradha, G. & Vani Pujitha, M. (2019). House price prediction using regression techniques: A comparative study. In 2019 International conference on smart structures and systems (ICSSS). IEEE.

Temur, A.S., Akgün, M. & Temur, G. (2019). Predicting housing sales in Turkey using ARIMA, LSTM and hybrid models. Journal of Business Economics and Management, 20(5), 920-938. https://doi.org/10.3846/jbem.2019.10190

The Danh Phan. (2018). Housing price prediction using machine learning algorithms: The case of Melbourne city, Australia. In 2018 International conference on machine learning and data engineering (iCMLDE). IEEE. https://doi.org/10.1109/iCMLDE.2018.00017

Uzoma, U. & Jeremiah, O. (2016). An Alternative Approach to AIC and Mallow's Cp Statistic-Based Relative Influence Measures (RIMS) in Regression Variable Selection. Open Journal of Statistics, 6, 70-75. https://doi.org/10.4236/ojs.2016.61009

Yu, Y., Song, S., Zhou, T., Yachi, H. & Gao, S. (2016). Forecasting house price index of China using dendritic neuron model. In 2016 International Conference on Progress in Informatics and Computing (PIC), Shanghai, China. https://doi.org/10.1109/PIC.2016.7949463

Zulkifley, N., Rahman, S., Nor Hasbiah, U. & Ibrahim, I. (2020). House Price Prediction using a Machine Learning Model: A Survey of Literature. International Journal of Modern Education and Computer Science, 12, 46-54. https://doi.org/10.5815/ijmecs.2020.06.04
