
Prediction of House Price, Bank Campaigning Status and Bank Loan Status using

Machine Learning Algorithms.

Vinayak Vishnu Kolekar


x18185797
School of Computing
National College of Ireland
Dublin, Ireland
x18185797@student.ncirl.ie

Abstract—Residential property has become an extremely competitive field with the availability of an immense amount of data and assets. People need accurate information to avoid fraud and want to be satisfied after paying a huge amount. This paper presents a multiple linear regression model, and its evaluation, for predicting housing prices from features related to real estate.

Many people take a loan from the bank to buy property, grow a business, or for personal reasons. Just as the customer is concerned about their profit, the bank needs to be careful about repayment of the loan. In this study, I have applied logistic regression and a support vector machine to analyze and predict the loan status of customers by considering their transactional behaviour.

Telemarketing is one strategy for growing a business: it allows bankers to reach a person directly and sell schemes for which customers pay a security deposit. This paper applies a data mining and machine learning approach, a decision tree and K-Nearest Neighbors, to predict whether a telemarketing call will succeed, that is, whether the customer agrees to pay a security deposit. In this paper, multiple machine learning techniques are applied, evaluated and compared on the financial and real estate data mentioned above.

Keywords—SVM, KNN, KDD, Housing, Telemarketing

I. INTRODUCTION

Organizations use digital marketing to reach consumer audiences by calling them for a reason. Centralizing public interaction in one place makes product marketing easy for an organization; such centres communicate with customers through various modes such as telephone, radio, newspapers and social media. One of the easiest and most widely used modes is the telephone, and banks use this strategy to reach out to customers by calling them. The given data set comes from a phone-call-based campaign of a Portuguese banking institution. It is a hectic task to select the group of customers who will agree to pay the deposit, and the importance of this process has grown because it lets a business reach out to customers directly and easily, understand their needs and maintain a relationship with them. On the other hand, due to the large population it is very difficult to contact every customer by telephone. This study aims to reduce the stressful task of selecting the right customers, those likely to purchase, by classifying them on attributes in the data set such as call duration, income range, credit score and job experience. It applies classifiers such as a decision tree and K-Nearest Neighbors to identify which of the bank's targeted telemarketing customers will pay the deposit and which will refuse. [1]

Different types of loans can be taken from a bank with very little guarantee of return. Simultaneously, the number of transactions is expanding and the amount of information is increasing massively, which means both the customer's financial condition and the risk around the loan are harder to assess. Secured loans rely on assets kept as security, whereas unsecured loans depend on credit score and income. Various risk factors are present in bank loans, such as credit risk, liquidity risk and interest-rate risk. In this paper, data mining and machine learning are used to analyze this transactional data and classify the outcome of the loan status: logistic regression and a support vector machine can separate customers who repay the loan on time from those who will default, i.e. classify whether a loan will end in "Charged off" or "Fully paid" status. [2]

The price of a residential house usually depends on the area of interest, the available services and the current market situation. Many players are involved in the selling and buying of property, such as brokers and land agents, so the price the seller gets, or the price the customer pays, may differ from the actual value of the property. The price may also change with market conditions, and the common person, unaware of the ongoing real estate situation and the equivalent fair price, ends up losing money. Thus, an effective house price must be estimated for consumers according to their requirements and financial plan. This estimation can be achieved with data mining and machine learning algorithms, by considering the various features of the real estate along with existing prices. This paper uses multiple linear regression to estimate Melbourne house prices by analyzing existing prices, location, number of rooms, available car parking, type of house and year built. [3]

II. RELATED WORK

["Aswin Ravikumar"] stated that several algorithms are used to increase the accuracy and performance of prediction. Considering that earlier researchers applied algorithms such as "hedonic regression", "artificial neural network", "AdaBoost" and "J48 tree", ["Aswin Ravikumar"] implemented more advanced algorithms for price prediction: "random forest", "multiple linear regression", "support vector machine", "gradient boosted tree", "neural network" and "bagging", among which random forest proved the most accurate. [3]

["Nihar Bhagat"] This journal is intended to forecast suitable house prices in line with the plans and preferences of the real estate
customer. The paper involves the operation of a webpage which combines a multiple linear regression model with data entered by the customer. This application will be helpful for customers who would like to invest in property without the assistance of an agent. [4]

["Danh Phan"] House price prediction is done through the combined implementation and evaluation of multiple machine learning algorithms. The evaluation shows Step-wise plus SVM to be a competitive approach among all the combinations, with a training MSE of 0.0558. [5]

["Sumit Chopra"] implemented a new approach to regression which consists of the dependent variable and the features included for prediction, and which also depends on hidden parameters, such as desirability, that cannot be measured. This approach, called LME, combines a trainable parametric model with a non-parametric manifold model to predict the dependent variable. ["Sumit Chopra"] states that this model gives better predictions than the individual parametric and non-parametric models. [6]

["Vasilios Plakandaras"] proposed a novel hybrid prediction methodology which combines Ensemble Empirical Mode Decomposition (EEMD) with support vector regression from machine learning. The researchers examined eleven dependent variables for US house price forecasting using econometric BAR and BVAR techniques and the novel EEMD-AR-SVR model, and accurately detected the 2006-2009 downturn. According to ["Vasilios Plakandaras"], this system is best used as an early warning of sudden upcoming changes in price. [7]

According to [Vishal Venkat Raman], with a tremendous amount of informal data and assets the housing sector has become an intensely competitive field. By processing these data, developers can give the real estate industry an advantage in the form of future trend prediction and help it take favourable decisions. [Vishal Venkat Raman] et al. applied linear regression to identify a suitable area for the customer in Delhi. [8]

["Hujia Yu"] performed real estate price prediction with regression models such as "lasso", "SVM regression", "ridge" and "random forest", and with classification models such as "naïve Bayes", "logistic regression" and "SVM classification", further applying PCA successfully to improve the prediction accuracy. ["Hujia Yu"] concluded that classification performs better than regression. [9]

[David Murphy] implemented a prediction model for loan defaulters using social media data such as mobile call and SMS logs. [David Murphy] concluded that a single random forest provides better AUC and profit than any other combination of trained models (a 7% increase in overall performance). [10]

["Xin Li"] states that the accuracy of risk prediction algorithms such as KNN, Bayesian and DNN does not keep pace with the expansion of the data. The author applied an LSTM-SVM algorithm to improve prediction accuracy over a single risk prediction algorithm, concluding that some limitations remain due to the lack of primitive data and need to be improved. [11]

["G. Arutjothi"] proposed a machine learning model which predicts whether a customer's loan status is valid or default, reaching an accuracy of 75.08% with KNN. The author further iterated over K values of the KNN algorithm and concluded that at the 30th level the model gives significant accuracy, helping avoid losses in the banking sector. [12]

In this study, ["Başak Gültekin"] applied machine learning techniques to predict "good credit" or "bad credit". The researcher concluded that for classification the neural network performed more accurately than the other models, with a low average square error, and that random forest (84% accurate) gave better results than logistic regression (81% accurate). [13]

["Aboobyda Jafar Hamid"] implemented a model based on banking data to predict loan defaulter status, applying three algorithms, J48, BayesNet and NaiveBayes, in the WEKA application. The J48 algorithm performed well in classifying the data, with high accuracy and low mean square error. [2]

["Sérgio Moro"] proposed a machine learning approach to predict successful telemarketing calls for selling deposits. The author compared four algorithms, including a decision tree, logistic regression and a neural network, and concluded that the neural network gives the best results, with AUC = 0.8 and ALIFT = 0.7: it reaches 79% of the customers who subscribe to the deposit by selecting the 50% of clients classified as positive. [14]

["Elzhan"] and fellow researchers applied various classification models to predict the result of a bank telemarketing campaign. The algorithms were evaluated with the "receiver operating characteristic" (ROC) and "cumulative accuracy profile" (CAP) curves. According to ["Elzhan"], the best classification models are random forest and deep artificial intelligence. [1]

III. METHODOLOGY

A. KDD

Today's generation needs modern computational theory and tools to extract knowledge from tremendously growing digital data. These modern computational theories and tools are part of Knowledge Discovery in Databases (KDD). Why do we need KDD? The usual method of converting data into knowledge is based on manual observation and understanding of a system. For example, a fashion trend is defined from data collected in stores: the data goes to an analysis team, who check which type of clothes sell in which month and report their observations. Today a massive amount of data is generated day by day, and people need its knowledge in less time. To avoid wasting time and money, we need KDD theory and tools, which give the best results in very little time. [15]

KDD is the extraction of hidden patterns from massive unstructured observations, or data, of a system, which helps determine useful knowledge from the collected unstructured data. It is a frequently used data mining process comprising data selection, data cleansing and preprocessing, data analysis, data mining and interpretation. Considering the facts mentioned above, KDD is used to analyze data sets built from raw data. [16]
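As an illustration only (the function names and toy records are mine, not the paper's, and the paper's own analysis was done in R), the KDD flow just described can be sketched as a chain of small stages in Python:

```python
# Illustrative KDD pipeline skeleton: each stage is a small function,
# chained in the order the text describes (selection -> cleaning ->
# transformation -> mining).

def select(records):
    # keep only the fields the analysis goal needs
    return [{"age": r["age"], "y": r["y"]} for r in records]

def clean(records):
    # drop rows with missing values
    return [r for r in records if all(v is not None for v in r.values())]

def transform(records):
    # encode the target as 0/1
    return [{**r, "y": 1 if r["y"] == "yes" else 0} for r in records]

def mine(records):
    # a trivial "model": the majority class of y
    ones = sum(r["y"] for r in records)
    return 1 if ones * 2 >= len(records) else 0

raw = [
    {"age": 30, "job": "admin", "y": "yes"},
    {"age": 45, "job": "technician", "y": "no"},
    {"age": None, "job": "admin", "y": "yes"},
    {"age": 52, "job": "services", "y": "yes"},
]
prediction = mine(transform(clean(select(raw))))
print(prediction)  # prints 1
```

A real KDD run would also feed the interpretation stage back into earlier stages, as the numbered steps below describe.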
Stages of KDD

1. Understanding: This is the first step of KDD, where the aim of the process is defined by the person in charge of the KDD process; the aim may be revised while going through the process. The further process proceeds with this goal in mind.

2. Data Selection: In this stage the data is analyzed deeply to obtain samples for training, testing and validation. A well managed and well designed sampling process separates structured, meaningful data from unstructured raw data.

3. Data cleaning and preprocessing: Multiple outliers, missing values and unexpected characters may be present in the data set, so preprocessing is needed to clean the data and hand meaningful, cleaned data to the model.

4. Data reduction and transformation: In this stage multiple data reduction and transformation techniques are considered, which helps process the data more accurately and precisely. As we will be dealing with many unwanted features that make no impact on the output, this stage removes the unwanted variables and transforms the data into a subset of the main data set.

5. Exploratory analysis and model selection: Exploratory analysis aims to gain more insight into the problem statement. In this stage one or more models that can provide valuable insights are selected; multiple potentially useful models may be selected and explored.

6. Data mining: In this stage the best possible model fit is found and made extensible. Different patterns are plotted and checked for correctness, and the data mining algorithm is applied and verified repeatedly until its performance is up to the mark. For example, in KNN the value of K decides the error rate and the accuracy.

7. Interpretation, evaluation and delivery: This step checks the output parameters, observes the data mining patterns and verifies the accuracy defined in the first step. Preprocessing steps can be repeated here to check and improve the effect, for example by adding an additional feature to the algorithm. The knowledge obtained from this step can be exported and stored as a document. [16]

B. In this study, the data has been collected from the domains of "Real estate", "Telemarketing" and "Loan defaulter" and nurtured with the help of the KDD process described above.

1. Telemarketing

Understanding: The telemarketing data has been collected with the aim: "Can we reduce the stressful task of selecting the right clients in the process of telemarketing with the use of machine learning?"

Data Selection: The telemarketing data is collected from the UCI library. The original data set has 41,188 rows and 21 columns, consisting of groups of attributes such as "bank client data", "data related to the last contact", "social and economic attributes", "other variables" and the desired target variable.

Data cleaning and preprocessing: The data needs cleaning before any model is applied to it. Missing data is identified with the help of the "is.na" function, as shown below; there are no missing values in this data set.

Figure 1

Further, outliers are detected and removed from the data set. Outliers are values that conflict with the remaining values of the data set. Below are the histogram and boxplot of age, from which we can conclude that age has outliers.

Figure 2

Figure 3

I have removed outliers using the interquartile range, removing values above the third-quartile fence; 469 outliers were present in the data set. In the same way, outliers in the other variables are also removed. For example, the duration column has some 0 values, which lie below the first quartile; common sense says that if the call duration is 0 then that person definitely did not subscribe. We have removed such instances from the data set.

Data reduction and transformation: In this step adequate data is prepared and developed for the requirements of the data mining algorithms.
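The cleaning steps above, checking for missing values and applying the interquartile-range rule, can be sketched as follows (the paper used R's is.na; this pure-Python equivalent and its toy ages are my own illustration):

```python
# Missing-value count and IQR-based outlier removal, from scratch.

def count_missing(values):
    return sum(1 for v in values if v is None)

def quartiles(values):
    s = sorted(values)
    def q(p):
        # simple linear-interpolation quantile
        k = (len(s) - 1) * p
        lo, hi = int(k), min(int(k) + 1, len(s) - 1)
        return s[lo] + (s[hi] - s[lo]) * (k - lo)
    return q(0.25), q(0.75)

def remove_outliers_iqr(values, k=1.5):
    # drop anything outside [Q1 - k*IQR, Q3 + k*IQR]
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lo <= v <= hi]

ages = [25, 31, 38, 44, 47, 52, 58, 63, 70, 98]  # 98 is an extreme value
print(count_missing(ages))           # prints 0, as with the telemarketing data
print(remove_outliers_iqr(ages))     # the extreme age is dropped
```

The 1.5 multiplier is the conventional fence; the paper only states that values above the third quartile fence were removed.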
Reduction: Data reduction techniques have been applied to remove unneeded columns from the data set, which helps process the data more accurately and quickly. As we will be dealing with many unwanted features that make no impact on the output, this stage removes the unwanted variables and reduces the data to a subset of the main data set. The remaining subset contains the features "age", "job", "marital", "education", "default", "housing", "loan", "duration" and "y".

Transformation: The data remaining after feature reduction undergoes the transformations required by the data mining algorithms: changing the data types of features, scaling features and converting continuous variables into categories. For example, to build a decision tree we need categories, so that at each node a decision can be taken on some condition. For this transformation, features like "age", "job", "marital", "education", "default", "housing", "loan" and "duration" are converted into categories. The age column is split into four categories, "less than 32", "32 to 47", "47 to 70" and "greater than 70", so all ages are replaced with the categories "1", "2", "3" and "4". The data type of these categories then needs to be changed from factor to numeric. Similar transformations have been done for the other columns mentioned. The figure below shows the structure of the final data set after reduction and transformation.

Figure 4

Model selection: In this stage different types of classification models are applied to the data set. Before that, the data received from reduction and transformation is divided in an 80:20 ratio for training and testing. The training set is used to train the model, which is then tested on the remaining 20% of the data; afterwards the test outputs and the predicted outputs are compared to interpret the model. On this data set the "decision tree" and "K-Nearest Neighbors" have been applied and evaluated.

Evaluation and interpretation: After applying the models to the data set, each model is evaluated with the help of the confusion matrix, the model summary and the patterns obtained from predictions on the test set. The performance of each model is checked with parameters such as accuracy, sensitivity, specificity and the Kappa value. For KNN the evaluation is based on the error rate after applying different K values to the model: each K value gives a different error rate and a corresponding accuracy. Below are the confusion matrices obtained from the decision tree and KNN.

Figure 5

Figure 6

2. Loan Defaulter

Understanding: The loan status data set is collected with the goal: "How can we predict customer loan status using machine learning, taking into account the customer's transaction behaviour?"

Data Selection: The loan status data is collected from the Kaggle library. The original data set has 100,000 rows and 19 columns, with attributes related to "bank customer data".

Data cleaning and preprocessing: Before applying any model, the data is cleaned with the help of the "is.na" function. Many NA values were present in the data set, as shown in the figure. There are also outliers, such as credit scores higher than 800, which are not acceptable by common sense; these values are removed and replaced with the mean credit score. Likewise, the "Current Loan Amount" column contains the large value "99999999", which is removed with an outlier reduction technique.

Figure 7

The output variable, loan status, is a character variable with the values "Fully paid" and "Charged off" and needs conversion into the dichotomous form 0 and 1. From the plot below we can see there are more "Fully paid" cases than "Charged off".

Figure 8
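The categorical recoding used in both data sets, binning age into four coded groups for the telemarketing data and mapping the loan status to 0/1, might be sketched as follows (illustrative Python; the paper's transformations were done in R, and the boundary handling at exactly 32/47/70 is my assumption):

```python
# Binning a continuous age into the four coded categories described in
# the text, and encoding the dichotomous loan-status target.

import bisect

CUTS = [32, 47, 70]          # upper edges of the first three bins

def age_to_category(age):
    # returns 1..4: <32 -> 1, 32-47 -> 2, 47-70 -> 3, >70 -> 4
    return bisect.bisect_right(CUTS, age) + 1

def encode_target(label):
    # "Fully Paid" / "Charged Off" -> dichotomous 0 / 1
    return 0 if label == "Fully Paid" else 1

print([age_to_category(a) for a in (18, 32, 47, 71)])  # prints [1, 2, 3, 4]
print(encode_target("Charged Off"))                    # prints 1
```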
Data reduction and transformation: In this step adequate data is prepared and developed for the requirements of the data mining algorithms. Reduction is achieved by removing unrelated columns that do not have much impact on the output variable. As previously done for the telemarketing data set, transformation is performed by categorizing continuous variables such as "Credit Score", "Years in current job", "Bankruptcies", "Annual income", "Current Loan Amount", "Number of credit problems" and "Years of Credit History".

Model selection: This stage selects an appropriate model to apply to the data set. Before applying the model we split the data in an 80:20 ratio: the training portion is used to train the model and the remaining data is used for prediction, after which performance is verified by comparison with the predicted data. On this data set logistic regression and a support vector machine have been applied and evaluated.

Evaluation and interpretation: After applying the models to the data set, each model is evaluated with the help of the confusion matrix, the model summary and the patterns obtained from predictions on the test set. Performance is checked with parameters such as accuracy, sensitivity, specificity, the ROC curve and AUC. Below are the confusion matrices for logistic regression and the support vector machine.

Figure 9

Figure 10

3. Real Estate

Understanding: The Melbourne house price data has been collected with the aim: "How can we forecast house prices using multiple linear regression, taking into account house characteristics?"

Data Selection: The Melbourne house price data is collected from the Kaggle data set library. The original data set has 34,857 rows and 21 columns, with attributes describing features of the house such as Suburb, Address, Rooms, Type, Price and Method.

Data cleaning and preprocessing: The data needs cleaning before any model is applied to it. Missing data is identified with the help of the "is.na" function and removed. The plot below shows the missing values present in the data set.

Figure 11

A histogram of price, the predicted variable, is plotted and found to be skewed; therefore I have applied a log transform to the price to make it more evenly distributed. Below are the plots before and after the transformation.

Figure 12

Figure 13

Some "#N/A" entries were also present in the data set; these were removed by first replacing them with NA and then applying the preprocessing.

Data reduction and transformation: Data reduction has been done on the basis of correlation. Below we can see the correlation graph of the dependent variable and the independent variables; since many variables are only weakly correlated, we have to remove these columns from the data set.

Figure 14
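The two preprocessing ideas above, log-transforming the skewed price and keeping only features that correlate with it, can be sketched like this (toy data; the 0.5 cut-off and the feature names are my assumptions, not values from the paper):

```python
# Log-transform a skewed target, then screen features by Pearson
# correlation with it, dropping weakly correlated columns.

import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

prices = [310_000, 540_000, 870_000, 1_450_000]
log_prices = [math.log(p) for p in prices]   # compresses the long tail

rooms = [2, 3, 4, 5]           # strongly related to price
lot_id = [17, 4, 99, 23]       # arbitrary identifier, weakly related

features = {"rooms": rooms, "lot_id": lot_id}
kept = {name: xs for name, xs in features.items()
        if abs(pearson(xs, log_prices)) >= 0.5}   # assumed cut-off
print(sorted(kept))  # prints ['rooms']
```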
The updated correlation after feature reduction is given below; now there is stronger correlation between the dependent variable and the independent variables.

Figure 15

Model selection: This stage selects an appropriate model to apply to the data set. Before applying the model we split the data in a 70:30 ratio: the training portion is used to train the model and the remaining data is used for prediction, after which performance is verified by comparison with the predicted data. On this data set multiple linear regression has been applied and evaluated. Before model execution I scaled the data so that all features lie on the same scale.

Evaluation and interpretation: After applying the model to the data set, it is evaluated with the help of the model summary, the patterns obtained from predictions on the test set, and parameters such as "RMSE", "MAE", "MSE" and "R2". Below is the summary obtained from the model.

Figure 16

IV. EVALUATION OF MODELS

1. CAN WE REDUCE THE STRESSFUL TASK OF SELECTING THE RIGHT CLIENTS IN THE PROCESS OF TELEMARKETING WITH THE USE OF MACHINE LEARNING?

Decision Tree: The decision tree enumerates the characteristics of the data set: "age", "job", "marital", "education", "default", "housing", "loan" and "duration". The model represents these features as a tree in which each node is a decision leading towards the prediction. The decision tree is evaluated with the help of the confusion matrix, the model summary and the model tree diagram.

Confusion matrix: The confusion matrix, obtained with the help of the "caret" library, shows the prediction accuracy of the model. In this study the decision tree accuracy is 92.03%, meaning that customers who will pay and who will not pay are correctly predicted 92% of the time, while sensitivity and specificity are 0.99 and 0.02. The sensitivity of the model is its true positive rate, the proportion of positive cases the classifier identifies correctly; in this case the positive class is "no". There are 6,942 cases that are positive and predicted positive, while 31 cases were actually positive but predicted negative; 571 cases are negative but predicted positive, and 12 cases are negative and predicted negative. The specificity of the model is the number of correct negative predictions divided by the total number of negative cases, which here is very low at 0.02. From this evaluation we can conclude that the model predicts positive cases much better than negative cases; the Kappa value given by the model is 0.028.

Figure 17

Summary: After fitting the model on the training set we can inspect the model summary, where we can see the primary splits available at every node; the summary for each node is shown below.

Figure 18

Model tree: The tree diagram of the decisions is shown below, where we can check the condition at each node. For example, if Duration < 2.5 the model predicts a particular instance as "no"; further decisions then take place to predict the test instances.

Figure 19

K-Nearest Neighbors: KNN is a suitable classification model belonging to the family of supervised learning. In KNN the instances are plotted in a multidimensional space whose dimensions are the features present in the model. When test data is predicted, the Euclidean distance is calculated to find the neighbouring instances; depending on the value of K, the model predicts whether the customer will subscribe or not.
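The KNN procedure just described, Euclidean distance to every training point followed by a majority vote among the K nearest, can be sketched as follows (toy data and labels are mine; the paper's actual model was fit in R):

```python
# A minimal K-Nearest-Neighbors classifier: sort the training points by
# Euclidean distance to the query, then take a majority vote of the
# labels of the K closest.

import math
from collections import Counter

def knn_predict(train, query, k=3):
    # train: list of (feature_vector, label) pairs
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]

train = [
    ((1.0, 1.0), "no"), ((1.2, 0.8), "no"), ((0.9, 1.1), "no"),
    ((4.0, 4.2), "yes"), ((4.1, 3.9), "yes"), ((3.8, 4.0), "yes"),
]
print(knn_predict(train, (4.0, 4.0), k=3))  # prints yes
```

Varying k in this sketch is exactly the tuning loop the paper describes: each K gives a different error rate and accuracy.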
Confusion matrix: The confusion matrix for the KNN algorithm is shown below: 7,286 positive cases and 4 negative cases are predicted correctly, while 53 negative cases are predicted positive by the model and 213 positive cases are predicted negative. The sensitivity and specificity of KNN are 0.9716 and 0.07.

Figure 20

In this case the accuracy, 96.48%, is much better than the decision tree's. The accuracy of KNN increases as the value of K increases, and the error rate decreases correspondingly.

Figure 21

2. HOW CAN WE PREDICT CUSTOMER LOAN STATUS USING MACHINE LEARNING, TAKING INTO ACCOUNT THE CUSTOMER'S TRANSACTION BEHAVIOUR?

Logistic Regression: Logistic regression is applied in this study to predict whether a customer's loan status will be "Fully paid" or "Charged off". The features included in the model are "Current Loan Amount", "Term", "Credit Score", "Annual Income", "Home Ownership", "Number of Credit Problems", "Bankruptcies" and "Years in current job". The evaluation of logistic regression has been done with the help of the model summary, the confusion matrix and the ROC curve.

Summary: The summary received from the logistic regression model is shown below, where we get the significant features of the model: variables with p-values near 0 are more significant, whereas p-values near 1 are less significant. We can remove the less significant variables and run the model again for the best results.

Figure 22

Confusion matrix: As the data set is imbalanced, with many more 1s present, the model is less well trained on the other category, due to which we receive a low false-negative rate. The confusion matrix for this study has false negatives = 2, true positives = 12,701, false positives = 1 and true negatives = 2,524; the resulting accuracy of logistic regression is 0.8341, i.e. 83.41%. Logistic regression for loan status has 0.99 sensitivity and 0.007 specificity.

Figure 23

ROC: The curve below shows the receiver operating characteristic for logistic regression, plotted with the help of sensitivity and specificity; the area under this curve indicates the predictive range of the model.

Figure 24

AUC: The area under the curve for this model is 60.23%, which shows the classification capability of the model: the higher the value, the higher the chance of predicting 0 as 0 and 1 as 1.

Figure 25

Support Vector Machine: SVM is a supervised machine learning algorithm commonly used for classification as well as regression. In this model, classification of the dependent variable is achieved by fitting a hyperplane. For the support vector machine in this paper we use the same features: "Current Loan Amount", "Term", "Credit Score", "Annual Income", "Home Ownership", "Number of Credit Problems", "Bankruptcies" and "Years in current job". The performance of SVM is measured by the confusion matrix, ROC and AUC.

Confusion matrix: The confusion matrix for SVM is shown below, where we find the accuracy is the same as logistic regression's, 83.42%. In this case the positive class is 0, i.e. "Charged off". The specificity of the support vector machine is equal to 1, which means that SVM predicts the negative cases, "Fully paid", more correctly than "Charged off"; the sensitivity has a very low value.

Figure 26

ROC: The ROC curve for the support vector machine is shown below; since no curve rises above the diagonal, the predictive power is a flat 50%.

Figure 27

AUC: The area under the curve for SVM is 50%, from which we can conclude that logistic regression has a higher AUC than SVM.

Figure 28

3. HOW CAN WE FORECAST HOUSE PRICES USING MULTIPLE LINEAR REGRESSION TAKING INTO ACCOUNT HOUSE CHARACTERISTICS?

Multiple Linear Regression: Multiple linear regression tries to form a relationship between one or more explanatory variables and a dependent variable by fitting a linear equation to the observed data. In this study multiple features are available for predicting the house price, such as "Rooms", "Bathroom", "Car parking", "Type", "Distance", "Year built", "Building area", "Latitude" and "Longitude", all of which correlate strongly with Price. The evaluation of multiple linear regression was done with "Root Mean Square Error", "Mean Square Error" and "R2".

Summary: The summary of the multiple linear regression shows the significance of the available features with respect to the dependent variable, Price. In this case we can see that almost all features pass the significance level, with p-values close to 0.

Figure 29

The R-squared of the multiple linear regression is 0.61, which also tells us the accuracy of the model. The mean absolute error is 0.49 and the mean square error is 0.41, which means the final predicted value lies about ±0.49 (MAE) away from the actual value. So we can conclude that multiple linear regression gives about 60% accuracy on this data set.

Figure 30

Some patterns drawn from the model are as follows.

Residual vs Fitted: This plot shows the residual values against the fitted values of the model. It represents the residual trend in the model, reveals outliers and shows heteroskedasticity. From this graph we can conclude that the model predicts correctly at lower values but overpredicts at higher values.

Figure 31

Q-Q plot: This shows whether the residuals are normally distributed. As almost all residuals follow the line on the graph, they come from a normal distribution; some at the lower end, away from the standard line, do not.

Figure 32
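The evaluation quantities used throughout this section, accuracy, sensitivity and specificity from a confusion matrix, and MAE, MSE and R2 for the regression, are simple to compute from scratch. A stdlib-Python sketch with toy counts (not the paper's figures):

```python
# Classification metrics from confusion-matrix counts, and the three
# regression metrics used to evaluate the house-price model.

def classification_metrics(tp, fp, fn, tn):
    n = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / n,
        "sensitivity": tp / (tp + fn),   # true-positive rate
        "specificity": tn / (tn + fp),   # true-negative rate
    }

def regression_metrics(y_true, y_pred):
    n = len(y_true)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    mean = sum(y_true) / n
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return {"MAE": mae, "MSE": mse, "R2": 1 - ss_res / ss_tot}

m = classification_metrics(tp=90, fp=5, fn=10, tn=95)
print(round(m["accuracy"], 3), round(m["sensitivity"], 3))  # prints 0.925 0.9
```

Plugging a model's actual confusion-matrix counts into these formulas is how the sensitivity/specificity figures quoted above can be cross-checked.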
Scale-location plot: This plot is the same as the residual plot except that the values are the square roots of the standardized residuals; from it we can find a residual trend.

Figure 33

Residual vs Leverage plot: This plots the standardized residuals against their leverage, and also shows the Cook's distance limit. Any value outside this limit is an outlier; we can observe one outlier here, but it will not have much effect on the model's prediction. If there were more outliers, a suitable outlier reduction technique would have to be applied.

Figure 34

V. CONCLUSION AND FUTURE WORK

In summary, in this paper different machine learning models have been applied in different sectors: "Telemarketing", "Housing" and "Loan defaulters". The original data sets were collected, preprocessed and transformed into clean data sets, on which the different models were then applied, analyzed and evaluated to achieve the aim of the research.

The evaluation of the models shows that multiple linear regression predicts house prices with a moderate accuracy of 60%; in future, other regression models such as XGBoost could be applied to increase accuracy. KNN is better than the decision tree at selecting favourable customers who will subscribe and pay the security deposit: KNN has an accuracy of 96% whereas the decision tree has 92%, but the specificity is low and can be improved in future; the accuracy of the decision tree classifier can also be improved with pruning. The loan status of the customer is predicted by logistic regression and SVM equally, at 83.41%, but the AUC for logistic regression is 10% better than SVM's; by changing the kernel type, the SVM's accuracy could be increased and a better AUC achieved.

VI. REFERENCES

[1] E. Zeinulla, K. Bekbayeva and A. Yazici, "Comparative Study of the Classification Models for Prediction of Bank Telemarketing".
[2] A. J. Hamid and T. M. Ahmed, "Developing Prediction Model of Loan Risk in Banks Using Data Mining", 2016.
[3] A. S. Ravikumar, "Real Estate Price Prediction Using Machine Learning", 2017/2018.
[4] N. Bhagat, A. Mohokar and S. Mane, "House Price Forecasting Using Data Mining", 2016.
[5] D. Phan, "Housing Price Prediction Using Machine Learning", 2018.
[6] S. Chopra et al., "Discovering the Hidden Structure of House Prices with a Non-Parametric Latent Manifold Model".
[7] V. Plakandaras, R. Gupta, P. Gogas and T. Papadimitriou, "Forecasting the U.S. Real House Price Index", 2014.
[8] V. V. Raman, S. Vijay and S. Banu K, "Identifying Customer Interest in Real Estate Using", 2014.
[9] H. Yu and J. Wu, "Real Estate Price Prediction with Regression and Classification", 2016.
[10] D. Murphy, "Prediction of Loan Defaulters in".
[11] X. Li, X. Long, G. Sun, G. Yang and H. Li, "Overdue Prediction of Bank Loans Based on".
[12] G. Arutjothi and C. Senthamarai, "Prediction of Loan Status in Commercial Bank", 2017.
[13] B. Gültekin and B. E. Şakar, "Variable Importance Analysis in Default Prediction Using Machine".
[14] S. Moro, P. Cortez and P. Rita, "A Data-Driven Approach to Predict the Success of Bank Telemarketing".
[15] U. Fayyad, G. Piatetsky-Shapiro and P. Smyth, "From Data Mining to Knowledge Discovery in Databases".
[16] N. Dasgupta, V. B. Lanzetta and R. A. Farias, Hands-On Data Science with R, Packt Publishing, 2018.
[17] M. M. and Y. Kodratoff, "From Machine Learning towards Knowledge Discovery in Databases".
[18] B. V. Srinivasan, "Domain-Specific Adaptation of a Partial Least Squares Regression Model for Loan Defaults Prediction".
[19] G. James, D. Witten, T. Hastie and R. Tibshirani, An Introduction to Statistical Learning with Applications in R, Springer.
[20] J.-L. An, Z.-O. Wang and Z.-P. Ma, "An Incremental Learning Algorithm for Support Vector Machine", 2003.