Abstract—Residential property has become an extremely competitive field with the availability of an immense amount of data and assets. People need more accurate information to avoid fraud and want to be satisfied after paying a huge amount. This paper presents a multiple linear regression model and its evaluation for predicting housing prices from features related to real estate.
Many people take a loan from the bank to buy property, grow a business or for personal reasons. Just as the customer is concerned about their profit, banks need to be careful about repayment of the loan. In this study, I have applied logistic regression and support vector machine learning algorithms to analyze and predict the loan status of a customer by considering the customer's transactional behaviour.
Telemarketing is one of the strategies to enhance business; it allows bankers to reach a person directly and sell schemes where customers need to pay a security deposit. This paper uses a data mining and machine learning approach, applying a decision tree and K-Nearest Neighbor, to predict whether a telemarketing call is successful, that is, whether the customer agrees to pay a security deposit. In this paper, multiple machine learning techniques are applied, evaluated and compared on the financial and real estate data mentioned above.
Keywords—SVM, KNN, KDD, Housing, Telemarketing
I. INTRODUCTION
Organizations use digital marketing to reach consumer audiences, for example by calling them for a specific purpose. Centralizing public interaction in one place makes product marketing easier for the organization. Such places communicate with customers through various modes such as telephone, radio, newspaper and social media. One of the easiest and most widely used modes is the telephone. Bankers use this strategy to reach out to customers by calling them. The given data set comes from a phone call-based campaign of a Portuguese banking institution. Selecting the group of customers who will agree to pay the deposit is a laborious task. The importance of this process has grown, since it lets a business reach customers directly and easily. Business people can understand customers' needs and maintain relationships with them. On the other hand, because of the large population it is very difficult to contact every customer by telephone. This study aims to reduce the burden of selecting the customers who are likely to purchase by classifying them based on attributes in the data set such as duration of the call, income range, credit score and job experience. This study applies classification methods, namely decision tree and K-Nearest Neighbor, to identify whether a targeted bank telemarketing customer will pay the deposit or refuse. [1]
Different types of loan can be taken from the bank with very little guarantee of return. At the same time, the number of transactions is expanding and the amount of information is increasing massively. This information reflects the customer's financial condition, and the risk around loans is increasing. Secured loans rely on assets kept as security, whereas unsecured loans depend on credit score and income. Bank loans carry various risk factors such as credit risk, liquidity risk and low-interest risk. In this paper, data mining and machine learning are used to analyse this transactional data and classify the outcome of the loan status. Logistic regression and support vector machine can distinguish customers who repay the loan on time from those who will default; that is, they can classify whether a loan given to a customer will result in “Charged off” or “Fully paid” status. [2]
The price of a residential house usually depends on the area of interest, available services and the current market situation. Many players, such as brokers and land agents, are involved in buying and selling property. The amount the seller receives, or the amount the customer pays, may differ from the actual cost of the property. The property price may also change under different market conditions, and the average person is unaware of the ongoing real estate situation and the corresponding price, so they lose money. Thus, the effective house price for consumers must be estimated according to their requirements and financial plan. This estimation can be achieved with data mining and machine learning algorithms by considering various features of the available real estate facilities along with existing prices. This paper uses multiple linear regression to estimate house prices in Melbourne by analyzing existing prices, location, number of rooms, car parking available, type of house and year built. [3]
II. RELATED WORK
[“Aswin Ravikumar”] stated that several algorithms are used to increase the accuracy and performance of prediction. Building on researchers who applied algorithms like “hedonic regression”, “artificial neural network”, “AdaBoost” and “J48 tree”, [“Aswin Ravikumar”] implemented more advanced algorithms for price prediction such as “Random forest”, “Multiple linear regression”, “support vector machine”, “gradient boosted tree”, “neural network” and “bagging”, among which random forest is more accurate than the others. [3]
[“Nihar Bhagat”] The journal article is intended to forecast successful home prices in respect of the plans and preferences of real estate
Figure 4
Figure 8
Data reduction and transformation:-In this step, adequate data is prepared and developed to meet the requirements of the data mining algorithms. The reduction is achieved by removing unrelated columns, which do not have much impact on the output variable. As was done for the Telemarketing data set, this data set is also transformed by categorizing continuous variables such as “Credit Score”, “Years in the current job”, “Bankruptcies”, “Annual income”, “Current Loan Amount”, “Number of credit problem” and “Years of Credit History”.
Data cleaning and preprocessing:-The data needs cleaning before any model is applied to it. Missing data is identified with the help of the “is.na” function and removed. The plot below shows the missing values present in the data set.
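The cleaning and categorization steps described in this part of the paper can be sketched as follows. This is a minimal Python illustration using only the standard library, not the code used in the study, which names R's “is.na” function. The record fields, the cut points 600 and 700 and the labels “low”, “medium” and “high” are assumptions made for the example.

```python
# Illustrative sketch only: field names, cut points and labels are assumptions.

def drop_missing(rows):
    """Return only the rows in which every field has a value (no None)."""
    return [row for row in rows if all(value is not None for value in row.values())]


def categorize(value, cut_points, labels):
    """Map a continuous value to a category label.

    cut_points are ascending upper bounds; labels has one more entry than
    cut_points, and the last label covers everything above the final bound.
    """
    for upper_bound, label in zip(cut_points, labels):
        if value <= upper_bound:
            return label
    return labels[-1]


# Hypothetical loan records; None marks a missing value.
loans = [
    {"credit_score": 720, "annual_income": 52000},
    {"credit_score": None, "annual_income": 41000},
    {"credit_score": 580, "annual_income": 30000},
    {"credit_score": 650, "annual_income": None},
]

# Step 1: remove incomplete records.
clean = drop_missing(loans)

# Step 2: turn the continuous credit score into a categorical band.
for row in clean:
    row["credit_band"] = categorize(row["credit_score"], [600, 700], ["low", "medium", "high"])
```

Suitable cut points depend on the distribution of each variable, so in practice they would be chosen from the data rather than fixed in advance.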
Figure 9
Figure 12
Figure 10
Figure 14
The updated correlation after reducing the features of the data set is given below. There is now more correlation between the dependent variable and the independent variables.
will not pay can be correctly predicted by 92%, whereas sensitivity and specificity are 0.99 and 0.02. The sensitivity of the model is its true positive rate, that is, the share of positive cases correctly identified by the classifier; in this case, the positive class is “NO”. There are 6942 cases that are positive and that the model also predicted positive, but 31 cases were actually positive and the model predicted negative. There are 571 cases that are negative and the model predicted positive, and 12 cases that are negative and the model predicted negative. The specificity of the model is the number of correctly predicted negative cases divided by the total number of negative cases, which is very low at 0.02. From the above evaluation, we can conclude that this model predicts positive cases better than negative cases. The Kappa value given by the model is 0.028.
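The reported figures can be checked directly from the four confusion matrix counts. The sketch below is a Python illustration, not the code used in the study; the function and variable names are assumptions, while the counts are the ones stated above.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Compute accuracy, sensitivity, specificity and Cohen's kappa
    from the four cells of a binary confusion matrix."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate

    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)
    return accuracy, sensitivity, specificity, kappa


# Counts reported for the decision tree (positive class "NO").
accuracy, sensitivity, specificity, kappa = confusion_metrics(tp=6942, fn=31, fp=571, tn=12)
```

With these counts the accuracy is about 0.92, the sensitivity about 0.996, the specificity about 0.021 and kappa about 0.028. The high accuracy mostly reflects the dominance of the positive class, which is why the kappa value stays close to zero.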
Figure 15
Figure 18
Figure 22
In this case, the accuracy of 96.48% is much better than that of the decision tree. The accuracy of KNN increases exponentially as the value of K increases, and the error rate decreases correspondingly.
Figure 23
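For readers unfamiliar with the role of K, the K-Nearest Neighbor rule can be sketched as below. This is a generic Python illustration with made-up two-dimensional points, not the model or data used in the study; the function name and the toy labels are assumptions.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest
    training points, using Euclidean distance."""
    # Pair every training point's distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Count the labels of the k closest points and return the most common one.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy data: two well-separated groups of customers (hypothetical).
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["no", "no", "no", "yes", "yes", "yes"]

near_first_group = knn_predict(points, labels, (2, 2), k=3)
near_second_group = knn_predict(points, labels, (8.5, 8.5), k=3)
```

Repeating such a prediction over a held-out test set for several values of K gives the accuracy and error rate against K that the text describes.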
ROC:-The curve below shows the receiver operating characteristic (ROC) curve for the logistic regression model, plotted from sensitivity and specificity. The area under this curve summarizes the predictive ability of the model.
Figure 21
Figure 29
Figure 30
AUC:-The area under the curve for SVM is 50%. From this we can conclude that logistic regression has a higher AUC than SVM.
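An AUC of 50% means the classifier ranks positive and negative cases no better than chance. The sketch below illustrates this using the pairwise interpretation of AUC: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. It is a Python illustration with made-up scores, not the evaluation code of the study.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    labels use 1 for the positive class and 0 for the negative class.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")

    # A positive scored above a negative counts 1, a tie counts 0.5.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


perfect = auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])   # every positive outranks every negative
constant = auc([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])  # same score for every case
partial = auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])   # one of four pairs is misordered
```

A model that assigns every case to the same class behaves like the constant example, which is one way an AUC of 50% can arise alongside a high accuracy on imbalanced data.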
Figure 28
3. HOW CAN WE FORECAST HOUSE PRICES USING MULTIPLE LINEAR REGRESSION TAKING INTO ACCOUNT HOUSE CHARACTERISTICS?
Multiple Linear Regression:-Multiple linear regression models the relationship between several explanatory variables and a dependent variable by fitting a linear equation to the observed data. In this study, multiple features are available for predicting house price, such as “Room”, “Bathroom”, “Car parking”, “Type”, “Distance”, “year built”, “building area”, “Latitude” and “Longitude”. These are the variables more strongly correlated with Price. The multiple linear regression model was evaluated with “Root Mean Square Error”, “Mean Square Error” and “R2”.
Summary:-The summary of the multiple linear regression model shows the significance of the available features with respect to the dependent variable.
Figure 31
Q-Q plot:-This plot checks whether the residuals are normally distributed. Most residuals follow the line on the graph, which shows that they come from a normal distribution. Some of them, at the lower end, lie away from the reference line and do not follow the distribution.
Figure 32
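The three evaluation measures named for the multiple linear regression model can be computed from actual and predicted prices as sketched below. This is a Python illustration with made-up numbers, not the study's code or results; the function name and sample values are assumptions.

```python
import math


def regression_metrics(actual, predicted):
    """Return (MSE, RMSE, R2) for paired actual and predicted values."""
    n = len(actual)
    residuals = [a - p for a, p in zip(actual, predicted)]

    ss_res = sum(r * r for r in residuals)                # residual sum of squares
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares

    mse = ss_res / n
    rmse = math.sqrt(mse)
    r2 = 1 - ss_res / ss_tot
    return mse, rmse, r2


# Hypothetical prices, in arbitrary units.
mse, rmse, r2 = regression_metrics(actual=[3, 5, 7, 9], predicted=[2, 5, 8, 9])
```

RMSE is expressed in the same units as the price, while R2 is the share of the variance in price that the model explains, which makes it easier to compare across data sets.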
Scale-location plot:-This plot is the same as the residual plot, except that the values are the square root of the standardized residuals. From this plot, we can find a trend in the residuals.
Figure 33
Residual vs Leverage plot:-This plot shows the standardized residuals against their leverage, and also includes the Cook's distance limit. Any value outside this limit is an outlier; as we can observe, there is one outlier, but it will not have much effect on model prediction. If there are more outliers, a suitable outlier reduction technique has to be applied.
Figure 34
V. CONCLUSION AND FUTURE WORK
In summary, in this paper different machine learning models have been applied in different sectors, namely “Telemarketing”, “Housing” and “Loan defaulters”. The original data sets were collected, preprocessed and transformed into clean data sets. Different models were then applied to the cleaned data, analyzed and evaluated to achieve the aim of the research.
The evaluation of the models shows that multiple linear regression predicts house prices with a moderate accuracy of 60%. In future, another regression model such as XGBoost can be applied to increase accuracy. KNN is better than the decision tree at selecting favorable customers who will subscribe and pay the security deposit: KNN has an accuracy of 96%, whereas the decision tree has an accuracy of 92%. However, the specificity is low, which can be improved in future. The accuracy of the decision tree classifier can also be improved with pruning. The loan status of the customer is predicted by logistic regression and SVM with the same accuracy of 83.41%, but the AUC for logistic regression is 10% better than that of SVM. Changing the kernel type of the SVM may increase its accuracy and yield a better AUC.
VI. REFERENCES
[1] E. Zeinulla, K. Bekbayeva and A. Yazici, Comparative Study of the classification models for prediction of bank telemarketing.
[2] A. J. Hamid and T. M. Ahmed, Developing Prediction Model of Loan Risk in Banks Using Data Mining, 2016.
[3] A. S. Ravikumar, Real Estate Price Prediction Using Machine, 2017/2018.
[4] N. Bhagat, A. Mohokar and S. Mane, House Price Forecasting using Data Mining, 2016.
[5] D. Phan, Housing Price Prediction using Machine Learning, 2018.
[6] T. T. L. C. L. Sumit Chopra, Discovering the Hidden Structure of House Prices with a Non-Parametric Latent Manifold Model.
[7] V. Plakandaras, R. Gupta, P. Gogas and T. Papadimitriou, Forecasting the U.S. real house price index, 2014.
[8] V. V. Raman, S. Vijay and S. Banu K, Identifying Customer Interest in Real Estate Using, 2014.
[9] H. Yu and J. Wu, Real Estate Price Prediction with Regression and Classification, 2016.
[10] D. Murphy, Prediction of Loan Defaulters in.
[11] X. Li, X. Long, G. Sun, G. Yang and H. Li, Overdue Prediction of Bank Loans Based on.
[12] A. G. and C. Senthamarai, Prediction of Loan Status in Commercial Bank, 2017.
[13] B. Gültekin and B. E. Şakar, Variable Importance Analysis in Default Prediction using Machine.
[14] S. Moro, P. Cortez and P. Rita, A data-driven approach to predict the success of bank telemarketing.
[15] U. Fayyad, G. Piatetsky-Shapiro and P. Smyth, From Data Mining to Knowledge Discovery in Databases.
[16] N. Dasgupta, V. B. Lanzetta and R. A. Farias, Hands-On Data Science with R, Packt Publishing, 2018.
[17] M. M. and Y. Kodratoff, From Machine Learning Towards Knowledge Discovery in Databases.
[18] B. V. Srinivasan, Domain-specific adaptation of a partial least squares regression model for loan defaults prediction.
[19] G. James, D. Witten, T. Hastie and R. Tibshirani, An Introduction to Statistical Learning with Applications in R, Springer.
[20] Z. W., Z.-P. M. and Jin-Long An, An Incremental Learning Algorithm for Support Vector Machine, 2003.