

Chiang Mai J. Sci. 2022; 49(6): 1669-1682
https://doi.org/10.12982/CMJS.2022.102
Journal homepage: http://epg.science.cmu.ac.th/ejournal/

Research Article
Housing Price Prediction by Divided Regression Analysis
Yann Ling Goh*[a], Yeh Huann Goh [b], Chun-Chieh Yip [a] and Kooi Huat Ng [a]
[a] Lee Kong Chian Faculty of Engineering and Science, Universiti Tunku Abdul Rahman, Jalan Sungai Long,
Cheras, 43000 Kajang, Selangor, Malaysia
[b] Faculty of Engineering, Kolej Universiti Tunku Abdul Rahman, Jalan Genting Kelang, Setapak, 53300 Kuala
Lumpur, Malaysia
*Author for correspondence; e-mail: gohyl@utar.edu.my
Received: 17 May 2022
Revised: 28 August 2022
Accepted: 23 September 2022
ABSTRACT
Regression analysis is a statistical methodology for investigating the relationship between a dependent variable and one or more independent variables. In the current era of big data, statistical analysis of massive volumes of data raises practical problems: the heavy computing load makes the computation time consuming, and the accuracy of the results may be affected by the sheer volume of data. Hence, divided regression analysis is proposed to reduce the computing load. This approach subdivides the dataset into several unique subsets and fits a multiple linear regression to each subset. The results obtained from the subsets are then combined into a divided regression model that is treated as a model of the original overall dataset. The dataset used in this paper is the KC Housesales Data, obtained from the Kaggle website. It contains statistical information about housing prices, for example, lot size, living-area size and the selling price of each house. The goal of this paper is to predict the selling price of a house from the given attributes. The dataset is partitioned into five subsets, and a multiple linear regression is fitted to each subset. Model adequacy checking is then applied to the fitted models. Testing for multicollinearity is also important, because collinearity among the independent variables affects the overall results; hence, the variance inflation factor (VIF) approach is used to detect its existence. Finally, the divided regression model is obtained by combining the results from all the subsets, and its validity is verified.

Keywords: divided regression, multicollinearity, big data

1. INTRODUCTION
Statistics is a tool for collecting, modelling, interpreting, analysing and presenting data. It makes it possible to estimate a population through the analysis of a sample extracted from that population, and it can handle numerical data of huge volume. Regression analysis is a statistical methodology for modelling and investigating the relationship between a response variable and one or more independent variables; it also measures the impact of each independent variable on the response variable. Multiple linear regression is usually preferable in real life, as it is rare for a response variable to depend on only one variable.

Big data is defined as a massive dataset that is relatively complex to analyse. With extremely large datasets, we assume that the big data set is close to the population, and it may be impossible to analyse the whole set at once because of the massive amount of data. In addition, computing on such extremely large and complicated data is very time consuming and burdens the computing load. Divided regression analysis is intended to overcome the burden of big data analysis. The approach is to divide the big data into n sub-datasets of more appropriate size, which can be treated as samples. This overcomes the problem of applying statistical analysis to massive data, since there are many computing limitations in handling huge data with statistical methods [1-5].

Multicollinearity is a problem to be emphasized when the model contains more than one independent variable. Multicollinearity is defined as the statistical phenomenon in which two or more variables in a multiple regression model are strongly correlated with each other. Under these circumstances, a variable is predictable from the other correlated variables, which may cause misleading interpretation; in particular, it may lead to false interpretation of the estimated regression coefficients. The variance inflation factor (VIF) is used to determine the existence of possible multicollinearity between the predictor (independent) variables. The presence of multicollinearity inflates the variance of the estimated coefficients and hence affects the value of the VIF; a VIF of 10 or above implies that there is a multicollinearity problem [6].

Multiple regression analysis is highly useful in experimental situations where the experimenter is able to control the predictor variables. Unfortunately, economists in the old days (as early as 1970) needed to wait about 24 hours to obtain the results of one regression. Nowadays, researchers are able to receive regression results immediately from statistical software, so that the use of regression has become prevalent [7-12].

Divide-and-Conquer, which is compatible with the latent class analysis method, divides a huge data set into n independent subsets of equal size so as to produce a minimum number of subsets [13-15]. In addition, Divide-and-Conquer kernel ridge regression divides a huge data set into n subsets of equal size; the independent kernel ridge regression estimator of each subset is then computed, and lastly the average of the solutions is taken as a global predictor [16].

Machine learning methods such as linear regression, ridge regression, lasso regression, decision trees, etc. are used for big data analysis. The advantages of the linear regression method are that the regression models are built upon basic statistical principles, such as correlation and least-squares error, and that a regression model can include all the variables that one wants to include. The disadvantages are that regression models work with datasets containing numeric values, not with categorical variables, and that they are sensitive to both outliers and multicollinearity; if the number of observations is small, the model overfits and fits noise. Divided regression is an approximation method of regression analysis in which all variables are assumed to be multivariate normal. It can be applied to financial forecasting, sales and promotions forecasting, automobile testing, weather analysis and prediction, time series forecasting, etc. Jittawiriyanukoon C. and Srisarkun (2018) conducted studies on big data analysis using a divided regression method for bootstrapped data [17]. Ridge regression trades variance for bias by shrinking the coefficients towards zero; it avoids overfitting but increases bias, and the model interpretability is low. Decision trees do not require standardization and normalization and need less data preparation, but they do not work well when variables are uncorrelated, have high variance due to their greedy strategy, and the model can become complex. Lasso regression selects features by shrinking coefficients towards zero, but the selected features will be highly biased and can deviate greatly for different bootstrapped data, and its prediction performance is worse than that of ridge regression. By considering the pros and cons of these machine learning methods, the linear regression method is found to be more suitable in this work, because the dataset is numerical and shows no serious outliers and no multicollinearity effects. A divided regression model combining the results of all the subsets in this work can then be obtained [18-20].
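The subset-and-combine procedure outlined above (partition the data, fit a regression on each subset, combine the subset models) can be sketched in a few lines. The following is an illustrative Python/NumPy sketch, not the authors' R code; the synthetic data, the function names and the choice of five subsets are assumptions made for the example.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit of y on X, with an intercept column prepended."""
    Xd = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return beta

def divided_regression(X, y, n_subsets=5, seed=0):
    """Divided regression sketch: partition the rows into n_subsets,
    fit OLS on each subset, and combine by averaging the coefficients."""
    idx = np.random.default_rng(seed).permutation(len(X))
    parts = np.array_split(idx, n_subsets)
    betas = np.array([fit_ols(X[p], y[p]) for p in parts])
    return betas.mean(axis=0)

# Synthetic data standing in for the housing attributes: y = 2 + 3*x1 - x2 + noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 2))
y = 2 + 3 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=1000)

beta_full = fit_ols(X, y)                          # fit on the whole dataset
beta_div = divided_regression(X, y, n_subsets=5)   # combined divided model
print(np.round(beta_full, 2), np.round(beta_div, 2))  # both should be near [2, 3, -1]
```

Averaging the per-subset coefficient vectors mirrors the global-predictor idea of the Divide-and-Conquer estimators cited above; in the paper itself each subset model is additionally checked for adequacy before the results are combined.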
2. MATERIALS AND METHODS
In this paper, multiple linear regression and divided regression analysis are implemented to analyse the large volume of data. The analysis is carried out using R programming. The entire dataset is partitioned into n sub-datasets of an appropriate size. The fitted multiple linear regression model with k independent variables is given by

ŷ = β̂0 + β̂1X1 + β̂2X2 + ⋯ + β̂kXk.

The value of the variance inflation factor (VIF) is calculated to detect the existence of possible multicollinearity between the independent variables. The tolerance needs to be calculated first in order to obtain the VIF: tolerance = 1 − Rj², where Rj² is the coefficient of determination of a regression of variable j on all the other variables. The VIF is then formulated as VIF = 1/tolerance = 1/(1 − Rj²). A VIF greater than or equal to 10 implies the presence of a multicollinearity problem, a value of 1 indicates no correlation among the independent variables, and VIFs between 4 and 10 suggest that further investigation is required.

Some assumptions justify the use of linear regression models for the purpose of prediction: a linear relationship between the independent variables and the response variable, constant variance of the errors, and normality of the error distribution. These assumptions are yet to be validated for the model. Hence, we perform model adequacy checking to ensure the adequacy of the model built. The linearity of the regression model and the constant variance of the errors can be verified by observing a plot of the residuals against the fitted values, and a normal probability plot of the residuals can be used to check whether the errors are normally distributed: if the resulting plot is approximately linear, we assume that the error terms are normally distributed.

After performing the model adequacy checking, a transformation of the dataset will be carried out if the model built violates the assumptions required for a linear regression model. The transformed regression model is formulated as below:

Y(λ) = β0 + β1X1 + β2X2 + ⋯ + βkXk + ε,

where Y(λ) = Y^λ for λ ≠ 0, and Y(λ) = log_e(Y) for λ = 0.

Lastly, the standardized residuals, di = ei/√MSE, are calculated to determine the outlying observations.
Theseerrors Abuilt model.
normal
byby
assumptions violates
observing
observing The
probability transformed
the
are plot
plot yet plot
assumptions
ofof to of
residuals
residuals
be regression
the 3residuals
validatedrequired
against
against in can
in
the be
linear
model. used re
Lastly, the In
standardizedthis paper, the
residuals, multiple
distributed.
𝑌𝑌 𝑑𝑑 = linear
, 𝜆𝜆 ≠ 0 regression
are calculated and divided
to determine regression the analysis
existence will
of be
outliers implemented of
𝑒𝑒 𝑌𝑌METHODS to
AND METHODS
𝛽𝛽1 𝑋𝑋1 + 𝛽𝛽2 𝑋𝑋2 + is ⋯ carried model
+ 𝛽𝛽𝑘𝑘fitted 𝑋𝑋out is
𝑘𝑘 +values.
fitted formulated
by
𝜀𝜀 where
values. using AAnormal theas (𝜆𝜆)
𝑌𝑌 volume
normal below:
R-programming.
=perform
{ of data.
probability
probability model√plotplot The
ofof theFor
.analysis
the residuals
residuals distributed.
Lastly,
divided Lastly, can
cantheregression
model the
bebe If the
standardized
isby
standardized
used
used totoresulting
formulated analysis,
check
check 2. MATERIALS
plot
residuals,
as we
residuals,
whether
whether below: isthe
extract approximately
the 𝑑𝑑the = 𝑛𝑛model
error
error AND subsets
is
isThe are
are
3normally
normally linear,
fromWeawecan
calculated assum
large toco d
analyse the large The 0adequacy checking
is carried toout ensure using the the adequacy R-programming. of built.
entire
𝑀𝑀𝑀𝑀𝑀𝑀
observations. In(𝜆𝜆) this paper, the After
𝑒𝑒 (𝑌𝑌), 𝜆𝜆
logmultiple performing
= linear the
regression model
distributed. and adequacy
divided checking,
regression 𝜆𝜆 transformation
analysis willby be √on dataset
implemented
𝑀𝑀𝑀𝑀𝑀𝑀 will to be perfo
entire distributed.
distributed.
datasetdatasetis is If
partitionedIf
partitioned the
the resulting
resulting
𝛽𝛽0 + model into
linearity
into 𝑛𝑛plot
𝛽𝛽1 𝑋𝑋1 +built plot
sub
sub of is
is approximately
datasets
the
datasets parameters
approximately
regression with model
𝛽𝛽𝑘𝑘 𝑋𝑋calculated
an 3 for
linear,
linear,
appropriate
observations. each
and we we sub-dataset
assume
assume
constant
torequired
determine
size.(𝜆𝜆) The that
that
variance𝑌𝑌 are
fittedthe
the
the𝛽𝛽the computed
,
error
of 𝜆𝜆
error
existence
multiple ≠
errors 0
terms
terms to are
ofmodel.
linear obtain
are
observing
outliersnormally
normally
regression 𝛽𝛽 , plot
𝛽𝛽 , … of , 𝛽𝛽 .
residu A
analyse 𝑌𝑌AND the =METHODS large volume 𝛽𝛽2 𝑋𝑋
of violates
+⋯ + the assumptions
𝑘𝑘 + 𝜀𝜀is where 𝑌𝑌out by =𝑌𝑌{in (𝜆𝜆) linear regression .1 + The 0 transformed
1 𝑛𝑛
he multiple linear 2. MATERIALS
𝑒𝑒 regression For divided
distributed.
distributed.
model andwith divided
regression
𝑘𝑘is regression
independent analysis,
fitted analysis
values.
variables wedata. 2
A iswill
extractnormalThe be 𝑛𝑛analysis
combining
given implemented
by subsets
probability
𝑦𝑦̂of = 𝛽𝛽̂After
𝑛𝑛 𝑌𝑌parameters
carried
from
+plot to a𝑋𝑋
of1large
𝛽𝛽̂performing the
to+estimate 𝛽𝛽 volume
residuals
𝑋𝑋2log
̂2regression using
the =
+ 𝑒𝑒the ⋯model
(𝑌𝑌),Inof
can
+𝛽𝛽.
+
0 this ̂R-programming.
=𝛽𝛽used
data.
𝜆𝜆𝛽𝛽be
𝛽𝛽 10𝑋𝑋formulated
paper,
̂adequacy
is Then, to 𝛽𝛽
the 2 𝑋𝑋
check the
checking,2+
multiple The
whether
as follow:
entire
⋯ +transformatio 𝛽𝛽𝑘𝑘the
linear 𝑋𝑋𝑘𝑘error + 𝜀𝜀 w
regres
siduals, 𝑑𝑑 = with an
are appropriate
calculated dataset to size.
determine The
partitioned fitted
model
the multiple
is
existence
into formulated
𝑛𝑛 sub linear
ofdatasetsoutliers as below:
with of an 0 For
observations.
appropriate divided
1 size. The fitted analysis, 𝑘𝑘𝑐𝑐𝑋𝑋 𝑘𝑘 .
multiple we extract
linear regression 𝑛𝑛 subsets fro
olume of data.√𝑀𝑀𝑀𝑀𝑀𝑀 The analysis After
2.parameters
MATERIALS isfor carried each
After performing AND
performing the out METHODS
sub-dataset by using
distributed.
the model arethe
model adequacy R-programming.
computedIf
adequacy the to
resultingchecking,obtain
checking, The
model plot 𝛽𝛽 entire
is
,built
𝛽𝛽
transformation
transformation violates
approximately
, … , 𝛽𝛽 . After
on the
on dataset analyse assumptions
that,
linear,
dataset 𝛽𝛽
̂̂the
we
will
will is large
assume required
formed ̂ volumêthat by inthe of̂linear )theobtain 𝛽𝛽ana
data.
error The
regression
terms ar
parameters 0 1 for 𝑛𝑛 each sub-dataset 𝛽𝛽 𝑐𝑐𝑐𝑐 =be be𝑓𝑓𝑐𝑐performed
are performed
(𝛽𝛽 1 , 𝛽𝛽2 , … , if
computed 𝛽𝛽if𝑛𝑛𝜆𝜆to
the ,w 𝛽𝛽
d into 𝑛𝑛 sub regression
datasets In with
this
Lastly,
combining model
model
modelThean
paper,
the
𝑛𝑛model
appropriate with
the
standardized
parameters
built
built with multiple
violates
violates 𝑘𝑘 to independent
independent
size. The
linear
distributed.
residuals,
estimate
the
the assumptions fitted
regression
the
assumptions 𝑑𝑑 =
𝛽𝛽. variables
variables multiple
𝛽𝛽 ̂ is
(𝜆𝜆)𝑒𝑒
and
required is given
are
formulated
required linear
divided
calculated
inin by
model
𝛽𝛽linear𝑦𝑦
̂
regression For
=
regression
as
linear is
follow:𝛽𝛽
to
̂
divided +
formulated
regression determine
regression 𝛽𝛽̂
analysis𝑋𝑋 regression
+ as
model.
model. the𝛽𝛽̂dataset
will
below:𝑋𝑋 be
existence + analysis,
⋯ is
implemented+ 𝛽𝛽
of
̂
partitioned we
𝑋𝑋
outliers extract
. (𝜆𝜆)to into of 𝑌𝑌 𝑛𝑛
𝑒𝑒𝑌𝑌 sub , 𝜆𝜆
datasets ≠ 0 0
THODS 𝑘𝑘The
The 𝜀𝜀transformed
transformed ̂regression
regression
𝑌𝑌√𝑐𝑐 𝑀𝑀𝑀𝑀𝑀𝑀 = 𝛽𝛽0𝑓𝑓(VIF) + 0 1 1 2 2 𝑘𝑘 𝑘𝑘
value of variance inflation where
factor isout 1 𝑋𝑋
acombining
combined
1is +Lastly, 𝛽𝛽2 𝑋𝑋𝑛𝑛2function
calculated +
the ⋯standardized
parameters to + 𝛽𝛽 𝑘𝑘 𝑋𝑋to
which
detect + combines
the
estimate where
residuals,
existence the the 𝑌𝑌𝛽𝛽.𝑑𝑑results
=
of
𝛽𝛽 =ispossible
{formulated
from are 1 calculated
𝑠𝑠𝑠𝑠
sub-dat
as foll.
n analysis,
endent variablesweisisanalyse
given
extract
givenIn this the
byby
model
model large
𝑛𝑛 subsets
𝑦𝑦̂paper,
=isis𝛽𝛽 ̂
volume
0thefrom
formulated+ 𝛽𝛽
formulated ̂
multiple
1 𝑋𝑋 of
a1 large+data.
as ̂
linear
𝛽𝛽below:
asbelow: 2 𝑋𝑋 The
volume analysis
+regression
2After ⋯ performing
+ 2.
𝛽𝛽of ̂
̂𝑐𝑐 𝛽𝛽=𝑘𝑘data. MATERIALS
is
𝑋𝑋𝑓𝑓𝑘𝑘and carried
. ̂Then, 𝑐𝑐
, 𝛽𝛽̂2subsets
divided
the …the
,model AND
by
, 𝛽𝛽regression using
from
̂𝑛𝑛 )adequacy METHODS the
a(𝜆𝜆) large
analysis R-programming.
checking, model
volume bewith
willtransformation of
implemented The
𝑘𝑘 independent
data. entire
Then,onin todataset
𝑐𝑐 √ thelog
variables
𝑀𝑀𝑀𝑀𝑀𝑀 (𝑌𝑌), 𝜆𝜆 =
beis perfor
0
given
̂1𝑘𝑘will
𝑒𝑒
observations.
multicollinearity 𝑐𝑐 (𝛽𝛽1variables.
combined function isThe determined 𝛽𝛽0as +the 𝛽𝛽to1mean value,
dataset are dataset
computedanalyse The is
to thepartitioned
obtain
value large of volume
The
𝛽𝛽 , into
variance𝛽𝛽 value
, … 𝑛𝑛between
of
, sub
𝛽𝛽 data.
of.
inflation datasets
After
model
the independent
The
variance that,
factor
built with
analysis inflation
𝛽𝛽
violates an
̂(VIF) is isappropriate
carried
formedfactor
the assumptions
parametersout
by
(VIF)
The
size.
by
observations.
using
is
tolerance
𝑌𝑌fitted
calculated
required
for the
needs
=R-programming.
multiple
in to linear detect 𝑋𝑋be1 +calculated
linear
regression the 𝛽𝛽regression
2 𝑋𝑋
The 2+
existence 𝛽𝛽̂⋯
entire𝑐𝑐 =
model.
+order 𝛽𝛽𝑐𝑐𝑘𝑘(𝛽𝛽
𝑓𝑓of
The
,to+
𝑋𝑋possible 𝛽𝛽̂2𝜀𝜀, …where
transformed
, 𝛽𝛽̂𝑛𝑛 )
e linear regressionwhere and divided is adivided
𝑓𝑓𝑐𝑐For regression
combined 0 regression
1 analysisanalysis,
(𝜆𝜆)function
𝑛𝑛 will
which be implemented
combineswe 𝑐𝑐 2 extract the results
𝑛𝑛𝑅𝑅 tosubsets
2
is from ̂from
1 𝑠𝑠𝑠𝑠
aeach
sub-dataset large sub-dataset
volume to𝑌𝑌𝑌𝑌
{𝑘𝑘 .𝑛𝑛
𝜆𝜆
𝜆𝜆 𝑡𝑡ℎ
of,are
sub-dataset.
,𝜆𝜆to
𝜆𝜆data.
≠ ≠ computed
00a Then, 𝑛𝑛The the to existence
isobtain the VIF, where the determination coefficients of . . regression ofresults
𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡 = 1+ 𝑒𝑒
estimate model
̂𝑐𝑐 is
𝛽𝛽dataset with 𝑘𝑘 independent
partitioned
multicollinearity
(𝜆𝜆)
𝑌𝑌𝑌𝑌into = =
variables
𝑛𝑛thesub 0Lastly,
𝛽𝛽𝛽𝛽0between
+ +datasets
𝛽𝛽
is is𝑋𝑋1the
𝛽𝛽11given
𝑋𝑋 the+− 𝛽𝛽2𝑅𝑅2𝑋𝑋
standardized
𝛽𝛽
by
with 𝑋𝑋
independent
𝑗𝑗𝑦𝑦 2In
an + this
+appropriate
̂existence
= ⋯𝛽𝛽⋯̂0below:
++ residuals,
paper, 𝑗𝑗𝛽𝛽̂𝑘𝑘𝑘𝑘where
𝛽𝛽 1𝑋𝑋
variables. 𝑋𝑋 𝑘𝑘1𝑘𝑘the
+
size.
+ 𝛽𝛽𝑑𝑑
+𝜀𝜀𝜀𝜀multiple
𝑓𝑓 =For
is
where
where
The
The
𝑋𝑋 2a+
divided
fittedlinear
combined
tolerance
⋯ 𝑌𝑌𝑌𝑌+ are
(𝜆𝜆)
(𝜆𝜆)
𝛽𝛽̂= =
multiple {regression
calculated
𝑘𝑘regression
function
𝑋𝑋needs linear
tovalue and
which
be analysis,
determine
divided
regression
calculated 1combines we inextract
the
regression the order 𝑛𝑛 subset
analysis
to of
from o
variance
data. The the 𝛽𝛽.is
inflation
analysis formulated
isfactor
calculated
combined carried
parameters (VIF) outas
to
function forisfollow:
by
detect calculated
using
is
each the
determined sub-dataset model
to as detect
R-programming.
existence the 1formulated
of
are mean the
possible
computed
2The
value, as entire
to of
obtain
obtain possible 𝛽𝛽
𝑐𝑐2
, 𝛽𝛽 , √
… 𝑀𝑀𝑀𝑀𝑀𝑀
, 𝛽𝛽 . After
After log
log The
that,
that, (𝑌𝑌),
(𝑌𝑌), 𝛽𝛽̂𝜆𝜆
𝛽𝛽̂𝜆𝜆𝑐𝑐=1=is
= 0 of
formed
0 formed ∑ variance ̂𝑖𝑖 by
𝛽𝛽 .by inflation facto
̂ model with
̂ variable
̂ obtain
independent ̂𝑗𝑗 on the all VIF, the
variablesother
observations.variables.
is given analyse
by After 2 the that,
̂
where large the ̂ volume
combined
VIF2
isparameters
is
0 ̂
the of
function
formulated
1 data.
determination 𝑛𝑛for The
is aseach
̂ analysis
determined
𝑉𝑉𝑉𝑉𝑉𝑉
multicollinearity sub-dataset
𝑒𝑒𝑒𝑒
=
coefficients is
as 𝑐𝑐 carried
the are
mean
of .
𝑒𝑒 out
VIF
a computed by
value,greater
regression using to the
of obtainR-pr
tween
ub datasetsthe independent
𝛽𝛽𝑐𝑐 =
with 𝑐𝑐 (𝛽𝛽variables.
an 𝑓𝑓combining 𝑘𝑘
1 , 𝛽𝛽2 , …𝑛𝑛
appropriate 𝛽𝛽𝑛𝑛The
, size. ) The tolerance 𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡
needs to = be 1 − 𝑦𝑦
̂ 𝑅𝑅
calculated= 𝛽𝛽 +
0 𝑛𝑛 Lastly, 𝛽𝛽
in 1order 𝑅𝑅𝑋𝑋 𝑗𝑗1 the + 𝛽𝛽 𝑋𝑋
to2standardized + ⋯ + 𝛽𝛽 𝑋𝑋
𝑘𝑘residuals, . 𝑑𝑑 = between 𝑛𝑛 are𝑌𝑌the calculated independent
, 𝜆𝜆 ≠to0 det
thefitted multiple linear 𝛽𝛽̂𝑐𝑐regression 2 𝑘𝑘 𝜆𝜆
multicollinearity
2 Thê than 2̂ orof
parameters
between to estimate
independent the variables.
𝛽𝛽.(𝜆𝜆)
dataset
𝑌𝑌 = is𝑒𝑒𝑗𝑗is
𝑒𝑒𝛽𝛽
formulated
1 partitioned
+ combining
𝛽𝛽1 𝑋𝑋 asinto
1 +combining
follow:
𝛽𝛽to2 𝑋𝑋𝑛𝑛detect +subparameters
⋯ 𝑛𝑛datasets
+ parameters
𝛽𝛽obtain 𝑋𝑋 𝑘𝑘 with +to𝜀𝜀the to
𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡
estimate
anestimate
where appropriate √ the
(𝜆𝜆)
the =𝛽𝛽.{size.
𝑌𝑌𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡
𝑖𝑖=1
𝑀𝑀𝑀𝑀𝑀𝑀 𝛽𝛽̂volume The
isno 𝑛𝑛 fitted
formulated .mu a
𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡
iables
nction which = 1combines
is given −by𝑅𝑅𝑦𝑦 𝑗𝑗̂ = where 𝛽𝛽
the value
+ 𝑅𝑅
results𝛽𝛽 is
variable
𝑋𝑋 equal
the
+
fromvariance
𝛽𝛽̂ on
𝑋𝑋 to
determination
𝑠𝑠𝑠𝑠 + 10
all ⋯ the
sub-dataset implies
inflation
+ ̂
other
𝛽𝛽 For𝑋𝑋 to the
factor divided
coefficients
variables.
. 𝑡𝑡ℎ presence
(VIF) ̂
sub-dataset.regression
of After isa
0 of multicollinearity
calculated
regression
that, Thê analysis,
the VIF of we
2
is problem.
extract
formulated the 𝑘𝑘existence
𝑛𝑛 as A
subsets value VIF,
of from ofpossible 1 1 a indicates
large . VIF
log 𝑐𝑐 = 1 1
greater
(𝑌𝑌), − 𝑅𝑅𝑗𝑗2 0
of whe
data.
The tolerance
Lastly,
0Lastly,
correlation
𝑗𝑗 1 1 the
needs
the standardized
standardized
among
𝑗𝑗
to
1
2 2
be calculated
the
residuals,
residuals,
𝑘𝑘 𝑘𝑘
independent in
𝑛𝑛 𝑑𝑑
𝑑𝑑
model
order
=̂=𝛽𝛽 The
𝑐𝑐 𝑐𝑐=
𝛽𝛽variables. =
√√to with
𝑀𝑀𝑀𝑀𝑀𝑀
𝑀𝑀𝑀𝑀𝑀𝑀𝑓𝑓𝑐𝑐values
𝑛𝑛
are
∑ are
̂
(𝛽𝛽𝑘𝑘1VIFs ,is calculated
calculated
observations.
𝛽𝛽̂ 𝑖𝑖of.
𝛽𝛽2formulated
independent regression
, …between ̂
, 𝛽𝛽𝑛𝑛 ) variables toto
4
determine
determine
as
coefficients
and follow: 10is given
the
the
suggest
existence
existence
𝑉𝑉𝑉𝑉𝑉𝑉
ofby𝑛𝑛𝑦𝑦̂ subsets
further
=
=
ofof outliers
outliers
are
̂0 investigation
𝛽𝛽 +
𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡 𝛽𝛽̂1 𝑋𝑋 taken
𝛽𝛽̂𝛽𝛽̂𝑐𝑐+ =
of
=
of
𝛽𝛽
𝑌𝑌𝑌𝑌
𝑒𝑒̂to compare
𝑓𝑓∑
is 𝑋𝑋(𝛽𝛽 ̂𝜆𝜆+
𝛽𝛽̂,𝑖𝑖=⋯
𝛽𝛽̂.̂2 ,+… w 𝛽𝛽
, ̂
emined
otherasvariables. multicollinearity
the meanAfter Thethat,
value, value than
the between
oforvariance
VIF isequal the
formulated independent
parameters
toinflation
10 as implies factor variables.
forthe(VIF) each 1
approach
presence The
sub-dataset
is𝑖𝑖=1 is
VIF tolerance
. calculated
ofto multicollinearity
determine
greater are needs
to computeddetect
the to be
the
achievability calculated
toexistence
variable
problem. obtain of𝑗𝑗𝑡𝑡ℎon
A 𝛽𝛽in
the of, all
value order
𝛽𝛽divided
, the
…ofextract
possible to
, 𝛽𝛽other . indicates
1
𝑐𝑐After
1𝑛𝑛regression variables.
2 that,
𝑐𝑐
𝑛𝑛subsets
2 1
no After
model. 𝛽𝛽𝑐𝑐from is
where observations.
observations.
𝑓𝑓 is a combined function 𝑉𝑉𝑉𝑉𝑉𝑉
which = combines the results For from divided 1 𝑠𝑠𝑠𝑠 regression
sub-dataset toanalysis,
𝑛𝑛
0
sub-dataset.
1we The 𝑛𝑛
obtain the required.
VIF, where isispossible where is 4aais combined afunction which combines theis results
2 2 𝑒𝑒 ̂𝑐𝑐 to
obtain the
multicollinearity VIF, between
𝑛𝑛𝑐𝑐 of 𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡 the = independent
1Lastly,
−the𝑅𝑅𝑗𝑗independent
combining where
the 𝑛𝑛𝑛𝑛variables.𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡
𝑅𝑅
parameters the The todetermination
tolerance
estimate the =needs 𝑓𝑓coefficients
𝛽𝛽. 𝑐𝑐 𝛽𝛽 be
formulated calculatedofsuggest regression
as infurther
tofollow: order of toimplies 𝑖𝑖=1
10inflation
impliesfactor The values
(VIF)
the presence combinedis1ofcalculated correlation
function
Some
regression
multicollinearity
For
For divided
divided toisamong
assumptions
detect coefficients
regression
regression
determined the
problem.
which
existence
analysis,
as Astandardized
of
2analysis,
the
justify value
mean
subsets
the
𝑗𝑗 variables.
of
wewe
The of
2value,
use extract
extract are
1of
value
residuals,
linear
taken
parameters
indicates 𝑛𝑛 𝑛𝑛VIFs
The
of subsets to
subsets
values
variance
𝑑𝑑no
combined
regression
compare
between forfrom
from
√ each
of𝑀𝑀𝑀𝑀𝑀𝑀
̂
are
awith
regression
inflation
function
models
and
large
large
calculated
for
̂
the
sub-dataset
than 10volume
factor
is
entire
or equal
volume
coefficients
the
̂ determined (VIF)
purpose ̂
are of
determine
dataset.
of computed
to
data.
is data.
as
of
10𝑛𝑛This
ofcalculated
the Then,
predictionThen,
the
investigation
subsets
mean
to the existence
obtain
the
to
value,
are
the taken
are detect
𝛽𝛽0of, 𝛽𝛽to
presence
the
ou,
1
independent theobtain
variables.
the independent variables.
̂
𝛽𝛽
variable = the
determination
approach 𝑗𝑗The ∑
on isVIF,
VIFs to ̂
𝛽𝛽 determine
required.
all
tolerance .
the coefficients
𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡
between other needs the
4 =
variables.
andto 1 of−
achievability
be
observations.
10 a𝑅𝑅
After
calculated
suggest
𝑗𝑗 where
regression of
that, the 𝑅𝑅
furtherthe
in of
𝑗𝑗 divided is
VIF
order the is
investigationto determination
regression
formulated
combining is model.
𝑛𝑛 as 𝛽𝛽 coefficients
𝑉𝑉𝑉𝑉𝑉𝑉
parameters
𝑐𝑐 = 𝑓𝑓
= 𝑐𝑐 (𝛽𝛽 ,
correlation
to
1 𝛽𝛽 1 of ,
estimate
2 … a , . regression
𝛽𝛽 VIF
𝑛𝑛 )
among
the ̂̂ greater
𝛽𝛽. ̂
𝛽𝛽the of
is independent
formulated as variabl
follow
𝑐𝑐 parameters
parameters
𝑛𝑛 𝑖𝑖 for
for each
each sub-dataset
sub-dataset are
are computed
computed
multicollinearity to to
𝑛𝑛 obtain
approachobtain
between 𝛽𝛽is
𝛽𝛽 ,to
,𝛽𝛽𝛽𝛽 determine
the variable,, ,… …
independent, ,𝛽𝛽𝛽𝛽 . . After
After
the that,
that,
achievability
1variables. 𝛽𝛽 𝛽𝛽 The is
is formed
of formed
the divided byby regressio
linear relationship between independent variables and response constant variance 𝑠𝑠𝑠𝑠 oftolerance
errors andneeds to be
00 1 1 𝑛𝑛 𝑛𝑛𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡 𝑐𝑐𝑐𝑐 𝑐𝑐 𝑛𝑛
= 1 − 𝑅𝑅𝑗𝑗2 where 𝑅𝑅𝑗𝑗2 or
variable is combining
the on determination
all the the other coefficients
variables.
where 𝑓𝑓𝑐𝑐 of
After
is athat, regression
that,
combined 1of
function required. 𝑛𝑛1of𝑡𝑡ℎ
variable
than on
𝑗𝑗 𝑖𝑖=1
equal
combining
normality
all
to Some
10 other
implies
𝑛𝑛parameters
𝑛𝑛of assumptions
parameters
the
variables.
error the topresence
to For which
After
estimate
estimate
distribution. divided justify
of
the
the
obtainThese 𝛽𝛽. the
𝛽𝛽. 𝛽𝛽̂𝛽𝛽the =isisuse
̂̂𝑐𝑐VIF
multicollinearity
regression
𝑐𝑐𝛽𝛽𝑐𝑐assumptions
the VIF,
is
formulated
formulated
∑ 𝛽𝛽̂𝑖𝑖 .which
offormulated
linear
analysis, problem.
are
𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡 asregression
asyet wecombines
as extract
follow:
follow:
= to 1 A−models
𝑉𝑉𝑉𝑉𝑉𝑉
be
=2the
value
validated
𝑅𝑅 𝑛𝑛 where
results
offor1the
subsets
𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡 in 𝑅𝑅
.from
2 VIF
the
purpose
indicates
from
is model.the 𝛽𝛽̂a𝑐𝑐 sub-dataset
1greater =of prediction
no
large ̂1 ,̂𝛽𝛽̂2to
𝑓𝑓𝑐𝑐 (𝛽𝛽volume
Hence,
determination 𝛽𝛽 we =, …are , 𝛽𝛽∑
coefficie
sub-d
̂𝑛𝑛data.
) 𝛽𝛽̂ .
coefficients
ons which of
justify 𝑛𝑛 subsets
correlation
the
than use
or ofare
equalamong
linear taken
linearto the to
relationship
regression
10 compare
independent
implies models with
between
combined
parameters
the for
presencethe
variables. 1
the entire
independent
function
for purpose
of VIFs
each dataset.
is determined
between 𝑛𝑛
variables
sub-dataset
of
multicollinearity This
prediction 4 and
as
and are
are theresponse
problem.10 mean
computedsuggest A variable,
value, 𝑗𝑗
further
to
value obtain constant
Some
of investigation
1
𝑗𝑗
assumptions
indicates variance is
no . of
which
After errors 𝑐𝑐
justify
that,and 𝑛𝑛 ̂
the 𝑖𝑖
isus
riables. After that,
e achievability
the
of the
VIF
the VIF
divided
is formulated
is
perform formulated
regression model as as 𝑉𝑉𝑉𝑉𝑉𝑉 = checking .toVIF
adequacy
model.
VIF ̂ ̂
ensure
𝛽𝛽𝛽𝛽 = greater
= 𝑓𝑓 𝑓𝑓thewhere
(𝛽𝛽(𝛽𝛽
𝑖𝑖=1̂
where ̂ ̂̂
adequacy
, ,𝛽𝛽𝛽𝛽 , ,…
𝑓𝑓 … is
, ,𝛽𝛽̂
𝛽𝛽̂a )combined
combined
of
) the model function
function built. 𝛽𝛽
which which
We 0 , 𝛽𝛽 can
1 , …
combines , 𝛽𝛽
combines
conclude
𝑛𝑛 the results
the 𝛽𝛽
from
𝑖𝑖=1 =1
𝑐𝑐
normality of the error distribution. variable 𝑐𝑐
These𝑐𝑐 𝑗𝑗 on 𝑐𝑐𝑐𝑐 all
assumptions 1 1 the 22 other
𝑐𝑐 𝑛𝑛
𝑛𝑛
are variables.
yet toformulated
be After that, the VIF is formulated as 𝑉𝑉𝑉𝑉𝑉𝑉
1validated in𝑡𝑡ℎ the model. Hence, we
𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡 𝑛𝑛
between independent required. variables and ofresponse 𝛽𝛽̂11 linear therelationship between independent var
the presence correlation
greater
of The than
multicollinearity
where
where
among
values
linearity or
perform
of
𝑓𝑓𝑓𝑓𝑐𝑐𝑐𝑐equal
isis
the
aregression
the
aproblem.
combined
combined
model 10variable,
toindependent
regression Acombining
coefficients
implies
adequacy value
function
function model constant
variables.
the of 𝑛𝑛and
which
which
checking 1parameters
of
presence
than
variance
𝑛𝑛 VIFs
indicates subsets
constant
combines
combines
or to
between
ensure
equal
toof
no are
thethe
the
errors
estimate
combined
variance
to
taken
results4 The
results
the results
10
and
and
ofthe
adequacy
implies
to
from
from
10
function
errors compare
𝛽𝛽.
values
from
suggest
𝑐𝑐by isis
𝑠𝑠𝑠𝑠
of
the ̂of
𝑠𝑠𝑠𝑠 withfurther
determined
observing
regression
sub-dataset
sub-dataset
sub-dataset
the
presence model of
investigation
as
entire
̂𝑖𝑖plot
toto
built.
as 𝑛𝑛follow:
𝑛𝑛the𝑡𝑡ℎ dataset.
of
coefficients We
mean
residuals
sub-dataset.
sub-dataset.
multicollinearity sub-dataset.can
isofThis
value, against
𝑛𝑛 The
conclude subsets
The
problem. theThese are tak
A vaa
ror distribution. These
required.
Some assumptions
approach assumptions
fitted values. is to are
determine
andA yet
which
normal to the be
justify validated
the
achievability
probability use inof
plot ofthe
linearthe model.
of thevalue, regression
divided
residuals Hence, regression models
can we be used for
̂
model. the𝛽𝛽 =
purpose normality
̂ ∑ ̂ 𝛽𝛽
of . of
prediction
̂ the error are distribution.
endent variables. VIFs between 4 function 𝛽𝛽 𝑐𝑐 isto
= tocheck
𝑓𝑓 (𝛽𝛽 1𝑛𝑛 whether
, 𝛽𝛽 , … , 𝛽𝛽 the error is normally
)
of10 suggest further investigation is variance 𝑐𝑐 𝑛𝑛
combined
combined function isisregression
determined
determined as
ascorrelation
the
theof mean
mean value, approach 𝑐𝑐determine the𝑛𝑛 achievability of1the divided regr
quacy checkingof to multicollinearity problem. Athe value 1can The combined function is errordetermined as 4the
2
linear ensure
relationship
Some the linearity
assumptions adequacy between the
which ofindependent
the justify model model
built.
variables
use of and
We and
linear constant among
responseconclude
regression the the
variable,
models independent of constant
errors
for the by
variables.
performobserving
variance
purpose of VIFs
model
of plot
errors
prediction of residuals
between
adequacy
𝑠𝑠𝑠𝑠 and arê andagainst
checking 10̂ suggest to ensu
distributed. If the resulting
whereprobability plot is
𝑓𝑓𝑐𝑐 is arequired. approximately
combined linear, we assume that the
𝑖𝑖=1 terms are normally
𝛽𝛽𝑐𝑐 =is normally ∑ to𝛽𝛽and𝑛𝑛𝑖𝑖 . sub-d
𝑡𝑡ℎ
of function which combines thecheck results offrom thesub-dataset
1regression
𝑛𝑛𝑛𝑛
ession model and indicates
constant
normality no
of variancefitted
correlation
the error values.ofdistribution.
errors Aindependent
among normal
byThe the
observing
These independent
values ofplot
assumptions plot
regression
ofand residualsthe
are mean residuals
coefficients
11 against
yet tovalue, can ofbe𝑛𝑛constant used
subsets intolinearity are whether
taken ofthe to compare error withmodel
𝑛𝑛 the entire const d
justify the use oflinear linearrelationship
distributed.
regression distributed.
between
modelsIffor the the
combinedpurposevariables
resulting of prediction
function
plot is is
approximately ̂̂𝑐𝑐𝑐𝑐response
determined
𝛽𝛽
𝛽𝛽 are
= = ∑ ∑ as𝛽𝛽̂be
variable,
̂𝑖𝑖𝑖𝑖the
𝛽𝛽
linear, . .
validated
mean
we value,
assume
the variance
that
model.
the
Hence,
error
errors
terms
we
and
are normally
𝑖𝑖=1
mal probability variables. plot
perform of
normality the residuals
model
VIFs
of the adequacy
between
error can be used
checking
4 and
distribution. approach
to check
10 suggestto
These ensure is
whether tofurtherdetermine
assumptions the the
Some adequacyerror the
assumptions
are is achievability
normally
of
𝑛𝑛𝑛𝑛yetvalues the
to be of whichmodelvalidated of
2.justifythe MATERIALS
built. fitted
divided
the We use values.
can regression
of AND
conclude
linear A normal
model.
METHODS
regression the probabilitymodels plot
for theof th
p
dependent variables and response After performing
variable, constant the model
variance adequacy of errors checking,and The transformation regression onindataset the 𝑛𝑛 model.
coefficients will be Hence, ofperformed 𝑛𝑛 we subsetsif are the taken to c
esulting plot islinearity approximately
perform of thedistributed.
model
model linear,
regression
adequacy
built we model
violates assume
checking the and that
assumptions the
toconstant
ensure error
linear variance
the terms
relationship
required adequacy are
of inapproach normally
errors 𝑖𝑖=1
linear
𝑖𝑖=1
between
of by
the observing
regression
is model
to independent
determine built.
model.plot distributed.
1We
the ofThe
variables residuals
can
achievability
If
andthe
conclude
transformed against
response
of
resulting
the
the regression variable,
divided
plot is constant
regression
approxi
ution. These assumptions investigation
fitted values.The are
The A is
yet
values
values required.
normal
to
After be
of validated
ofprobability
regression
regression
performing in
plot
the
coefficients
coefficients
theof model.
model
the of Hence,
of 𝑛𝑛𝑛𝑛 subsets
adequacy
residuals subsets
can
we checking,
be are
are taken
used taken
to to
transformation
check to compare 𝛽𝛽̂
compare
whether = with
∑with
distributed.
on the
̂
𝛽𝛽
dataset 𝑖𝑖the
error thethe
. isentire
entire
will normally be dataset.
dataset.
performed This
This ifvalidated
the
linearity of
model the regression
is formulated model asbuilt. and
below: constant normality variance of the
of error
errors distribution.
by observing 𝑐𝑐 These
In this
plot 𝑛𝑛 ofassumptions
paper, residuals multiple are
against yet linearto be regression andin
ecking to ensure theSome adequacy
approach
approach ofis
assumptions which
model isthe toto
built model
determine
determineviolates the
the We
achievability
achievability
justify
the can
assumptions the concludeuse ofof thethe
the
required divided
divided in regression
regression
linear regression model.
model. model. The transformed regression
ng the model distributed.
adequacy
fitted checking,
values. If A thenormal resulting
transformation probability plot is on approximately
plot dataset
of theperform
willresiduals be linear, model
performed
can we be assume
adequacy used if the to that
checking
check the
analyse whether error tothe terms
𝑖𝑖=1
ensurethe After
large error are
the is normally
adequacy
performing
volume normally of data.theof model the model
The adequacy
analysis built.is c
odel and constant variance of model errors isbyformulated observing Theasplot values of residualsof regression against coefficients of 𝑛𝑛 subsets are 𝜆𝜆
𝑌𝑌 taken , 𝜆𝜆to≠compare with
below: 0 of the bythe entire an da
of theof linear regression 𝑌𝑌models 𝛽𝛽0 for the purpose
s the plotassumptions distributed.
required inbethe linear (𝜆𝜆) linearity of the The regression
𝑘𝑘 𝑋𝑋values of modelregression
dataset 𝑌𝑌and(𝜆𝜆) model
is= constant
coefficients
partitioned built variance
areviolates
into of 𝑛𝑛. subsets
suberrors assumptions
datasets observing
with requ ap
bility distributed.
residuals can If used toregression
resulting check =plot whether model.
+is 𝛽𝛽
approach is to 1 𝑋𝑋
the 1The
approximately + error 𝛽𝛽2transformed +linear,
𝑋𝑋is2 normally
determine
⋯+
the
regression
𝛽𝛽we 𝑘𝑘assume
achievability
+ 𝜀𝜀 where that
of theplot
the error
divided
{ terms
logregression (𝑌𝑌), 𝜆𝜆 𝜆𝜆 = model.
normally
0 as
as below: ofdistributed.
After performing thethat model adequacy fitted
checking, values. A normal
transformation probability modelofbe 𝑘𝑘the 𝑒𝑒residuals
is formulated , 𝜆𝜆can be
0 below:usedistogiven check bywheth
lot is approximately prediction
linear, weare assume linear 𝑌𝑌relationship
(𝜆𝜆)the = error 𝛽𝛽0 + terms 𝛽𝛽1between are𝛽𝛽normally
𝑋𝑋distributed.
+ 𝑋𝑋2 +If⋯ are+taken 𝛽𝛽 𝑋𝑋 to+oncompare𝜀𝜀
dataset
model will
where with
with
𝑌𝑌 (𝜆𝜆) the
=
performed
independent
{
𝑌𝑌
entire dataset. ifvariables
≠the
.This 𝑦𝑦̂ =
modelAfter builtperformingviolates thetheassumptions model adequacy required 1
𝜆𝜆 checking,
𝑌𝑌 𝑑𝑑 =, 𝜆𝜆 ≠ 0 are in 2
linear𝑒𝑒 transformation the
regression resulting
𝑘𝑘 𝑘𝑘
model.
on plot
dataset is
The approximately
transformed
will be log
performed linear,
regression
(𝑌𝑌), 𝜆𝜆 if= we0
the assume that the er
𝛽𝛽0 + 𝛽𝛽1 𝑋𝑋1 + independent 𝛽𝛽variables and response { variable, . approach is to determine theexistence achievability 𝛽𝛽of the𝛽𝛽of
𝑒𝑒
𝛽𝛽model
model2 𝑋𝑋2 + ⋯
isbuilt Lastly,
+ violates
formulated 𝑘𝑘 𝑋𝑋the 𝑘𝑘 + asstandardized
𝜀𝜀the where
below: residuals,
𝑌𝑌 (𝜆𝜆) = required distributed. calculated determine the 𝑌𝑌 (𝜆𝜆)of=outliers 0+ 1 𝑋𝑋1 𝑌𝑌 + 𝛽𝛽2 𝑋𝑋2 +
del adequacy checking, transformation assumptions
on dataset will be performedlog (𝑌𝑌), in 𝜆𝜆√𝑀𝑀𝑀𝑀𝑀𝑀
=linear 0
if 𝑒𝑒the regression model. The transformed
The value of variance inflation factor (VIF) regression
constant variance
is observations. Lastly, ofas the errors standardizedand normality residuals,
𝑒𝑒
of After the = performing divided are calculated regression
the(𝜆𝜆)model to model.
𝑌𝑌 𝜆𝜆adequacy
determine , 𝜆𝜆 ≠checking, the existence transformation outliers on of dataset
𝑌𝑌variablew
mptions requiredmodel in linear formulated
regression
𝑌𝑌 (𝜆𝜆)
= 𝛽𝛽
below:
model.
+ 𝛽𝛽 𝑋𝑋 The
+ 𝛽𝛽 transformed
𝑋𝑋 + ⋯ +
𝑑𝑑regression
𝛽𝛽 𝑋𝑋 √+ 𝑀𝑀𝑀𝑀𝑀𝑀𝜀𝜀 where 𝑌𝑌 = multicollinearity
{
0 between
. in linear theofindependent
error distribution.
dized residuals, 𝑑𝑑 = 𝑀𝑀𝑀𝑀𝑀𝑀 areobservations.
𝑒𝑒 For
calculated
(𝜆𝜆)
These
divided 0 assumptions
regression
1 1 2 are
2 analysis, model
to determine the existence of outliers of 𝑌𝑌 (𝜆𝜆) obtain yet we to
𝑘𝑘built extract
𝑘𝑘 violates 𝑛𝑛 the
subsets assumptions
from log a
𝑌𝑌 𝑒𝑒 (𝑌𝑌),
𝜆𝜆 large
the required
Lastly, ,𝜆𝜆VIF, volume
𝜆𝜆=≠0the 0𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡 of
standardized data. regression
=1− Then, residuals, the
2
𝑅𝑅𝑗𝑗 where model. 𝑑𝑑 =The 𝑅𝑅𝑗𝑗√2𝑀𝑀i
𝑒𝑒
√ parameters 𝑌𝑌 = 𝛽𝛽
for +
0 each 𝛽𝛽 𝑋𝑋 +
sub-dataset 𝛽𝛽 𝑋𝑋 +
2 model ⋯ +
are computed 𝛽𝛽 𝑋𝑋
is𝑘𝑘 formulated + 𝜀𝜀 where as below: 𝑌𝑌 = { .
that, 𝛽𝛽̂𝑐𝑐 ofis data.
be validated in the Formodel. divided
(𝜆𝜆)
1Hence,1
𝑌𝑌regression
𝜆𝜆
2 we
, 𝜆𝜆 ≠𝑒𝑒performanalysis,
0
𝑘𝑘
we extract 3.to RESULTS obtain 𝑛𝑛 subsets 𝛽𝛽0AND , 𝛽𝛽1 ,from
variable

log , 𝛽𝛽(𝑌𝑌),
DISCUSSION a𝑗𝑗. on
𝑒𝑒 𝑛𝑛observations.
After
large𝜆𝜆 = all0the volume other variables.
formed Then, by
Afterthe that, 𝜆𝜆the
𝑋𝑋1 + 𝛽𝛽2 𝑋𝑋2 + ⋯Lastly, + 𝛽𝛽𝑘𝑘 𝑋𝑋the combining
+ 𝜀𝜀
𝑘𝑘 standardized where 𝑛𝑛 𝑌𝑌 parameters = { to estimate the
. 𝛽𝛽. 𝛽𝛽 ̂ is formulated as follow: (𝜆𝜆) by 𝑌𝑌
egression analysis, model weadequacy extractparameters 𝑛𝑛checkingsubsetsresiduals, for to each
from ensure
log 𝑑𝑑large
a𝑒𝑒 (𝑌𝑌), =the
sub-dataset
𝜆𝜆√= adequacy
volume
𝑒𝑒0
areare calculated
of computed
𝑐𝑐
data. 𝑌𝑌 (𝜆𝜆) Then,to=KC determine
to 𝛽𝛽 Housesales
obtain
the + 𝛽𝛽 𝑋𝑋 the+
𝛽𝛽
than 0 , 𝛽𝛽or
existence
1𝛽𝛽Data
,2…𝑋𝑋equal .⋯of
, 𝛽𝛽+𝑛𝑛obtained
For After
to + outliers
divided
10𝛽𝛽 𝑋𝑋 from
that,
implies + of𝛽𝛽
regression
𝜀𝜀 𝑌𝑌 is formed
̂Kaggle
𝑐𝑐wherethe presence analysis,
𝑌𝑌 = {ofwe
logmu
𝑀𝑀𝑀𝑀𝑀𝑀 0 1 1 2 𝑘𝑘 𝑘𝑘
Lastly, the standardized 𝑛𝑛𝛽𝛽residuals, 𝑑𝑑 = arethe 𝛽𝛽̂calculated
𝑐𝑐 ̂= 𝑓𝑓 (𝛽𝛽 ̂ ,to𝛽𝛽̂2determine
, … , 𝛽𝛽̂𝑛𝑛 )as the existence of outliers of sub-dataset
𝑌𝑌 𝑒𝑒 (
For divided regression analysis, we extract n subsets from a large volume of data. Then, the parameters for each sub-dataset are computed to obtain β̂_1, β̂_2, …, β̂_n. After that, β̂_c is formed by combining the n sets of parameters to estimate β. β̂_c is formulated as follows:

β̂_c = f_c(β̂_1, β̂_2, …, β̂_n),

where f_c is a combined function which combines the results from sub-dataset 1 to sub-dataset n. The combined function is determined as the mean value,

β̂_c = (1/n) Σ_{i=1}^{n} β̂_i.

Some assumptions which justify the use of the linear regression model are a linear relationship between the independent variables and the dependent variable, and normality of the error distribution. These assumptions are checked for every fitted model. In addition, the variance inflation factors (VIFs) are computed for each model to detect correlation among the independent variables. Lastly, the standardized residuals, d = e/√MSE, where e are the residuals of the observations, are calculated to determine the existence of outliers.
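The mean-combined estimator described above can be sketched as follows. This is a minimal illustration on synthetic data; the function names and the per-subset ordinary-least-squares setup are assumptions for the example, not the authors' code (their analysis was run in R on the housing data).

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least squares with an intercept column prepended."""
    Xd = np.column_stack([np.ones(len(X)), X])   # design matrix [1, X1, ..., Xk]
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return beta

def divided_regression(X, y, n_subsets=5, seed=0):
    """Fit OLS on each sub-dataset, then combine by the mean:
    beta_c = (1/n) * sum(beta_i), as in the divided regression scheme."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    betas = [fit_ols(X[part], y[part]) for part in np.array_split(idx, n_subsets)]
    return np.mean(betas, axis=0)

# Synthetic illustration (not the Kaggle housing data):
# y = 2 + 3*x1 - 1*x2 + noise, recovered by the combined estimator.
rng = np.random.default_rng(1)
X = rng.normal(size=(10000, 2))
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=10000)
beta_c = divided_regression(X, y)
print(beta_c)
```

Because each subset estimator is unbiased under the usual OLS assumptions, their mean remains unbiased while allowing each fit to run on a smaller chunk of data.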
1672 Chiang Mai J. Sci. 2022; 49(6)

3. RESULTS AND DISCUSSION
Housesales Data obtained from the Kaggle website (https://www.kaggle.com/swathiachath/kc-housesales-data) on 23 June 2022 is used as the population. There are 21,597 instances in the dataset and four attributes are chosen to be analysed. The independent variables are X1 (size of lot), X2 (size of living area) and X3 (condition), and the response variable is Y (selling price of the house). The equation of the multiple linear regression is stated below:

Ŷ = β̂_0 + β̂_1 X1 + β̂_2 X2 + β̂_3 X3.

3.1 Statistical Analysis of Subsets
The dataset is partitioned into five subsets. Consequently, a multiple linear regression is fitted for each subset and for the overall dataset (21,597 instances). The statistical summaries are shown in Table 1. Table 1 shows that there is only a slight difference between the mean values of the regression coefficients and the values of the regression coefficients obtained from the overall dataset. This shows that the divided regression method works well.

Table 2 shows that all the variance inflation factor (VIF) values are small enough and close to 1.00. Hence, the problem of multicollinearity can be excluded in the models of all the subsets.

Figures 1 (a-e) show that there is no linear relationship between the independent variables and the dependent variable for all subsets. For every subset, the residuals drift apart, which violates the assumption of linearity. The residual plots do not indicate a linear regression function but a curvilinear regression function. Furthermore, the residual plots appear as a funnel shape which opens to the right, indicating that the residuals drift apart from left to right instead of showing a regular spread centred around 0. This implies inconsistent variances of error.

Then, the assumption of normality of errors is checked by the quantile-quantile (q-q) plot for all the subsets. According to Figures 2 (a-e), the q-q plots illustrate that normality of errors does not hold in any of the subsets, as the plots do not show a linear behaviour.

Figures 1 (a-e) and 2 (a-e) indicate that the model obtained is not appropriate, as it violates the assumption of linearity. Hence, transformations of these data by the Box-Cox method are used to amend the skewness of the distribution of errors, the inconsistent variance of errors and the non-linearity of the regression model. The transformed multiple linear regression model of each subset is given by Ŷ = (β̂_0 + β̂_1 X1 + β̂_2 X2 + β̂_3 X3)^λ̄. The estimated values of λ̄ for each subset obtained from R are stated in Table 3.

Table 1. Values of regression coefficients.

Dataset     β_0        β_1     β_2      β_3
Subset 1    -2.148e5   297.7   -0.1683  4.12e4
Subset 2    -2.392e5   297.5   -0.2059  4.82e4
Subset 3    -1.567e5   271.7   -0.5864  3.904e4
Subset 4    -2.099e5   286.5   -0.3673  4.708e4
Subset 5    -1.825e5   273.4   -0.2584  4.813e4
Mean        -2.006e5   285.4   -0.3173  4.473e4
Overall     -1.986e5   285.0   -0.2926  4.411e4

Table 2. Variance inflation factor.

Subset   X1         X2         X3
1        1.019535   1.018988   1.000746
2        1.043432   1.038224   1.005702
3        1.058573   1.057413   1.003035
4        1.034384   1.031992   1.002394
5        1.030604   1.025867   1.006500

Table 3. Estimated values of λ̄.

Subset   λ̄
1        0.16
2        -0.02
3        0.28
4        -0.06
5        0.01
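VIF values like those in Table 2 can be computed as VIF_j = 1/(1 − R_j²), where R_j² comes from regressing predictor j on the remaining predictors. The sketch below uses synthetic, nearly independent predictors (an assumption for the example, not the housing data), so every VIF should sit close to 1 as in Table 2.

```python
import numpy as np

def vif(X):
    """Variance inflation factors: VIF_j = 1 / (1 - R_j^2), where R_j^2 is the
    coefficient of determination from regressing column j of X on the
    remaining columns plus an intercept."""
    n, k = X.shape
    values = []
    for j in range(k):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ coef
        r_squared = 1.0 - resid.var() / X[:, j].var()
        values.append(1.0 / (1.0 - r_squared))
    return np.array(values)

# Independent standard-normal predictors: all VIFs should be close to 1.0.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 3))
print(vif(X))
```

A VIF well above 1 (common rules of thumb flag values above 5 or 10) would indicate that the corresponding predictor is largely explained by the others, i.e., multicollinearity.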

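The Box-Cox exponents λ̄ reported in Table 3 were obtained in R. The following is a self-contained profile-log-likelihood sketch of the same idea on synthetic data; it is an intercept-only illustration with an assumed grid-search implementation, not the authors' exact R procedure (which estimates λ within the regression model).

```python
import numpy as np

def boxcox_lambda(y, lambdas=np.linspace(-2, 2, 401)):
    """Grid search for the Box-Cox exponent maximizing the profile
    log-likelihood of a positive sample y (intercept-only sketch)."""
    y = np.asarray(y, dtype=float)
    n, logy = len(y), np.log(y)
    best_lam, best_llf = None, -np.inf
    for lam in lambdas:
        # Box-Cox transform: log(y) at lambda = 0, else (y^lambda - 1) / lambda.
        yt = logy if abs(lam) < 1e-8 else (y**lam - 1.0) / lam
        # Profile log-likelihood up to an additive constant.
        llf = -0.5 * n * np.log(yt.var()) + (lam - 1.0) * logy.sum()
        if llf > best_llf:
            best_lam, best_llf = lam, llf
    return best_lam

# Log-normal data: the estimate should land near 0, echoing the small
# lambda values of Table 3 (synthetic data, not the housing prices).
rng = np.random.default_rng(2)
lam_hat = boxcox_lambda(np.exp(rng.normal(size=4000)))
print(lam_hat)
```

Small estimates of λ, as in Table 3, mean the selected transformation behaves much like a logarithm, which compresses the right tail of a skewed response.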
Figure 1. Residual plot of subset (a-e).

Figure 2. q-q plot of subset (a-e).
The statistical summaries of the transformed model for each subset are shown in Table 4. After the transformations made on the dependent variables, the residual plots in Figures 3 (a-e) show that the residuals lie within an area centred at around 0, which implies that the model obeys the assumption of linearity. However, there are a few outliers which drift apart from 0; for example, the outliers in subset 3 are far apart from 0. Anyway, the transformations do improve the linearity of the model to a large extent. In addition, the residuals become structureless instead of forming a funnel shape, which implies that the transformed model satisfies the assumption of constant variance of errors.

The q-q plots in Figures 4 (a-e) show a linear behaviour. Hence, we can conclude that the model holds the assumption of normality of errors after applying the transformations on the dependent variable. However, the q-q plots and residual plots of subsets 1 and 3 show that the ranges of the residuals and sample quantiles are rather big compared to the other subsets. Hence, we decide to remove some potential outliers from subsets 1 and 3.

Figures 5 (a-b) are enlargements from Figures 3 (a,c). From Figures 5 (a-b), we notice that there are three obvious outliers in subsets 1 and 3 respectively. The outliers in subset 1 are rows 1281, 4010 and 4021, while the outliers in subset 3 are rows 839, 1796 and 4125. Hence, we decide to remove these outliers from subsets 1 and 3 to improve the result.

The residual and q-q plots obtained after removing the three potential outliers from subsets 1 and 3 are shown in Figures 6 (a-d). Figures 6 (a-d) show that the ranges of the residuals and sample quantiles of subsets 1 and 3 are improved after removing the outliers. Meanwhile, the new values of λ̄ for subsets 1 and 3 are given by 0.01 and 0.02 respectively. The new statistical summaries of subsets 1 and 3 are shown in Table 5.

Table 4. Values of regression coefficients (transformed model).

Dataset     β_0      β_1       β_2           β_3
Subset 1    66370    5.45      -0.000557     935.3
Subset 2    7868     -0.0622   0.00001215    -10.09
Subset 3    267400   46.26     -0.08218      7856
Subset 4    4876     -0.1157   0.00008332    -18.34
Subset 5    11280    0.04376   -0.00004821   6.685

Table 5. Values of regression coefficients (subsets 1 and 3 after removing the outliers).

Dataset     β_0      β_1       β_2            β_3
Subset 1    11270    0.04637   -0.000001981   8.151
Subset 3    12700    0.106     -0.000146      18.63

3.2 Check for Outliers
First, the outlying Y observations are determined by using the standardized residuals, d = e/√MSE. The ideal range for the standardized residuals is from -3 to 3. The standardized residuals are obtained for all subsets to check whether they fall in the expected range. Moreover, the rule of thumb on h_ii is applied in identifying the outlying X observations, where h_ii is the i-th diagonal element of the hat matrix. An outlying X observation having an h_ii value of more than 2(k+1)/n indicates that h_ii is large. Tables 6 and 7 depict that there exist outlying X and Y observations in each subset. Hence, it is important to find out whether the potential outliers are influential to the regression model. The plots of the residuals against the leverage are used to determine the influential cases.

Table 6. The determined Y outliers.

Subset   No. of observations with d > 3   No. of observations with d < -3   No. of potential outliers
1        12                               3                                 15
2        5                                2                                 7
3        29                               2                                 31
4        5                                2                                 7
5        5                                3                                 8

Table 7. The determined X outliers.

Subset   n      H = 2(k+1)/n   No. of observations such that h_ii > H
1        4320   0.00185185     208
2        4320   0.00185185     228
3        4320   0.00185185     219
4        4319   0.00185228     246
5        4318   0.00185271     375

Figures 27-31 show that the data for all subsets fall inside the Cook's distance, which implies that there are no influential cases in the model.
Figure 3. Residual plot of subset after transformation (a-e).
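The two screening rules of Section 3.2 — |d| > 3 on the standardized residuals d = e/√MSE for Y outliers, and h_ii > 2(k+1)/n on the hat-matrix leverages for X outliers — can be sketched as follows. Synthetic data and a plain OLS fit are assumptions for this illustration, not the authors' code; note that dividing instead by √(MSE(1 − h_ii)) would give the internally studentized variant of d.

```python
import numpy as np

def outlier_diagnostics(X, y):
    """Flag Y outliers via standardized residuals d = e / sqrt(MSE) with
    |d| > 3, and X outliers via leverages h_ii > 2(k+1)/n, where h_ii are
    the diagonal elements of the hat matrix H = Xd (Xd'Xd)^{-1} Xd'."""
    n, k = X.shape
    Xd = np.column_stack([np.ones(n), X])              # design matrix with intercept
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    e = y - Xd @ beta                                  # residuals
    mse = e @ e / (n - (k + 1))
    d = e / np.sqrt(mse)                               # standardized residuals
    h = np.diag(Xd @ np.linalg.inv(Xd.T @ Xd) @ Xd.T)  # leverages h_ii
    return d, h, np.abs(d) > 3, h > 2 * (k + 1) / n

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=500)
d, h, y_flags, x_flags = outlier_diagnostics(X, y)
print(y_flags.sum(), x_flags.sum())
```

A useful sanity check on the leverages is that they always sum to the number of fitted parameters (here k + 1 = 4), since the sum equals the trace of the hat matrix.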
Figure 4. q-q plot of subset after transformation (a-e).
4010 and
4010 and 4021
4021 while
while the
the outliers
outliers in
in subset
subset 33 are
are row
row 839,
839, 1796
1796 and
and 4125.
4125. Hence,
Hence, we
we decide
decide to
to remove
remove
these outliers
these outliers from
fromsubsets
subsets 11 and
and 33 to
to improve
improve the
the result.
result.
4010 and 4021 while the outliers in subset 3 are row 839, 1796 and 4125. Hence, we decide to remove
these outliers from subsets 1 and 3 to improve the result.

Figure 21. Residual Plot of Subset 1 after Transformation
Figure 22. Residual Plot of Subset 3 after Transformation

The residual and q-q plots obtained after removing the three potential outliers from subsets 1 and 3 are shown in Figures 23-26.

Figure 23. Residual Plot of Subset 1 after Removing Outliers
Figure 24. q-q Plot of Subset 1 after Removing Outliers
Figure 25. Residual Plot of Subset 3 after Removing Outliers
Figure 26. q-q Plot of Subset 3 after Removing Outliers

Figures 23-26 show that the range of the residuals and sample quantiles of subsets 1 and 3 is improved after removing the outliers. Meanwhile, the new values of λ̄ for subsets 1 and 3 are given by 0.01 and 0.02 respectively. The new statistical summaries of subsets 1 and 3 are shown in Table 5.
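The re-estimation of λ̄ above follows a Box-Cox fit on each cleaned subset. A minimal sketch of how such a λ can be estimated in practice (illustrative only; the data below is synthetic and the scipy routine is an assumed tool, not the authors' exact procedure):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic positive "price-like" values standing in for a subset's
# dependent variable (Box-Cox requires strictly positive data).
y = rng.lognormal(mean=10.0, sigma=0.5, size=1000)

# Maximum-likelihood estimate of the Box-Cox lambda, together with the
# transformed series (y**lam - 1) / lam.
y_transformed, lam = stats.boxcox(y)
print(f"estimated lambda = {lam:.3f}")
```

For log-normally distributed data the estimated λ lies near zero, which matches the small λ̄ values (0.01 and 0.02) reported for the cleaned subsets.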
Table 5. Values of regression coefficients.

Dataset     β0        β1         β2              β3
Subset 1    11270     0.04637    -0.000001981    8.151
Subset 3    12700     0.106      -0.000146       18.63

3.2 Check for Outliers
First, the outlying Y observations are determined by using the standardized residuals, d = e/√MSE. The ideal range for the standardized residuals is from -3 to 3. The standardized residuals are obtained for all subsets to check whether they fall in the expected range. Moreover, the rule of thumb on h_ii is applied in identifying the outlying X observations, where h_ii is the i-th diagonal element of the hat matrix. Outlying X observations having an h_ii value greater than 2(k+1)/n indicate that h_ii is large. Tables 6 and 7 depict that there exist outlying X and Y observations in each subset. Hence, it is important to find out whether the potential outliers are influential to the regression model. The plots of residuals against leverage are used to determine the influential cases.

Table 6. The determined Y outliers.

Subset    Number of observations            No. of potential outliers
          d > 3          d < -3
1         12             3                   15
2         5              7                   12
3         29             2                   31
4         5              2                   7
5         5              3                   8
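The two screening rules above can be sketched as follows (a hand-rolled illustration on synthetic data; `outlier_screens` and the generated X, y are hypothetical, and the standardized residual follows the paper's d = e/√MSE rather than the internally studentized form):

```python
import numpy as np

def outlier_screens(X, y):
    """Flag outlying Y via standardized residuals and outlying X via
    hat-matrix leverage, as in Section 3.2 (k = number of predictors)."""
    n, k = X.shape
    Xd = np.column_stack([np.ones(n), X])        # add intercept column
    H = Xd @ np.linalg.inv(Xd.T @ Xd) @ Xd.T     # hat matrix
    h = np.diag(H)                               # leverages h_ii
    e = y - H @ y                                # residuals
    mse = e @ e / (n - k - 1)
    d = e / np.sqrt(mse)                         # d = e / sqrt(MSE)
    y_out = np.abs(d) > 3                        # outlying Y rule
    x_out = h > 2 * (k + 1) / n                  # outlying X rule of thumb
    return d, h, y_out, x_out

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
y = 5 + X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)
d, h, y_out, x_out = outlier_screens(X, y)
print(y_out.sum(), x_out.sum())
```

Counting the flagged rows per subset is exactly how the totals in Tables 6 and 7 are assembled.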
Table 7. The determined X outliers.

Subset    n       H = 2(k+1)/n = 8/n    No. of observations such that h_ii > H
1         4320    0.00185185            208
2         4320    0.00185185            228
3         4320    0.00185185            219
4         4319    0.00185228            246
5         4318    0.00185271            375
Figures 27-31 show that the data for all subsets fall inside the Cook's distance, which implies that there are no influential cases in the model.
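The Cook's distance screen behind Figures 27-31 can be computed directly as D_i = (e_i² / (p·MSE)) · h_ii/(1−h_ii)², with p = k+1 coefficients. A sketch on synthetic data (the D_i > 1 convention used below is a common rule of thumb, not one stated in the paper):

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance D_i for an OLS fit with an intercept."""
    n, k = X.shape
    Xd = np.column_stack([np.ones(n), X])
    H = Xd @ np.linalg.inv(Xd.T @ Xd) @ Xd.T
    h = np.diag(H)                          # leverages h_ii
    e = y - H @ y                           # residuals
    p = k + 1                               # number of coefficients
    mse = e @ e / (n - p)
    return (e**2 / (p * mse)) * h / (1.0 - h)**2

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 3))
y = 2 + X @ np.array([0.5, 1.0, -1.0]) + rng.normal(size=150)
D = cooks_distance(X, y)
print((D > 1.0).sum())   # influential cases under the D_i > 1 convention
```

For well-behaved data every D_i stays far below the cutoff, which is the "no influential cases" conclusion drawn from the residuals-vs-leverage plots.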

Figure 27. Residuals vs Leverage Plot of Subset 1
Figure 28. Residuals vs Leverage Plot of Subset 2
Figure 29. Residuals vs Leverage Plot of Subset 3
Figure 30. Residuals vs Leverage Plot of Subset 4
Figure 31. Residuals vs Leverage Plot of Subset 5

3.3 Divided Regression Model
The values of regression coefficients in each subset are computed to implement the divided regression model. Then, the mean values are compared to the values of regression coefficients obtained from the overall dataset. The results obtained are shown in Table 8.
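The divided-regression procedure just described can be sketched as: split the data into subsets, fit ordinary least squares in each, and average the coefficients. The sketch below uses synthetic data and five equal subsets (an illustrative assumption mirroring the paper's setup, not its actual dataset):

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients [b0, b1, ..., bk] with an intercept."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return beta

rng = np.random.default_rng(4)
n, k = 5000, 3
X = rng.normal(size=(n, k))
y = 9000 + X @ np.array([0.004, -0.00002, 1.0]) + rng.normal(scale=0.5, size=n)

# Divided regression: fit each of 5 subsets, then average the coefficients.
subset_betas = [fit_ols(Xs, ys)
                for Xs, ys in zip(np.array_split(X, 5), np.array_split(y, 5))]
beta_divided = np.mean(subset_betas, axis=0)

beta_overall = fit_ols(X, y)    # single fit on the entire dataset
print(np.round(beta_divided - beta_overall, 3))   # small differences
```

When the subsets come from the same underlying model, the averaged coefficients track the full-data fit closely, which is the comparison reported in Table 8.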
Table 8. Values of regression coefficients.

Dataset     β0        β1          β2               β3
Subset 1    11270     0.04637     -0.000001981     8.151
Subset 2    7868      -0.0622     0.00001215       -10.09
Subset 3    12700     0.106       -0.000146        18.63
Subset 4    4876      -0.1157     0.00008332       -18.34
Subset 5    11280     0.04376     -0.00004821      6.685
Mean        9598.8    0.003646    -0.0000201442    1.0072
Overall     9084.3    0.00394     -0.000014855     1.598
Table 8 shows that the mean values of the regression coefficients are comparable with the values of the regression coefficients obtained from the entire dataset. As a conclusion, the implementation of the divided regression model is validated, and the divided regression model is obtained by taking the mean values of the regression coefficients and the average value of λ̄ of all the subsets. The divided regression model for forecasting the housing price is stated as below:

Y = (9598.8 + 0.003646X1 − 0.000020144X2 + 1.0072X3)^(1/(−0.008))
4. CONCLUSIONS
Divided regression analysis is proposed to predict the housing price based on the size of lot, size of living area and condition of the house. This methodology promotes the subdivision of the entire dataset into a few unique subsets. Then, multiple linear regression analysis is applied on each subset. Consequently, some model adequacy checking is applied on the datasets. During the model adequacy checking, the datasets are found to fail to satisfy the assumptions of the linear regression model. Hence, transformations by the Box-Cox method are applied on the datasets so that the assumptions of the linear regression model are satisfied. Next, the standardized residuals approach is applied to check for the existence of potential outlying Y observations, while the diagonal entries of the hat matrix are observed to check for the outlying X observations using the rule of thumb. Influential cases are examined by the plots of residuals against leverage. Finally, the multiple linear regression model for each subset is determined, and the divided regression model for the entire dataset is built by taking the mean value of the regression coefficients of each subset. There is only a small difference in the values of the regression coefficients between the divided regression model and the model built by regression analysis on the entire dataset, which suggests that the divided approach is able to find the actual model of the entire dataset. Hence, the performance of the divided regression model is verified.

