MARS and Truncated Spline Approach On Modelling Human Development Index (HDI) in Indonesia
a Ph.D. Student, Department of Statistics, Faculty of Mathematics and Natural Sciences, Sepuluh Nopember Institute of Technology,
Sukolilo, Surabaya 60111, Indonesia
b Lecturer, Department of Statistics, Faculty of Mathematics and Natural Sciences, Sepuluh Nopember Institute of Technology,
Sukolilo, Surabaya 60111, Indonesia
c Lecturer, STIS, Jakarta, Indonesia
Abstract
This study compares the MARS and truncated spline estimators for modelling the Human Development Index (HDI)
of Indonesia in 2012. Both nonparametric approaches are used because there is no obvious pattern in the relationship
between the response variable (HDI) and the predictor variables that influence it. The truncated spline
method allows the regression equation to be a polynomial of 1st (linear), 2nd (quadratic), or 3rd (cubic) order, while
the MARS method accommodates possible interactions among the predictor variables in their effect on the response.
Judged by the maximum coefficient of determination ($R^2$) and the minimum mean square error
(MSE), the MARS estimator gives a better nonparametric regression curve estimate for Indonesia's 2012 HDI
model than the truncated spline estimator.
1. INTRODUCTION
The success of development can be measured by several indicators; a popular one is the Human
Development Index (HDI), introduced by the United Nations Development Programme (UNDP) in 1990.
Compared to other indicators, HDI is more comprehensive because it measures not only
economic growth but also social development and human welfare.
According to Pratowo [10], Yasmeen, Begum, and Mujtaba [13] and Adediran [2], HDI can be
affected by several variables, such as population and economic growth, dependency ratio (DR), open
unemployment rate, percentage of per capita food expenditure, and inflation. The relationship between
these variables can be explained using regression analysis. When the form of the regression curve is
unknown, however, classical parametric regression cannot be used, and a nonparametric approach can be
taken instead. Several methods are often used in nonparametric regression;
two of them are MARS and the truncated spline. Each method has its own
advantages and disadvantages, and the choice between them depends on the pattern in the data. The
truncated spline method allows the regression equation to be a polynomial of 1st (linear), 2nd (quadratic), or 3rd (cubic)
order. Interactions between variables can be accommodated, but doing so requires a far
more complex mathematical model. Meanwhile, interactions among the predictor variables are
easier to handle with the MARS method, but it limits the polynomial order to the 1st (linear). Since each
method has its own advantages and disadvantages, this research aims to compare both methods in
obtaining a model of Indonesia's HDI. We use R-square and MSE as the goodness-of-fit criteria for the
comparison.
2. THEORY
2.1. Truncated spline estimator
The spline estimator is basically obtained from two optimization approaches, i.e. penalized least squares (PLS)
optimization, Wahba [12], and least squares (LS) optimization, Budiantara [5]. The main problem in PLS
optimization is choosing an optimal smoothing parameter (smoothing spline), while in LS optimization it is
choosing optimal knots (truncated spline). In general, the truncated spline regression equation can be
expressed as follows:
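$$ y_i = \beta_{j0} + \sum_{l=1}^{q} \beta_{jl} x_{ji}^{l} + \sum_{s=1}^{r} \gamma_{js} (x_{ji} - t_{js})_+^{q} + \varepsilon_i \tag{1} $$

where

$$ (x_{ji} - t_{js})_+^{q} = \begin{cases} (x_{ji} - t_{js})^{q}, & x_{ji} \ge t_{js} \\ 0, & x_{ji} < t_{js} \end{cases} \tag{2} $$

$\beta$ and $\gamma$ are real constants, $t$ is a knot, $j$ indexes the predictor variables, $i$ indexes the
observations, and $q$ is the polynomial degree (linear, quadratic, cubic). In matrix form, the equation above can
be written as $\mathbf{y} = H\boldsymbol{\theta} + \boldsymbol{\varepsilon}$, where $\boldsymbol{\theta} = (\beta_{j0}, \beta_{j1}, \ldots, \beta_{jq}, \gamma_{j1}, \ldots, \gamma_{jr})^T$ and

$$ H = \begin{bmatrix}
1 & x_{j1} & \cdots & x_{j1}^{q} & (x_{j1} - t_{j1})_+^{q} & \cdots & (x_{j1} - t_{jr})_+^{q} \\
1 & x_{j2} & \cdots & x_{j2}^{q} & (x_{j2} - t_{j1})_+^{q} & \cdots & (x_{j2} - t_{jr})_+^{q} \\
\vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{jn} & \cdots & x_{jn}^{q} & (x_{jn} - t_{j1})_+^{q} & \cdots & (x_{jn} - t_{jr})_+^{q}
\end{bmatrix} \tag{3} $$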
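As an illustration of how the matrix in equation (3) is assembled, the following minimal Python sketch builds it for a single predictor. The function name `truncated_power_basis`, the use of NumPy, and the toy data are assumptions made for this example only; the study itself reports using Matlab.

```python
import numpy as np

def truncated_power_basis(x, knots, q=1):
    """Build a design matrix shaped like H in equation (3) for one predictor.

    Columns: 1, x, ..., x^q, (x - t_1)_+^q, ..., (x - t_r)_+^q.
    """
    x = np.asarray(x, dtype=float)
    knots = np.asarray(knots, dtype=float)
    # Polynomial part: columns x^0, x^1, ..., x^q
    poly = np.vander(x, N=q + 1, increasing=True)
    # Truncated part: (x - t)^q where x >= t, and 0 otherwise (equation (2))
    diff = x[:, None] - knots[None, :]
    trunc = np.where(diff >= 0, diff ** q, 0.0)
    return np.hstack([poly, trunc])

# Toy example (hypothetical data): four observations, one knot at 2.5, linear spline
H = truncated_power_basis([1.0, 2.0, 3.0, 4.0], knots=[2.5], q=1)
print(H)
```

The last column is zero for observations to the left of the knot and grows linearly to its right, which is what lets the fitted curve change slope at the knot.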
The truncated spline estimator depends heavily on the location and the number of knots. One of the
methods often used to choose optimal knots is Generalized Cross Validation (GCV), which can be
expressed as follows:
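$$ GCV(t) = \frac{MSE(t)}{\left[ n^{-1}\, \mathrm{tr}\left( I - A(t) \right) \right]^{2}} \tag{4} $$

$$ MSE(t) = n^{-1} (\mathbf{y} - H\hat{\boldsymbol{\theta}})^{T} (\mathbf{y} - H\hat{\boldsymbol{\theta}}) \tag{5} $$

$$ \hat{\boldsymbol{\theta}} = (H^{T} H)^{-1} H^{T} \mathbf{y} \tag{6} $$

$$ A(t) = H (H^{T} H)^{-1} H^{T} \tag{7} $$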
The optimal knots are the knot values that minimize the GCV value.
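The knot search can be sketched as a simple grid search over candidate knots, scoring each candidate with the GCV of equation (4). This is only a sketch under stated assumptions: one predictor, one knot, a linear spline, noise-free synthetic data, and hypothetical helper names (`gcv_score`, `linear_spline_matrix`). It is not the procedure used to produce the results in this paper.

```python
import numpy as np

def gcv_score(H, y):
    """GCV of equation (4) for a given design matrix H."""
    n = len(y)
    A = H @ np.linalg.pinv(H.T @ H) @ H.T      # hat matrix A(t), equation (7)
    resid = y - A @ y                          # y - H * theta_hat
    mse = resid @ resid / n                    # equation (5)
    return mse / (np.trace(np.eye(n) - A) / n) ** 2

def linear_spline_matrix(x, knot):
    """Columns 1, x, (x - knot)_+ for a one-knot linear truncated spline."""
    return np.column_stack([np.ones_like(x), x, np.maximum(x - knot, 0.0)])

# Synthetic, noise-free data (hypothetical) whose slope changes at x = 6
x = np.linspace(0.0, 10.0, 41)
y = 2.0 + 0.5 * x + 1.5 * np.maximum(x - 6.0, 0.0)

# Evaluate GCV on a grid of candidate knots and keep the minimizer
candidates = np.arange(1.0, 10.0, 1.0)
scores = [gcv_score(linear_spline_matrix(x, t), y) for t in candidates]
best_knot = candidates[int(np.argmin(scores))]
print(best_knot)  # the candidate at the true change point, 6.0
```

With several predictors and several knots per predictor the candidate grid grows combinatorially, which is why manual or exhaustive knot selection becomes laborious.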
2.2. Multivariate Adaptive Regression Spline (MARS) Estimator
The MARS method is a complex combination of splines and recursive partitioning regression
(RPR), introduced by Friedman [7]. The method is well suited to data with many
predictor variables and a non-linear pattern, Munoz and Felicisimo [8]. The MARS model can be
expressed by the following regression equation:
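$$ y_i = \alpha_0 + \sum_{m=1}^{M} \alpha_m \prod_{k=1}^{K_m} \left[ s_{km} \left( x_{ji(k,m)} - t_{j(k,m)} \right) \right]_+ + \varepsilon_i = \alpha_0 + \sum_{m=1}^{M} \alpha_m B_m(x_{ji}) + \varepsilon_i \tag{8} $$

where $\alpha$ is a real constant and the equation can be simplified to:

$$ \mathbf{y} = B\boldsymbol{\alpha} + \boldsymbol{\varepsilon} \tag{9} $$

$$ B = \begin{bmatrix}
1 & \prod_{k=1}^{K_1} \left[ s_{k1} \left( x_{j1(k,1)} - t_{j(k,1)} \right) \right]_+ & \cdots & \prod_{k=1}^{K_M} \left[ s_{kM} \left( x_{j1(k,M)} - t_{j(k,M)} \right) \right]_+ \\
1 & \prod_{k=1}^{K_1} \left[ s_{k1} \left( x_{j2(k,1)} - t_{j(k,1)} \right) \right]_+ & \cdots & \prod_{k=1}^{K_M} \left[ s_{kM} \left( x_{j2(k,M)} - t_{j(k,M)} \right) \right]_+ \\
\vdots & \vdots & \ddots & \vdots \\
1 & \prod_{k=1}^{K_1} \left[ s_{k1} \left( x_{jn(k,1)} - t_{j(k,1)} \right) \right]_+ & \cdots & \prod_{k=1}^{K_M} \left[ s_{kM} \left( x_{jn(k,M)} - t_{j(k,M)} \right) \right]_+
\end{bmatrix} \tag{10} $$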
The best MARS model is obtained by choosing the basis functions that
minimize the GCV value through a stepwise procedure (forward stepwise and backward stepwise). The GCV
in the MARS method can be expressed by the following equations:
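$$ GCV(M) = \frac{MSE}{\left[ 1 - \tilde{C}(M)/n \right]^{2}} \tag{11} $$

$$ MSE = n^{-1} (\mathbf{y} - B\hat{\boldsymbol{\alpha}})^{T} (\mathbf{y} - B\hat{\boldsymbol{\alpha}}) \tag{12} $$

$$ \hat{\boldsymbol{\alpha}} = (B^{T} B)^{-1} B^{T} \mathbf{y} \tag{13} $$

$$ \tilde{C}(M) = C(M) + d \cdot M, \qquad C(M) = M + 1 \tag{14} $$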
Here $M$ is the number of basis functions and $d$ is the penalty factor, whose best value lies in the
interval $2 \le d \le 4$.
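To make the role of the penalty factor concrete, the sketch below evaluates equations (11) to (14) for a fixed basis matrix and two values of d. The function name `mars_gcv`, the synthetic data, and the convention that M counts the non-constant columns of B are assumptions of this example. It does not perform the forward and backward stepwise search itself.

```python
import numpy as np

def mars_gcv(B, y, d=3.0):
    """GCV of equation (11) for a basis matrix B whose first column is the intercept."""
    n, n_cols = B.shape
    M = n_cols - 1                                   # assumed: non-constant basis functions
    alpha_hat, *_ = np.linalg.lstsq(B, y, rcond=None)  # least-squares fit, equation (13)
    resid = y - B @ alpha_hat
    mse = resid @ resid / n                          # equation (12)
    c_tilde = (M + 1) + d * M                        # equation (14)
    return mse / (1.0 - c_tilde / n) ** 2

# Synthetic data (hypothetical): a single hinge at x = 5 plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, 200)
y = 1.0 + 2.0 * np.maximum(x - 5.0, 0.0) + rng.normal(0.0, 0.5, 200)

# Basis matrix with an intercept and a mirrored pair of hinge functions
B = np.column_stack([np.ones_like(x),
                     np.maximum(x - 5.0, 0.0),
                     np.maximum(5.0 - x, 0.0)])

gcv_d2 = mars_gcv(B, y, d=2.0)
gcv_d4 = mars_gcv(B, y, d=4.0)
print(gcv_d2, gcv_d4)
```

For the same fit, a larger d inflates the GCV more strongly, so the backward step is pushed toward models with fewer basis functions.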
2.3. Model selection criterion
The criteria commonly used to select the best model are the minimum mean square error (MSE) and
the maximum coefficient of determination (R-square). A smaller MSE signifies that the
model's estimates are closer to the actual values, while a larger R-square
signifies a better model because it explains more of the variance in the data, Draper and Smith [6].
R-square can be expressed as follows:
$$ R^{2} = \frac{SSR}{SST} = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^{2}}{\sum_{i=1}^{n} (y_i - \bar{y})^{2}} \tag{15} $$
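Both criteria are straightforward to compute once fitted values are available. The following sketch uses hypothetical helper names and a four-point toy data set, not the HDI data. It evaluates the MSE and the R-square of equation (15).

```python
import numpy as np

def mean_square_error(y, y_hat):
    """MSE: average squared difference between observed and fitted values."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return float(np.mean((y - y_hat) ** 2))

def r_squared(y, y_hat):
    """R-square as SSR / SST, following equation (15)."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    y_bar = y.mean()
    ssr = np.sum((y_hat - y_bar) ** 2)   # regression sum of squares
    sst = np.sum((y - y_bar) ** 2)       # total sum of squares
    return float(ssr / sst)

# Toy example (hypothetical): fitted values are the means of two groups
y = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.5, 1.5, 3.5, 3.5]
print(mean_square_error(y, y_hat))  # 0.25
print(r_squared(y, y_hat))          # 0.8
```

For a least-squares fit that includes an intercept, SSR / SST equals 1 - SSE / SST, so either form gives the same R-square.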
In this research, we compare the truncated spline estimator and the MARS estimator to obtain the best
model of Indonesia's 2012 HDI, based on the MSE and R-square criteria.
3. METHODOLOGY
The data used in this study are secondary data from a Central Bureau of Statistics (BPS) publication.
We use all 497 districts in Indonesia as observation units, with HDI as the response variable (Y) and the open
unemployment rate (X1), economic growth (X2), dependency ratio (X3), population growth (X4),
percentage of per capita expenditure on food (X5), and general inflation rate (X6) as predictor variables
(X). The data are processed using Matlab and the MARS software package. The research
proceeds as follows:
1. Obtain the HDI model using the truncated spline estimator.
2. Obtain the HDI model using the MARS estimator.
3. Compare the R-square and MSE values of both models.
4. RESULT AND DISCUSSION
A general description and scatter plots of the research variables are given in Table 1 and Figure 1:
Table 1. Descriptive statistics of the research variables.
Variable     Min      Max    Range     Mean  Variance
y          48.80    80.24    31.44    71.66     27.62
x1          0.15    19.21    19.06     5.44      9.96
x2        -26.98    31.50    58.47     6.55     10.87
x3         35.60    93.49    57.89    56.30     99.33
x4          0.06     7.73     7.67     1.56      1.05
x5         40.92    80.03    39.11    60.82     39.22
x6          2.28    11.82     9.54     6.21      3.10
Figure 1 shows the pattern of the relationship between the response variable and the predictor variables, which is very
difficult to estimate with a parametric regression approach. Thus, nonparametric regression approaches such as
MARS and the truncated spline are used to model the data pattern.
The model can be interpreted as follows. The behaviour of the open unemployment rate (X1)
changes when its value reaches 2.26, 4.38, and 6.50. Likewise, the behaviour of economic growth (X2)
changes when its value reaches -20.49%, -13.99%, and -7.50%. The other variables are
interpreted in the same way.
4.4. HDI modelling using MARS
The minimum GCV value is used not only in the truncated spline estimator but also in the MARS
method. The difference is that the MARS method uses the minimum GCV value to determine the maximum number of basis
functions (BF), the maximum interaction (MI), and the minimum number of observations (MO). For Indonesia's 2012 HDI
model, the minimum GCV value of 8.421 is reached when BF = 24, MI = 3, and
MO = 10. Thus, the estimated HDI model can be written as follows:
Y = 59.221 - 1.352 * BF1 + 0.274 * BF2 - 0.871 * BF3 - 0.211 * BF4 + 0.238 * BF5 + 0.098 * BF7
+ 0.068 * BF8 - 4.579 * BF11 + 3.667 * BF12 - 0.501 * BF13 + 4.600 * BF15 - 0.203 * BF17
- 0.023 * BF19 + 0.129 * BF20 + 0.040 * BF21 + 0.935 * BF23 - 0.019 * BF24;
where :
BF1 = max(0, X5 - 70.330);
BF2 = max(0, 70.330 - X5 );
BF3 = max(0, X2 - 6.881);
BF4 = max(0, 6.881 - X2 );
BF5 = max(0, X1 - 1.268);
BF7 = max(0, X5 - 65.770) * BF3;
BF8 = max(0, 65.770 - X5 ) * BF3;
BF9 = max(0, X5 - 65.150) * BF4;
BF11 = max(0, X6 - 6.550);
BF12 = max(0, 6.550 - X6 );
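The listed basis functions can be evaluated directly for any district. The sketch below does so for an illustrative input taken from the variable means in Table 1 (an assumption made purely for demonstration). Only the basis functions defined above are computed; the fitted value Y is not evaluated because BF13 to BF24 are not listed here. The helper names `hinge` and `basis_functions` are hypothetical.

```python
def hinge(value):
    """Hinge function max(0, value) used by every MARS basis function."""
    return max(0.0, value)

def basis_functions(x1, x2, x5, x6):
    """Evaluate the basis functions listed above, keyed by their BF number."""
    bf = {}
    bf[1] = hinge(x5 - 70.330)
    bf[2] = hinge(70.330 - x5)
    bf[3] = hinge(x2 - 6.881)
    bf[4] = hinge(6.881 - x2)
    bf[5] = hinge(x1 - 1.268)
    bf[7] = hinge(x5 - 65.770) * bf[3]   # interaction between X5 and X2
    bf[8] = hinge(65.770 - x5) * bf[3]
    bf[9] = hinge(x5 - 65.150) * bf[4]
    bf[11] = hinge(x6 - 6.550)
    bf[12] = hinge(6.550 - x6)
    return bf

# Illustrative input: the variable means from Table 1 (not a real district)
bf = basis_functions(x1=5.44, x2=6.55, x5=60.82, x6=6.21)
for number, value in bf.items():
    print(f"BF{number} = {value:.3f}")
```

At this input only BF2, BF4, BF5, and BF12 are non-zero, which shows how each hinge switches a term of the model on or off depending on which side of the knot an observation lies.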
Some of the basis functions, with positive and negative coefficients, can be
interpreted as follows: (1) The coefficient of BF20 is +0.129, meaning that a one-unit increase in BF20
raises HDI by 0.129 points. In other words, HDI rises when the percentage of per capita expenditure on food
is above 69.33 percent, the open unemployment rate is below 6.84 percent, and the general
inflation rate is below 6.55 percent. (2) The coefficient of BF17 is -0.203, meaning that a one-unit increase in
BF17 lowers HDI by 0.203 points. In other words, HDI falls when the open
unemployment rate is above 6.84 percent and the general inflation rate is below 6.55 percent.
4.2. Result comparison between truncated spline estimator and MARS.
The comparison of goodness-of-fit criteria between the truncated spline estimator and MARS
shows that the latter has a relatively higher R-square value than the truncated spline method. Moreover, the
model obtained with the MARS method has a smaller MSE than the truncated spline model, as can
be seen below:
Table 7. Comparison of R-square and MSE values for Indonesia's 2012 HDI model between the truncated
spline estimator and the MARS method
Method                 R2       MSE
Truncated spline   0.6829    8.7748
MARS               0.7510    7.1690
According to Table 7, it can be concluded that the MARS method performs better than the truncated spline
estimator for modelling Indonesia's 2012 HDI. One possible reason is that the
knots of the truncated spline estimator are selected manually. According to Smith [1], manual knot
selection becomes difficult when there are many predictor variables. In contrast,
the MARS method selects its knots automatically from the
data, Hastie, Tibshirani and Friedman [11]. Handling a large number
of predictor variables therefore poses no problem for MARS.
5. CONCLUSION
Based on the comparison results and discussion, several conclusions can be drawn:
1. The relationship between the response variable (HDI) and the predictor variables shows
no obvious pattern, so this research used a nonparametric regression approach to
obtain the models. Two popular methods, MARS and the truncated spline, were used and their
estimators compared with each other.
2. Based on two goodness-of-fit criteria, maximum R-square and minimum MSE, it can be concluded
that the MARS method yields a better HDI model than the truncated spline method.
References
[1] Abraham, A. and Steinberg, D. (2001). MARS: Still an Alien Planet in Soft Computing?. School of Computing
and Information Technology, Salford System. Inc, USA.
[2] Adediran, O.A. (2012). An Assessment of Human Development Index and Poverty Parameters in The
Millennium Development Goals: Evidence From Nigeria. Department of Economics, College of Social and
Management Sciences, Crescent University P.M.B. 2082, Sapon, Abeokuta, Ogun State, Nigeria.
[3]
[4] Badan Pusat Statistik. (2013). Indeks Pembangunan Manusia, Tahun 2012. BPS. Jakarta.
[5] Budiantara, I.N. (2006). Model Spline dengan Knots Optimal. Jurnal Ilmu Dasar 7(2), 77-85.
[6] Draper, N.R. and Smith, H. (1996). Applied Regression Analysis, 2nd edition, John Wiley & Sons, Chapman
and Hall, New York.
[7] Friedman, J.H. (1991). Multivariate Adaptive Regression Splines (with discussion). Annals of Statistics. 19:1-141.
[8] Munoz, J. and Felicisimo, A.M. (2004). Comparison of Statistical Methods Commonly Used in Predictive
Modelling. Journal of Vegetation Science. 15:285-292.
[9] Otok, B.W., Subanar and Guritno, S. (2006). Faktor-Faktor Yang Mempengaruhi Volume Perdagangan Saham
Menggunakan Multivariate Adaptive Regression Splines, Jurnal Widya Manajemen & Akuntansi, Vol 6,
Nomor 3, UWM, Surabaya.
[10] Pratowo, N.I. (2011). Analisis Faktor-Faktor Yang Berpengaruh Terhadap Indeks Pembangunan Manusia.
Jurnal Studi Ekonomi Indonesia, Fakultas Ekonomi Universitas Sebelas Maret.
[11] Hastie, T., Tibshirani, R. and Friedman, J.H. (2008). The Elements of Statistical Learning: Data Mining,
Inference, and Prediction. Springer Series in Statistics, New York.
[12] Wahba, G. (1990). Spline Models for Observational Data. Society for Industrial and Applied Mathematics,
Philadelphia, Pennsylvania.
[13] Yasmeen, G., Begum, R. and Mujtaba, B.G. (2011). Human Development Challenges and Opportunities in
Pakistan: Defying Income Inequality and Poverty. Journal of Business Studies Quarterly. 2011, Vol. 2:1-12.