
2014 International Conference on Statistics and Mathematics (ICSM 2014)

MARS And Truncated Spline Approach On Modelling Human Development Index (HDI) In Indonesia
Ayub Parlin Ampulembang (a), Bambang W. Otok (b), Agnes Tuti Rumiati (b), Budiasih (c)

(a) Ph.D. Student, Statistics Department, Faculty of Mathematics and Natural Sciences, Sepuluh Nopember Institute of Technology,
Sukolilo, Surabaya 60111, Indonesia
(b) Lecturer, Statistics Department, Faculty of Mathematics and Natural Sciences, Sepuluh Nopember Institute of Technology,
Sukolilo, Surabaya 60111, Indonesia
(c) Lecturer, STIS, Jakarta, Indonesia

Abstract
This study compares the MARS and truncated spline estimators for modelling the Human Development Index (HDI)
of Indonesia in 2012. Both nonparametric approaches are used because there is no obvious pattern in the
relationship between the response variable (HDI) and the predictor variables that influence it. The truncated
spline method allows the regression equation to be of 1st (linear), 2nd (quadratic), or 3rd (cubic) polynomial
order, while the MARS method accommodates possible interactions among the predictor variables. The results
show that, based on the maximum coefficient of determination ($R^2$) and the minimum mean square error (MSE),
the MARS estimator gives a better nonparametric regression curve estimate for Indonesia's 2012 HDI model
than the truncated spline estimator.

2014 Published by and/or peer-review under responsibility of ICSM 2014


Keywords: MARS, Truncated Spline, Nonparametric, HDI, MSE, $R^2$

1. INTRODUCTION
The success of development can be measured by several indicators. A popular one is the Human
Development Index (HDI), introduced by the United Nations Development Programme (UNDP) in 1990.
Compared with other indicators, HDI is more comprehensive because it measures not only
economic growth but also the development of social aspects and human welfare.
According to Pratowo [10], Yasmeen, Begum, and Mujtaba [13] and Adediran [2], HDI can be
affected by several variables, such as population and economic growth, dependency ratio (DR), open
unemployment rate, percentage of per capita food expenditure, and inflation. The relationship between
these variables can be explained using regression analysis. However, when the form of the
regression curve is unknown, classical parametric regression cannot be used, and a
nonparametric approach can be taken instead. Several methods are commonly used in nonparametric regression;

Author(write last name only) / ICSM (2014) 000000

two of them are MARS and the truncated spline. Each method has its own
advantages and disadvantages, and the choice between them depends on the pattern of the data. The
truncated spline method allows the regression equation to be of 1st (linear), 2nd (quadratic), or 3rd (cubic)
polynomial order. Interactions between variables can be accommodated, but this requires a far more
complex mathematical model. Meanwhile, interactions among predictor variables are
easier to handle with the MARS method, but it limits the polynomial order to 1st (linear). Since each
method has its own advantages and disadvantages, this research aims to compare both methods in
obtaining Indonesia's HDI model. We use R-square and MSE as the goodness-of-fit criteria for comparing
the two methods.
2. THEORY
2.1. Truncated spline estimator
A spline estimator is basically obtained from one of two optimization approaches, i.e. penalized least
squares (PLS) optimization, Wahba [12], and least squares (LS) optimization, Budiantara [5]. The main problem
in PLS optimization is to choose an optimal smoothing parameter (smoothing spline), while in LS optimization
it is to choose optimal knots (truncated spline). Generally, the truncated spline regression equation can be
expressed as follows:
$$
y_i = \sum_{l=1}^{q} \beta_{jl}\, x_{ji}^{l} + \sum_{s=1}^{r} \gamma_{js}\, (x_{ji} - t_{js})_+^{q} + \varepsilon_i \tag{1}
$$

where

$$
(x_{ji} - t_{js})_+^{q} =
\begin{cases}
(x_{ji} - t_{js})^{q}, & x_{ji} \ge t_{js} \\
0, & x_{ji} < t_{js}
\end{cases} \tag{2}
$$

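$\beta$ and $\gamma$ are real constants, $t$ is a knot, $j$ indexes the predictor variables, $i$ indexes the
observations, and $q$ is the polynomial degree (linear, quadratic, cubic). In matrix form, the equation above can
be written as $\mathbf{y} = \mathbf{H}\boldsymbol{\theta} + \boldsymbol{\varepsilon}$, where
$\boldsymbol{\theta} = (\beta_{j1}, \ldots, \beta_{jq}, \gamma_{j1}, \ldots, \gamma_{jr})^T$ and

$$
\mathbf{H} =
\begin{pmatrix}
x_{j1} & \cdots & x_{j1}^{q} & (x_{j1} - t_{j1})_+^{q} & \cdots & (x_{j1} - t_{jr})_+^{q} \\
x_{j2} & \cdots & x_{j2}^{q} & (x_{j2} - t_{j1})_+^{q} & \cdots & (x_{j2} - t_{jr})_+^{q} \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
x_{jn} & \cdots & x_{jn}^{q} & (x_{jn} - t_{j1})_+^{q} & \cdots & (x_{jn} - t_{jr})_+^{q}
\end{pmatrix} \tag{3}
$$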
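As an illustration of the matrix form, the following minimal Python sketch builds H for a single predictor. It assumes NumPy and follows the sums in Eq. (1), which start at l = 1, so no separate intercept column is included. The function names `truncated_power` and `design_matrix` and the toy data are hypothetical and are not taken from the authors' Matlab code.

```python
import numpy as np

def truncated_power(x, knot, q):
    """Truncated power function (x - knot)_+^q as defined in Eq. (2)."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= knot, (x - knot) ** q, 0.0)

def design_matrix(x, knots, q):
    """Columns x, ..., x^q followed by (x - t_s)_+^q for each knot t_s."""
    x = np.asarray(x, dtype=float)
    poly = [x ** l for l in range(1, q + 1)]
    trunc = [truncated_power(x, t, q) for t in knots]
    return np.column_stack(poly + trunc)

# Toy predictor values (hypothetical), linear spline with one knot at 2.5.
x = np.array([1.0, 2.0, 3.0, 4.0])
H = design_matrix(x, knots=[2.5], q=1)
print(H)
```

Each row of H corresponds to one observation, and the number of columns is q + r, matching the length of the parameter vector.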

The truncated spline estimator depends strongly on the location and the number of knots. One of the
methods often used to choose the optimal knots is Generalized Cross Validation (GCV), which can be
expressed as follows:
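$$
\mathrm{GCV} = \frac{\mathrm{MSE}}{\left[ n^{-1}\, \mathrm{tr}\big( \mathbf{I} - \mathbf{A}(t) \big) \right]^{2}} \tag{4}
$$

$$
\mathrm{MSE} = n^{-1} (\mathbf{y} - \mathbf{H}\hat{\boldsymbol{\theta}})^T (\mathbf{y} - \mathbf{H}\hat{\boldsymbol{\theta}}) \tag{5}
$$

$$
\hat{\boldsymbol{\theta}} = (\mathbf{H}^T \mathbf{H})^{-1} \mathbf{H}^T \mathbf{y} \tag{6}
$$

$$
\mathbf{A}(t) = \mathbf{H} (\mathbf{H}^T \mathbf{H})^{-1} \mathbf{H}^T \tag{7}
$$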


The optimal knots are obtained by finding the knot values that minimize the GCV value.
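The knot search can be sketched as a simple grid search over candidate knots. The Python example below is only an illustration under stated assumptions: it uses NumPy, a single predictor, a linear spline with one knot, and noise-free toy data with a kink at x = 5. The helper names and the candidate grid are hypothetical.

```python
import numpy as np

def design_matrix(x, knots, q):
    """Columns x^1..x^q, then (x - t)_+^q for each knot t (single predictor)."""
    poly = [x ** l for l in range(1, q + 1)]
    trunc = [np.where(x >= t, (x - t) ** q, 0.0) for t in knots]
    return np.column_stack(poly + trunc)

def gcv(H, y):
    """GCV of Eq. (4): MSE / (tr(I - A) / n)^2, with A the hat matrix of Eq. (7)."""
    n = len(y)
    A = H @ np.linalg.solve(H.T @ H, H.T)   # A(t) = H (H'H)^{-1} H'
    resid = y - A @ y                       # y - H theta_hat, Eqs. (5)-(6)
    mse = resid @ resid / n
    return mse / (np.trace(np.eye(n) - A) / n) ** 2

# Toy data (hypothetical): piecewise-linear curve with a kink at x = 5.
x = np.linspace(0.0, 10.0, 21)
y = 2.0 * x - 3.0 * np.maximum(0.0, x - 5.0)

# Evaluate GCV for each candidate knot and keep the minimizer.
candidates = [3.0, 5.0, 7.0]
scores = {t: gcv(design_matrix(x, [t], q=1), y) for t in candidates}
best_knot = min(scores, key=scores.get)
print(best_knot)
```

With several predictors and several knots per predictor the candidate grid grows quickly, which is why the search is usually restricted to a small set of knot combinations.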
2.2. Multivariate Adaptive Regression Spline (MARS) Estimator
The MARS method is a complex combination of splines and recursive partitioning regression
(RPR), introduced by Friedman [7]. This method is well suited to data with many
predictor variables and a non-linear pattern, Munoz and Felicisimo [8]. The MARS model can be
expressed by the following regression equation:
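$$
y_i = \alpha_0 + \sum_{m=1}^{M} \alpha_m \prod_{k=1}^{K_m} \left[ s_{km} \big( x_{ji(k,m)} - t_{j(k,m)} \big) \right]_+ + \varepsilon_i
    = \alpha_0 + \sum_{m=1}^{M} \alpha_m B_m(x_{ji}) + \varepsilon_i \tag{8}
$$

where $\alpha$ is a real constant and $B_m(x_{ji})$ is a basis function. In matrix form, the above equation can be
simplified to:

$$
\mathbf{y} = \mathbf{B}\boldsymbol{\alpha} + \boldsymbol{\varepsilon} \tag{9}
$$

$$
\mathbf{B} =
\begin{pmatrix}
1 & \prod_{k=1}^{K_1} \left[ s_{k1} ( x_{j1(k,1)} - t_{j(k,1)} ) \right]_+ & \cdots & \prod_{k=1}^{K_M} \left[ s_{kM} ( x_{j1(k,M)} - t_{j(k,M)} ) \right]_+ \\
1 & \prod_{k=1}^{K_1} \left[ s_{k1} ( x_{j2(k,1)} - t_{j(k,1)} ) \right]_+ & \cdots & \prod_{k=1}^{K_M} \left[ s_{kM} ( x_{j2(k,M)} - t_{j(k,M)} ) \right]_+ \\
\vdots & \vdots & \ddots & \vdots \\
1 & \prod_{k=1}^{K_1} \left[ s_{k1} ( x_{jn(k,1)} - t_{j(k,1)} ) \right]_+ & \cdots & \prod_{k=1}^{K_M} \left[ s_{kM} ( x_{jn(k,M)} - t_{j(k,M)} ) \right]_+
\end{pmatrix} \tag{10}
$$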
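To make the structure of the basis matrix concrete, the following Python sketch assembles B from products of hinge functions. It is a minimal illustration assuming NumPy. The representation of a basis function as a list of (variable index, knot, sign) factors and the function names are hypothetical choices, not part of the MARS software used in this study.

```python
import numpy as np

def hinge(x, knot, sign):
    """One factor [sign * (x - knot)]_+ with sign = +1 or -1."""
    return np.maximum(0.0, sign * (np.asarray(x, dtype=float) - knot))

def basis_matrix(X, terms):
    """Build B: a column of ones, then one column per basis function B_m.

    terms is a list of basis functions; each basis function is a list of
    (variable_index, knot, sign) factors that are multiplied together.
    """
    n = X.shape[0]
    cols = [np.ones(n)]                     # column for the constant alpha_0
    for factors in terms:
        col = np.ones(n)
        for j, t, s in factors:
            col = col * hinge(X[:, j], t, s)
        cols.append(col)
    return np.column_stack(cols)

# Toy data (hypothetical): three observations, two predictors.
X = np.array([[0.0, 3.0],
              [2.0, 1.0],
              [3.0, 4.0]])
terms = [
    [(0, 1.0, +1)],                 # B_1 = max(0, x0 - 1)
    [(0, 1.0, -1), (1, 2.0, +1)],   # B_2 = max(0, 1 - x0) * max(0, x1 - 2)
]
B = basis_matrix(X, terms)
print(B)
```

The second basis function is an interaction term: it is nonzero only where both of its factors are positive.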

The best MARS model is obtained by choosing the basis functions that minimize the GCV value
through a stepwise procedure (forward stepwise and backward stepwise). The GCV
in the MARS method can be expressed by the following equations:
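$$
\mathrm{GCV} = \frac{\mathrm{MSE}}{\left[ 1 - C(\tilde{M})/n \right]^{2}} \tag{11}
$$

$$
\mathrm{MSE} = n^{-1} (\mathbf{y} - \mathbf{B}\hat{\boldsymbol{\alpha}})^T (\mathbf{y} - \mathbf{B}\hat{\boldsymbol{\alpha}}) \tag{12}
$$

$$
\hat{\boldsymbol{\alpha}} = (\mathbf{B}^T \mathbf{B})^{-1} \mathbf{B}^T \mathbf{y} \tag{13}
$$

$$
C(\tilde{M}) = C(M) + d \cdot M, \qquad C(M) = M + 1 \tag{14}
$$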

Here $M$ is the number of basis functions and $d$ is a penalty factor, whose best value lies
in the interval $2 \le d \le 4$.
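The penalized GCV can be sketched in a few lines of Python. This is a minimal illustration assuming NumPy and following Friedman's definition, in which the complexity term is divided by the number of observations n. The function name `mars_gcv`, the default d = 3 and the toy data are hypothetical.

```python
import numpy as np

def mars_gcv(y, y_hat, M, d=3.0):
    """GCV with complexity C(M~) = (M + 1) + d * M, as in Eq. (14)."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    n = y.size
    mse = np.mean((y - y_hat) ** 2)          # Eq. (12)
    c_tilde = (M + 1) + d * M                # penalized number of parameters
    return mse / (1.0 - c_tilde / n) ** 2    # Eq. (11)

# Toy example (hypothetical): 100 observations, every residual equal to -1.
y = np.zeros(100)
y_hat = np.ones(100)
value = mars_gcv(y, y_hat, M=4, d=3.0)
print(value)
```

A larger d penalizes each additional basis function more heavily, so the backward stepwise pass tends to keep fewer terms.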
2.3. Model selection criterion
The criteria commonly used to select the best model are the minimum mean square error (MSE) and
the maximum coefficient of determination (R-square). A smaller MSE signifies that the
model's estimates are closer to the actual values, while a larger R-square
signifies a better model because it explains more of the variance in the data, Draper and Smith [6].
R-square can be expressed as follows:
$$
R^2 = \frac{SSR}{SST} = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \tag{15}
$$
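Both criteria are straightforward to compute from observed and fitted values. The Python sketch below assumes NumPy; the function names and the four toy values are hypothetical and serve only to show the arithmetic.

```python
import numpy as np

def r_square(y, y_hat):
    """R-square as SSR / SST, following Eq. (15)."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    y_bar = y.mean()
    ssr = np.sum((y_hat - y_bar) ** 2)   # regression sum of squares
    sst = np.sum((y - y_bar) ** 2)       # total sum of squares
    return ssr / sst

def mse(y, y_hat):
    """Mean square error: average squared residual."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return np.mean((y - y_hat) ** 2)

# Toy values (hypothetical): observed and fitted responses.
y = [2.0, 4.0, 6.0, 8.0]
y_hat = [3.0, 3.0, 7.0, 7.0]
r2_value = r_square(y, y_hat)
mse_value = mse(y, y_hat)
print(r2_value, mse_value)
```

In this toy case SSR = 16 and SST = 20, so R-square is 0.8, and every residual has magnitude 1, so MSE is 1.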


In this research, we compare the truncated spline estimator and the MARS estimator to obtain the best
model of Indonesia's 2012 HDI, based on the MSE and R-square criteria.
3. METHODOLOGY
The data used in this study are secondary data from Central Bureau of Statistics (BPS) publications.
We use all 497 districts in Indonesia as observation units, with HDI as the response variable (Y) and open
unemployment rate (X1), economic growth (X2), dependency ratio (X3), population growth (X4),
percentage of per capita expenditure for food (X5) and general inflation rate (X6) as predictor variables
(X). The data are processed using Matlab and the MARS software package. The purpose of this research
is achieved through the following procedure:
1. Obtain the HDI model using the truncated spline estimator.
2. Obtain the HDI model using the MARS estimator.
3. Compare the R-square and MSE values of both models.
4. RESULT AND DISCUSSION
A general description and scatter plots of the research variables can be seen in Table 1 and Figure 1:
Table 1. Descriptive statistics of the research variables.

Variable   Min      Max     Range   Mean    Variance
y          48.80    80.24   31.44   71.66   27.62
x1          0.15    19.21   19.06    5.44    9.96
x2        -26.98    31.50   58.47    6.55   10.87
x3         35.60    93.49   57.89   56.30   99.33
x4          0.06     7.73    7.67    1.56    1.05
x5         40.92    80.03   39.11   60.82   39.22
x6          2.28    11.82    9.54    6.21    3.10

Figure 1. Scatter Plot of y, x1, x2, x3, x4, x5, x6

Figure 1 shows that the pattern of the relationship between the response variable and the predictor
variables is very difficult to capture with a parametric regression approach. Thus, nonparametric
regression approaches such as MARS and the truncated spline are used to model the data.


4.1. HDI modelling using truncated spline estimator


The truncated spline estimator selects the polynomial order and the number and positions of the
knots based on the minimum GCV value. For the 2012 Indonesia HDI model obtained by
this method, the minimum GCV value of 9.981 is reached at 3rd order (q), with the number of
knots (t) = 3. Thus, the estimated HDI model can be written as follows:
y 10.86 x1 6.08 x12 1.03 x13 1.24( x1 2.26)3 0.31( x1 4.38) 3 0.10( x1 6.50)3 17.85 x2
38.86( x2 20.49) 25.98( x2 13.99) 5.20( x2 7.50) 0.15 x3 0.20( x3 42.02) 0.09( x3 48.45)
0.01( x3 54.89) 1.77 x4 0.97 x42 1.36( x4 0.90) 2 1.29( x4 1.75) 2 1.48( x4 2.60) 2 45.66 x5
1.20 x52 0.01x53 0.04( x5 45.26) 3 +0.05( x5 49.60) 3 0.02( x5 53.95)3 21491.11x6 24822.35 x62
7420 x6 8495.41( x6 1.30) 1085.97( x6 2.62) +11.16( x6 3.93)
3

The model can be interpreted as follows: the behaviour of the open unemployment rate (X1)
changes when its value reaches 2.26, 4.38 and 6.50. Similarly, the behaviour of economic growth (X2)
changes when its value reaches -20.49%, -13.99%, and -7.50%. The other variables are
interpreted in the same way.
4.2. HDI modelling using MARS
The minimum GCV value is used not only by the truncated spline estimator but also by the MARS
method. The difference is that the MARS method uses the minimum GCV value to determine the maximum
number of basis functions (BF), the maximum interaction (MI) and the minimum number of observations (MO).
For the 2012 Indonesia HDI model obtained by this method, the minimum GCV value of 8.421 is reached
when BF = 24, MI = 3 and MO = 10. Thus, the estimated HDI model can be written as follows:
Y = 59.221 - 1.352 * BF1 + 0.274 * BF2 - 0.871 * BF3 - 0.211 * BF4 + 0.238 * BF5 + 0.098 * BF7
    + 0.068 * BF8 - 4.579 * BF11 + 3.667 * BF12 - 0.501 * BF13 + 4.600 * BF15 - 0.203 * BF17
    - 0.023 * BF19 + 0.129 * BF20 + 0.040 * BF21 + 0.935 * BF23 - 0.019 * BF24;
where :
BF1 = max(0, X5 - 70.330);
BF2 = max(0, 70.330 - X5 );
BF3 = max(0, X2 - 6.881);
BF4 = max(0, 6.881 - X2 );
BF5 = max(0, X1 - 1.268);
BF7 = max(0, X5 - 65.770) * BF3;
BF8 = max(0, 65.770 - X5 ) * BF3;
BF9 = max(0, X5 - 65.150) * BF4;
BF11 = max(0, X6 - 6.550);
BF12 = max(0, 6.550 - X6 );

BF13 = max(0, X5 - 66.090) * BF12;


BF15 = max(0, X6 - 3.970);
BF17 = max(0, X1 - 6.840) * BF12;
BF18 = max(0, 6.840 - X1 ) * BF12;
BF19 = max(0, X3 - 35.600) * BF15;
BF20 = max(0, X5 - 69.330) * BF18;
BF21 = max(0, 69.330 - X5 ) * BF18;
BF23 = max(0, 1.464 - X4 ) * BF9;
BF24 = max(0, X2 + 26.975) * BF18;

Some of the basis functions, with positive and negative coefficients, can be interpreted as follows:
(1) The BF20 coefficient of +0.129 means that a one-unit increase in BF20 raises HDI by 0.129 points.
In other words, HDI rises if the percentage of per capita expenditure for food is above 69.33 percent,
the open unemployment rate is below 6.84 percent, and the general inflation rate is below 6.55 percent.
(2) The BF17 coefficient of -0.203 means that a one-unit increase in BF17 lowers HDI by 0.203 points.
In other words, HDI decreases if the open unemployment rate is above 6.84 percent and the general
inflation rate is below 6.55 percent.
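The conditions under which BF20 and BF17 are active can be checked numerically. The Python sketch below uses the knots and coefficients reported above for BF12, BF17, BF18 and BF20; the two sets of district values are hypothetical and chosen only to trigger each case.

```python
def hinge(value):
    """max(0, value), the building block of each MARS basis function."""
    return max(0.0, value)

def interaction_terms(x1, x5, x6):
    """Evaluate BF17 and BF20 from the reported basis function definitions."""
    bf12 = hinge(6.550 - x6)              # inflation below 6.55
    bf17 = hinge(x1 - 6.840) * bf12       # unemployment above 6.84
    bf18 = hinge(6.840 - x1) * bf12       # unemployment below 6.84
    bf20 = hinge(x5 - 69.330) * bf18      # food expenditure share above 69.33
    return {"BF17": bf17, "BF20": bf20}

# Hypothetical district A: low unemployment, high food share, low inflation.
a = interaction_terms(x1=5.0, x5=72.0, x6=4.0)
# Hypothetical district B: high unemployment, low inflation.
b = interaction_terms(x1=8.0, x5=60.0, x6=4.0)

contribution_a = 0.129 * a["BF20"]    # positive contribution to HDI
contribution_b = -0.203 * b["BF17"]   # negative contribution to HDI
print(contribution_a, contribution_b)
```

Because each basis function is a product of hinge factors, it contributes to the prediction only when all of its conditions hold at once, and it is exactly zero otherwise.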
4.3. Result comparison between truncated spline estimator and MARS
The comparison of goodness-of-fit criteria between the truncated spline estimator and MARS
shows that the latter has a higher R-square value than the truncated spline method. Moreover, the
model obtained using the MARS method has a smaller MSE than the truncated spline method, as can
be seen below:
Table 7. Comparison of R-square and MSE values for the 2012 Indonesia HDI model between the truncated
spline estimator and the MARS method

Method             R2       MSE
Spline Truncated   0.6829   8.7748
MARS               0.7510   7.1690

According to Table 7, it can be concluded that the MARS method performs better than the truncated spline
estimator in modelling Indonesia's 2012 HDI. One reason that could explain this result is that the
knots of the truncated spline estimator are selected manually. According to Smith [1], when there are
many predictor variables, it is difficult to select knots manually. In contrast, in the MARS method
the knots are selected automatically from the data, Hastie, Tibshirani and Friedman [11]. Thus,
handling a large number of predictor variables is not a problem for MARS.
5. CONCLUSION
Based on the comparison results and discussion, several conclusions can be drawn:
1. The relationship between the response variable (HDI) and the predictor variables shows no
obvious pattern, so this research used a nonparametric regression approach to
obtain the models. Two popular methods, MARS and the truncated spline, are used and their
estimators are compared with each other.
2. Based on two goodness-of-fit criteria, maximum R-square and minimum MSE, it can be concluded
that the MARS method gives a better HDI model than the truncated spline method.
References
[1] Abraham, A. and Steinberg, D. (2001). MARS: Still an Alien Planet in Soft Computing? School of Computing
and Information Technology, Salford System. Inc, USA.

[2] Adediran, O.A. (2012). An Assessment of Human Development Index and Poverty Parameters in The
Millennium Development Goals: Evidence From Nigeria. Department of Economics, College of Social and
Management Sciences, Crescent University P.M.B. 2082, Sapon, Abeokuta, Ogun State, Nigeria.

[3] Badan Pusat Statistik. (2013). Data Strategis BPS. BPS. Jakarta.

[4] Badan Pusat Statistik. (2013). Indeks Pembangunan Manusia, Tahun 2012. BPS. Jakarta.

[5] Budiantara, I.N. (2006). Model Spline dengan Knots Optimal. Jurnal Ilmu Dasar 7(2), 77-85.

[6] Draper, N.R. and Smith, H. (1996). Applied Regression Analysis, 2nd edition, John Wiley & Sons, Chapman
and Hall, New York.

[7] Friedman, J.H. (1991). Multivariate Adaptive Regression Splines (with discussion). The Annals of
Statistics. 19:1-141.

[8] Munoz, J. and Felicisimo, A.M. (2004). Comparison of Statistical Methods Commonly Used in Predictive
Modelling. Journal of Vegetation Science. 15:285-292.

[9] Otok, B.W., Subanar and Guritno, S. (2006). Faktor-Faktor Yang Mempengaruhi Volume Perdagangan Saham
Menggunakan Multivariate Adaptive Regression Splines, Jurnal Widya Manajemen & Akuntansi, Vol 6,
Nomor 3, UWM, Surabaya.

[10] Pratowo, N.I. (2011). Analisis Faktor-Faktor Yang Berpengaruh Terhadap Indeks Pembangunan Manusia.
Jurnal Studi Ekonomi Indonesia, Fakultas Ekonomi Universitas Sebelas Maret.

[11] Hastie, T., Tibshirani, R. and Friedman, J.H. (2008). The Elements of Statistical Learning: Data Mining,
Inference, and Prediction. Springer Series in Statistics, New York.

[12] Wahba, G. (1990). Spline Models for Observational Data. Society for Industrial and Applied Mathematics,
Philadelphia, Pennsylvania.

[13] Yasmeen, G., Begum, R. and Mujtaba, B.G. (2011). Human Development Challenges and Opportunities in
Pakistan: Defying Income Inequality and Poverty. Journal of Business Studies Quarterly. 2011, Vol. 2:1-12.
