
SOCA

DOI 10.1007/s11761-012-0122-2

SPECIAL ISSUE PAPER

Data mining for unemployment rate prediction using search engine query data

Wei Xu · Ziang Li · Cheng Cheng · Tingting Zheng

Received: 31 March 2012 / Revised: 22 July 2012 / Accepted: 25 October 2012


© Springer-Verlag London 2012

Abstract Unemployment rate prediction has become critically significant, because it can help government to make decisions and design policies. In previous studies, traditional univariate time series models and econometric methods for unemployment rate prediction have attracted much attention from governments, organizations, research institutes, and scholars. Recently, novel methods using search engine query data were proposed to forecast the unemployment rate. In this paper, a data mining framework using search engine query data for unemployment rate prediction is presented. Under the framework, a set of data mining tools including neural networks (NNs) and support vector regressions (SVRs) is developed to forecast the unemployment trend. In the proposed method, search engine query data related to employment activities is firstly extracted. Secondly, a feature selection model is suggested to reduce the dimension of the query data. Thirdly, various NNs and SVRs are employed to model the relationship between the unemployment rate data and the query data, and a genetic algorithm is used to optimize the parameters and refine the features simultaneously. Fourthly, an appropriate data mining method is selected as the selective predictor by using the cross-validation method. Finally, the selective predictor with the best feature subset and proper parameters is used to forecast the unemployment trend. The empirical results show that the proposed framework clearly outperforms the traditional forecasting approaches, and that support vector regression with a radial basis function (RBF) kernel is dominant for unemployment rate prediction. These findings imply that the data mining framework is efficient for unemployment rate prediction, and that it can strengthen government's quick responses and service capability.

Keywords Unemployment rate prediction · Data mining · Search engine query data · Government service

W. Xu (B) · Z. Li · C. Cheng
School of Information, Renmin University of China, Beijing 100872, China
e-mail: weixu@ruc.edu.cn
Z. Li, e-mail: ziang_lee@126.com
C. Cheng, e-mail: chengcheng_ruc@126.com

T. Zheng
School of Economics and Management, Tsinghua University, Beijing 100084, China
e-mail: zhengtingting@hotmail.com

1 Introduction

Unemployment rate prediction has become critically significant, in particular during economic recession, because it can not only help government to make decisions and design policies, but also give practitioners a better understanding of the future economic trend. In recent years, forecasting the unemployment rate has attracted much attention from governments, organizations, research institutes, and scholars, and a great number of methods have been proposed for unemployment rate prediction. Traditional univariate time series models have been proposed for unemployment rate prediction [3,13,20,22]. For example, a time deformation model is applied to US unemployment data, and the experimental results indicate that the proposed method has better performance than other better-known models, such as the autoregressive integrated moving average (ARIMA) model [22]. Similarly, an autoregressive fractionally integrated moving average (ARFIMA) model is offered to analyze the US unemployment trend, and the results show that ARFIMA has a better forecasting performance than the threshold autoregressive (TAR) and symmetric ARFIMA models [13].


Some macroeconomic variables, such as money supply, producer price index, interest rate, and gross national product (GNP), have been considered in unemployment rate prediction [10–12,15–17,21]. A smooth transition vector error-correction model (STVECM) is used to forecast the unemployment rates of the four non-Euro G-7 countries in terms of economic indicators [15]. Similarly, a Markov-switching vector error-correction model (MS-VECM) is suggested to analyze the UK labor market [12]. Moreover, univariate and multivariate functional coefficient autoregressive (FCAR) models are presented and evaluated for multi-step unemployment rate prediction [10]. A pattern recognition method is developed to analyze the specific phenomenon of fast acceleration of unemployment [11].

In recent years, Web information has been regarded as a useful resource for analyzing socioeconomic hot spots, such as influenza epidemics detection [8,23] and financial market prediction [2,14,18], and unemployment rate prediction using Web information has attracted more attention from researchers and practitioners [1,4–7,19]. A new method of using data on internet activity is proposed to demonstrate strong correlations between keyword searches and unemployment rates, and the experimental results show that the method has strong potential for unemployment rate prediction [1]. An internet job-search indicator called Google Index (GI) is offered as the best leading indicator to predict the US unemployment rate, and an out-of-sample comparison with other forecasting models shows that the GI indeed helps in predicting the US unemployment rate even after controlling for the effects of data snooping [6], while the power of a novel indicator based on job-search-related Web queries is employed to predict quarterly unemployment rates in short samples [7]. Similarly, the popularity of Web searches tracked by Google is suggested as an indicator of contemporaneous economic activity, before the official data become available and/or are revised [19]. Moreover, Google Trends data are suggested to forecast the US unemployment time series, and the forecasting accuracy could be improved significantly by using Google Trends [4,5].

Different from the previous studies, a data mining method using neural networks has been used to forecast the unemployment rate with search engine query data, and the experimental results show that the proposed method outperforms the traditional methods [24]. Furthermore, combining search engine query data and time series data, a hybrid forecasting model is suggested to improve the performance of unemployment rate prediction [25]. Since data mining techniques can make a significant contribution to unemployment rate prediction, in this paper, a data mining framework using search engine query data for unemployment rate prediction is presented, and within the proposed framework, various data mining tools are validated and compared to examine the efficiency and effectiveness of the proposed framework. In the proposed framework, an automated feature selection model is firstly constructed to reduce the dimension of the query data. Secondly, different data mining tools are employed to describe the relationship between the unemployment rate data and the search engine query data. Thirdly, an optimal data mining model is selected as the predictor by using the cross-validation method. Finally, the selected predictor with proper parameters and the best feature subset is used to forecast the unemployment trend.

The rest of this paper is organized as follows. The next section introduces some basic concepts of the data mining tools used in this paper, including NNs and SVRs. The data mining framework using search engine query data is proposed for unemployment rate prediction in Sect. 3. For illustration, the efficiency of the proposed framework and an empirical analysis of the unemployment trend using the data mining tools are reported in Sect. 4. Finally, conclusions and future research directions are summarized in Sect. 5.

2 Introduction to data mining tools

Data mining is a technique that investigates the internal rules of data by analyzing large quantities of data. In other words, it is a technique that transforms large data into useful information. Data mining makes use of the theories of statistics, artificial intelligence, and other fields. In this paper, neural networks and support vector regressions are used for mining the internal rules of search engine query data and predicting the unemployment rate.

2.1 Neural networks

A neural network is a mathematical model that imitates the structure and functions of biological neural networks. A neural network consists of different interconnected artificial neurons that are distributed in an input layer, hidden layer(s), and an output layer. Generally, in the learning phase, the neural network can change its structure based on the information that flows through the network. This nonlinear computational model is widely used in detecting the complex relationship between the input and the output data.

Back-propagation neural network (BPNN) is a widely used neural network model, in which the information is transferred from the input layer to the output layer via the hidden layer(s). When the practical output is different from the estimated output, the weights and thresholds are adjusted by the back-propagation process of errors, as shown in Fig. 1.


Fig. 1 The structure of BPNN (neurons arranged in an input layer, a hidden layer, and an output layer)

When the first input information flows through the network and the output information is produced, the back-propagation process is commenced. As mentioned above, the error between the produced value and the actual value is calculated to optimize the network with the help of an error function. The commonly used error function is the quadratic function, which is displayed as follows:

E(t) = \frac{1}{2}\left(a_j(t) - y_j(t)\right)^2    (1)

where y_j(t) is the value produced by the neural network at time period t, and a_j(t) represents the actual value at time period t. Then, the connection weights are adjusted by the generalized delta learning function:

\Delta w_{ji}(t) = \sum_{s=1}^{\varepsilon} \eta (a_j - y_j) f'(\cdot) y_i + \mu \Delta w_{ji}(t-1)    (2)

where η is the learning rate, μ is the momentum value, ε is the epoch size, and f(.) is the activation function. Besides, (a_j − y_j) stands for the error between the actual value and the produced value.

The activation function of traditional BPNN is the hyperbolic tangent function, which could be defined as:

f(x) = \frac{2}{1 + e^{-2x}} - 1    (3)

The learning rate is a parameter that determines the efficiency and effectiveness of finding the best solution. The larger the value of the learning rate, the faster the learning process, but the process may jitter. However, if the value of the learning rate is relatively small, the search may become trapped in a local optimum.
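To make Eqs. (1)–(3) concrete, the following Python sketch (an illustration written for this text, not code from the paper) computes the quadratic error and a single generalized-delta update with momentum for one tanh neuron; the sum over an epoch in Eq. (2) is collapsed into a single-sample step, and the variable names mirror the symbols above.

import numpy as np

def tanh_activation(x):
    # Eq. (3): f(x) = 2 / (1 + exp(-2x)) - 1, i.e., the hyperbolic tangent
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0

def delta_rule_step(w, x, a, eta=0.1, mu=0.5, prev_dw=None):
    """One weight update for a single tanh neuron (illustrative only)."""
    if prev_dw is None:
        prev_dw = np.zeros_like(w)
    y = tanh_activation(np.dot(w, x))               # produced value y_j(t)
    error = a - y                                   # (a_j - y_j)
    E = 0.5 * error ** 2                            # quadratic error, Eq. (1)
    f_prime = 1.0 - y ** 2                          # derivative of the tanh activation
    dw = eta * error * f_prime * x + mu * prev_dw   # delta rule with momentum, Eq. (2)
    return w + dw, dw, E

# usage: one update on a toy input vector with target value a = 0.8
w, dw, E = delta_rule_step(np.array([0.1, -0.2]), np.array([0.5, 1.0]), a=0.8)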

Different from BPNN, the radial basis function neural network (RBFNN) uses nonlinear radial basis functions (RBF) as the activation function in the hidden layer, such as the Gaussian function:

f(x - \theta) = e^{-\frac{(x-\theta)^2}{\sigma^2}}    (4)

where θ represents the center (mean value) of the Gaussian distribution, and σ² stands for the variance. 'Spread' is a parameter that reflects the changing speed of the RBF. A larger value of spread means that the neurons are required to fit a fast-changing function, while a smaller spread indicates that the neurons are needed to fit a smooth function.

Similarly, for the wavelet neural network (WNN), the wavelet function embedded in the hidden layer is regarded as the activation function. This function could be described as follows:

f(x) = e^{-\frac{x^2}{2}} \cos(1.75x)    (5)

Two parameters are important in the WNN learning process: one adjusts the weights of the network, and the other accommodates the scale factor and displacement factor.
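The two hidden-layer activations in Eqs. (4) and (5) can be written directly in NumPy. The short sketch below is ours for illustration; the argument names theta and sigma stand for the center and spread discussed above and are not taken from the paper.

import numpy as np

def gaussian_rbf(x, theta=0.0, sigma=1.0):
    # Eq. (4): Gaussian radial basis function centered at theta with spread sigma
    return np.exp(-((x - theta) ** 2) / sigma ** 2)

def wavelet_activation(x):
    # Eq. (5): wavelet activation used in the WNN hidden layer
    return np.exp(-(x ** 2) / 2.0) * np.cos(1.75 * x)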

2.2 Support vector regression

Support vector regression (SVR) is an adaptation of the support vector machine (SVM), a statistical learning method for classification proposed by Vapnik. The basic idea of SVR is to map the data from the input space to a high-dimensional feature space and then use linear regression to solve the problem in the high-dimensional feature space.

Given a training set {(x_i, y_i)}, i = 1, 2, ..., n, where x_i defines the input data, y_i defines the corresponding output, and n is the total number of data instances, the regression function of SVR is defined as:

f(x) = (w \cdot \varphi(x)) + b    (6)

where w and b denote the weight vector and the bias constant, respectively, and φ(x) stands for the function mapping the data from the input space to the high-dimensional feature space.

In ε-SVR, the coefficients of the regression, which are w and b, are solved by minimizing the regularized risk function below:

R(C) = C \sum_{i=1}^{n} L_{\varepsilon}(f(x_i), y_i) + \frac{1}{2}\|w\|^2    (7)

In this function, the first part stands for the empirical risk and the second part stands for the regularized risk. The parameter C, which is the regularization constant, is utilized to strike the balance between the empirical risk and the regularized risk. In addition, L_ε(f(x), y) is the ε-insensitive loss function, defined as:

L_{\varepsilon}(f(x), y) = \begin{cases} 0 & \text{if } |y - f(x)| \le \varepsilon \\ |y - f(x)| - \varepsilon & \text{otherwise} \end{cases}    (8)

where ε defines the size of the tube or, in other words, the maximum error allowed in the regression.

By introducing slack variables ξ, the problem can be transformed into an optimization problem as below:

Minimize  C \sum_{i=1}^{n} (\xi_i + \xi_i^*) + \frac{1}{2}\|w\|^2
s.t.  y_i - (w \cdot \varphi(x_i) + b) \le \varepsilon + \xi_i,  (w \cdot \varphi(x_i) + b) - y_i \le \varepsilon + \xi_i^*,  \xi_i, \xi_i^*, \varepsilon \ge 0,  i = 1, 2, \ldots, n    (9)

Because the selection of ε in the ε-insensitive loss function of ε-SVR is difficult, ν-SVR is designed to overcome this problem by introducing another parameter ν ∈ (0, 1] for controlling the number of support vectors. In ν-SVR, the optimization problem obtained by introducing the slack variables ξ becomes:

Minimize  C\left(\nu\varepsilon + \frac{1}{n}\sum_{i=1}^{n} (\xi_i + \xi_i^*)\right) + \frac{1}{2}\|w\|^2
s.t.  y_i - (w \cdot \varphi(x_i) + b) \le \varepsilon + \xi_i,  (w \cdot \varphi(x_i) + b) - y_i \le \varepsilon + \xi_i^*,  \xi_i, \xi_i^*, \varepsilon \ge 0,  i = 1, 2, \ldots, n    (10)

Both Eqs. (9) and (10) can be solved through their dual problems, which are finally obtained by introducing Lagrange multipliers and utilizing the optimality constraints:

f(x, \lambda_i, \lambda_i^*) = \sum_{i=1}^{N_{sv}} (\lambda_i - \lambda_i^*) K(x, x_i) + b    (11)

where N_sv is the number of support vectors, K(x, x_i) = φ(x)·φ(x_i) is the kernel function, and λ_i and λ_i^* are the Lagrange multipliers.

It is important that the value of the kernel function equals the inner product of two vectors in the feature space, that is, K(x_i, x_j) = φ(x_i)^T φ(x_j). SVR solves the problem in the high-dimensional feature space, and the use of the kernel function simplifies the problem because φ(x) does not need to be computed explicitly. In addition, four commonly used kernel functions are listed below.

Linear kernel:

K(x_i, x_j) = x_i^T x_j    (12)

Polynomial kernel with parameters γ, d, r:

K(x_i, x_j) = (\gamma x_i^T x_j + r)^d,  \gamma > 0    (13)

Radial basis function (RBF) kernel with parameter γ:

K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2),  \gamma > 0    (14)

Sigmoid kernel with parameters γ, r:

K(x_i, x_j) = \tanh(\gamma x_i^T x_j + r)    (15)
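In practice, ε-SVR and ν-SVR with these four kernels can be fitted with an off-the-shelf library. The sketch below uses scikit-learn's SVR and NuSVR as stand-ins for the formulations above; the toy data and parameter values are ours for illustration and are not the paper's settings.

import numpy as np
from sklearn.svm import SVR, NuSVR

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                                   # toy feature matrix (e.g., query volumes)
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)   # toy regression target

kernels = ["linear", "poly", "rbf", "sigmoid"]                  # Eqs. (12)-(15)
for kernel in kernels:
    eps_svr = SVR(kernel=kernel, C=1.0, epsilon=0.1, gamma="scale")   # eps-SVR, Eq. (9)
    nu_svr = NuSVR(kernel=kernel, C=1.0, nu=0.5, gamma="scale")       # nu-SVR, Eq. (10)
    eps_svr.fit(X, y)
    nu_svr.fit(X, y)
    print(kernel, eps_svr.score(X, y), nu_svr.score(X, y))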
3 Data mining for unemployment rate prediction

3.1 Overview

Data mining techniques together with Web information, such as neural networks (NNs) and support vector regressions (SVRs), have been successfully applied to many research topics [18,23]. However, there are seldom data mining–based methods or systems that analyze the unemployment trend using Web information. So, this paper proposes a data mining methodology for unemployment rate prediction using search engine query data, which is one important type of Web information. The framework of our proposed methodology is illustrated in Fig. 2.

Fig. 2 The framework of the unemployment rate prediction (search engine query data from the "Local/Jobs" and "Society/Social services/Welfare & unemployment" categories and the unemployment rate data are split into training and testing sets, passed through feature selection and data mining tools, and evaluated iteratively until a model is accepted for the unemployment rate prediction)

As can be seen from Fig. 2, the main process of the proposed framework can be decomposed into the following four steps.

Step 1: Data collection. Both the search engine query data and the unemployment data are collected to help build the model. As suggested in [4], two types of query data, "Local/Jobs" and "Society/Social Services/Welfare & Unemployment", are supposed to be related to unemployment queries. The weekly counts for the query data are available from 2004 onwards at Google Insights for Search (http://www.google.com/insights/#), and the unemployment data are available at the US Department of Labor (http://www.ows.doleta.gov/unemploy/claims.asp).

Step 2: Feature selection. Some of the query data collected in the first step have low correlation with the prediction target. To exclude these outliers and improve the performance of the model, the Pearson function is applied to calculate the correlation coefficient between each feature and the predicted target [9]. Through correlating the search engine query data and the unemployment data, the top 100 features (see "Appendix") with the highest correlation values are chosen as the original feature set (an illustrative sketch of this step is given after the step list).

Step 3: Modeling. Different data mining tools are tested to measure the fitness between the search engine query data and the unemployment rate data. The details are described in Sect. 3.2.


Step 4: Prediction. The designed models are taken through an iterative validation process using various evaluation methods, such as the cross-validation method with different evaluation criteria, until the model with the best performance is selected. The selective predictor with the best feature subset and the optimal parameters is then used to forecast the unemployment trend.
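As referenced in Step 2, the following Python sketch illustrates the data collection and Pearson-based feature selection steps. The CSV file names and column layout are hypothetical placeholders of our own, since the paper does not specify the export format of the query and claims data.

import pandas as pd
from scipy.stats import pearsonr

# hypothetical CSV exports: weekly query counts (one column per keyword) and weekly UIC values
queries = pd.read_csv("query_counts_weekly.csv", index_col="week")           # placeholder file name
uic = pd.read_csv("initial_claims_weekly.csv", index_col="week")["claims"]   # placeholder file name

# align the two weekly series on the same dates before correlating
uic_aligned = uic.reindex(queries.index)

# Pearson correlation between each keyword series and the prediction target (Step 2)
correlations = {
    keyword: pearsonr(queries[keyword].values, uic_aligned.values)[0]
    for keyword in queries.columns
}

# keep the 100 keywords with the highest correlation values as the original feature set
top_100 = sorted(correlations, key=correlations.get, reverse=True)[:100]
selected_queries = queries[top_100]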
Crossover

3.2 The modeling process


Crossover Training set Testing set

In this subsection, different data mining tools including NNs


and SVRs are used to model the relationship of search engine Fitness Function Data mining models
RMSE
query data and the unemployment rate data. To improve the
performance of models, genetic algorithm (GA), which imi-
No Maximal
tates the biological reproduction, is employed to optimize Generation ?
the model’s parameters and features generated in feature
selection phase. The genetic representation of parameters and Yes

features is shown in Fig. 3, and the GA-based data mining The Selective Data Mining Models
methods are summarized in Fig. 4. with Proper Feature s and Parameters
As can be seen from Fig. 4, a population consists of a
group of chromosomes and it is generated randomly in the Unemployment Rate Prediction
first generation according to the number and size of chromo-
somes. During the selection process, the fitness value of each Fig. 4 The GA-based data mining method
chromosome is calculated through fitness function, which
is served as an evaluation indicator to determine whether vector regression models, ε-SVR and v-SVR are imple-
this chromosome could appear in next generation: The chro- mented with four different kernel functions: linear, poly-
mosome with low fitness value is dropped out, and a new nomial, RBF, and sigmoid kernel. In the process of fitness
chromosome is added automatically. From the second gen- function construction, a five-fold cross-validation, in which
eration, the crossover and mutation may happen to some the data are divided into five folds evenly, is carried out,
chromosomes in accordance with some possibilities. The and each time, four folds are trained by neural networks or
crossover means that two chromosomes exchange their genes support vector regressions, while the other fold is used as
from a fixed point and develop into two new chromosomes, testing set and is used to validate the performance of data
while mutation indicates a sudden change in genes on a chro- mining models; furthermore, the average RMSE is calcu-
mosome. Then, the fitness function is applied again. This lated through this fivefold cross-validation, and 1/RMSE is
iteration may not stop until the maximum generation of evo- chosen as the value of fitness function.
lution. In this experiment, the maximum generation of evo-
lution is set at 100, and the initial size of population is set
at 60, which means 60 possible feature groups are selected 4 Empirical analysis
randomly at first.
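A minimal sketch of the chromosome layout in Fig. 3 is given below; it assumes, purely as an illustration and not as a specification from the paper, that parameter genes are stored as floats and feature genes as a 0/1 mask.

import numpy as np

def decode_chromosome(chromosome, n_params):
    """Split a GA chromosome into model parameters and a feature mask (illustrative)."""
    params = np.asarray(chromosome[:n_params], dtype=float)        # P1 ... Pm
    feature_mask = np.asarray(chromosome[n_params:], dtype=bool)   # F1 ... Fn, 1 = keep feature
    return params, feature_mask

# example: two SVR parameters (e.g., C and gamma) followed by a mask over 100 candidate features
rng = np.random.default_rng(1)
chromosome = np.concatenate([rng.uniform(0.01, 10.0, size=2),
                             rng.integers(0, 2, size=100).astype(float)])
params, mask = decode_chromosome(chromosome, n_params=2)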
The fitness function is calculated from the performance of the neural networks and the support vector regressions separately. In the neural network models, three different neural networks are implemented to train and test the selected features and parameter(s), namely BPNN, RBFNN, and WNN. In the support vector regression models, ε-SVR and ν-SVR are implemented with four different kernel functions: linear, polynomial, RBF, and sigmoid. In the process of fitness function construction, a five-fold cross-validation, in which the data are divided evenly into five folds, is carried out; each time, four folds are used to train the neural networks or support vector regressions, while the remaining fold is used as the testing set to validate the performance of the data mining models. Furthermore, the average RMSE is calculated through this five-fold cross-validation, and 1/RMSE is chosen as the value of the fitness function.
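The fitness computation described above can be sketched as follows. This is our illustration using scikit-learn's KFold and NuSVR (any of the other models could be substituted for the estimator), with 1/RMSE returned as the fitness value of a candidate chromosome.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import NuSVR

def ga_fitness(X, y, feature_mask, nu=0.5, C=1.0, gamma="scale", n_splits=5):
    """Five-fold cross-validated fitness = 1 / average RMSE for one candidate chromosome."""
    X_sub = X[:, feature_mask]                      # keep only the features switched on by the GA
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    rmses = []
    for train_idx, test_idx in kf.split(X_sub):
        model = NuSVR(kernel="rbf", nu=nu, C=C, gamma=gamma)
        model.fit(X_sub[train_idx], y[train_idx])
        pred = model.predict(X_sub[test_idx])
        rmses.append(np.sqrt(np.mean((y[test_idx] - pred) ** 2)))
    return 1.0 / np.mean(rmses)                     # higher fitness means lower average RMSE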
4 Empirical analysis

4.1 Data description and evaluation criteria

The US government only releases a monthly report of the unemployment rate to the public. In order to improve the prediction performance, instead of forecasting the unemployment rate itself, the Unemployment Initial Claims (UIC) series is used in our experiments. UIC is a leading indicator of the US labor market used to estimate the unemployment rate, and it is a weekly report issued by the US Department of Labor. Thus, the weekly initial claims data are collected from the Web site of the US Department of Labor.

On the other hand, as proposed in [4], two types of query data, "Local/Jobs" and "Society/Social Services/Welfare & Unemployment", are supposed to be related to the unemployment queries.

More specifically, different states in the US like "Washington unemployment", different types of jobs like "police jobs", and their combinations like "engineer in NY" are included in "Local/Jobs". Moreover, "Society/Social Services/Welfare & Unemployment" covers the social reasons for unemployment and the social services for unemployment, such as "unemployment insurance", and so on. The Google keyword tool (https://adwords.google.com/) is utilized to collect the query data, and 500 key words are collected as the raw feature set based on the two types. The time series of weekly counts for these queries are available from January 2004 to March 2011 in Google Insights for Search, with values normalized between 0 and 100. The UIC data from January 2004 to March 2011 are available at the US Department of Labor (http://www.ows.doleta.gov/unemploy/claims.asp).

In addition, for comparison, the indicators of root-mean-square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE) are used to measure the prediction results. Given n pairs of actual values (A_i) and predicted values (P_i), the indicators can be calculated as follows:

\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (A_i - P_i)^2}    (16)

\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |A_i - P_i|    (17)

\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n} \frac{|A_i - P_i|}{A_i}    (18)
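The three criteria in Eqs. (16)–(18) translate directly into NumPy. The sketch below is ours for illustration; the MAPE here is returned as a fraction, so multiply by 100 if a percentage is wanted.

import numpy as np

def rmse(actual, pred):
    # Eq. (16)
    return np.sqrt(np.mean((actual - pred) ** 2))

def mae(actual, pred):
    # Eq. (17)
    return np.mean(np.abs(actual - pred))

def mape(actual, pred):
    # Eq. (18)
    return np.mean(np.abs(actual - pred) / actual)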

4.2 Details of models

As introduced in Sect. 2, the parameters of each data mining model are crucial for its performance. Therefore, in the experiment, for each NN or SVR with a different activation/kernel function, these parameters are optimized by the GA; they are displayed in Table 1.

Table 1 Parameters of the NNs and SVRs to be optimized

NN (hyperbolic tangent): learning rate
NN (RBF): spread
NN (wavelet): learning rate 1 (for adjusting the weights of the network) and learning rate 2 (for adjusting the scale factor and displacement factor)
ε-SVR (linear): ε in Eqs. (8)–(10); C in Eqs. (7), (9), (10); and e, the stopping tolerance for training
ε-SVR (polynomial): γ, d, and r in Eq. (13); ε in Eqs. (8)–(10); C in Eqs. (7), (9), (10); and e, the stopping tolerance for training
ε-SVR (RBF): γ in Eq. (14); ε in Eqs. (8)–(10); C in Eqs. (7), (9), (10); and e, the stopping tolerance for training
ε-SVR (sigmoid): γ and r in Eq. (15); ε in Eqs. (8)–(10); C in Eqs. (7), (9), (10); and e, the stopping tolerance for training
ν-SVR (linear): ν in Eq. (10); C in Eqs. (7), (9), (10); and e, the stopping tolerance for training
ν-SVR (polynomial): γ, d, and r in Eq. (13); ν in Eq. (10); C in Eqs. (7), (9), (10); and e, the stopping tolerance for training
ν-SVR (RBF): γ in Eq. (14); ν in Eq. (10); C in Eqs. (7), (9), (10); and e, the stopping tolerance for training
ν-SVR (sigmoid): γ and r in Eq. (15); ν in Eq. (10); C in Eqs. (7), (9), (10); and e, the stopping tolerance for training

4.3 Models comparison and selection

According to the experimental design aforementioned, detailed experiments with the different models are conducted. Tables 2, 3, and 4 report the performances of the GA-NN and GA-SVR models with different activation functions or kernels in terms of RMSE, MAE, and MAPE, respectively.

As can be seen from Table 2, in terms of RMSE, it is obvious that overall the GA-SVR models outperform the GA-NN models, except for the SVRs with the sigmoid kernel, which may reflect that the sigmoid kernel is not suited to this problem. In addition, the NN with the hyperbolic tangent activation (BPNN) greatly outperforms the NN with the RBF activation function (RBFNN) and the NN with the wavelet activation function (WNN), and WNN performs the worst, which may also reflect that the wavelet function is not a suitable activation function for this problem. Next, when the comparison is conducted within the SVRs, the ν-SVRs outperform the ε-SVRs when the kernel is identical. What is more, from the average point of view, the ν-SVR with the polynomial kernel performs the best, and the best single result also comes from the ν-SVR with the polynomial kernel in iteration 5.

Table 2 Performance results in terms of RMSE

Model  Activation/kernel function  Iteration 1  Iteration 2  Iteration 3  Iteration 4  Iteration 5  Average
NN  Hyperbolic tangent  77,957.73  79,619.08  76,909.18  88,358.48  79,610.98  80,491.09
NN  RBF  106,410.07  136,737.73  255,692.73  177,538.48  126,402.07  160,556.22
NN  Wavelet  164,882.24  144,489.20  218,489.45  180,275.99  196,644.55  180,956.29
ε-SVR  Linear  53,194.09  56,290.26  55,020.55  54,680.86  57,409.90  55,319.13
ε-SVR  Poly  55,193.32  55,788.99  53,336.77  59,073.74  56,444.20  55,967.40
ε-SVR  RBF  67,840.33  57,772.81  57,925.41  57,730.18  55,707.13  59,395.17
ε-SVR  Sigmoid  100,893.90  336,514.17  110,264.82  121,147.52  114,994.74  156,763.03
ν-SVR  Linear  53,691.49  51,854.42  54,903.67  55,957.82  51,358.71  53,553.22
ν-SVR  Poly  52,578.54  51,799.83  52,961.33  55,934.42  50,330.03  52,720.83
ν-SVR  RBF  54,326.95  56,733.23  50,505.49  51,385.02  52,649.24  53,119.98
ν-SVR  Sigmoid  119,182.46  102,275.04  111,150.78  112,942.50  99,708.12  109,051.78

As revealed in Table 3, in terms of MAE, similar results can be found. The GA-SVR models perform better than the GA-NN models, except for the SVRs with the sigmoid kernel. In addition, the ν-SVRs outperform the ε-SVRs when their kernels are the same. The best average performance is generated by the ν-SVR with the RBF kernel, which differs from the result in terms of RMSE. Moreover, the best single performance comes from the ν-SVR with the RBF kernel in iteration 3.

When the performance results are evaluated in terms of MAPE, which is reported in Table 4, the analyses are nearly exactly the same: (1) the SVRs perform better than the NNs in most circumstances, (2) the ν-SVRs outperform the ε-SVRs when the kernels are the same, (3) the best average result comes from the ν-SVR with the RBF kernel, and (4) the ν-SVR with the RBF kernel in iteration 3 yields the best performance.

Grounded on the similar results across the different performance evaluators, several implications can be concluded: (1) the SVRs perform better than the NNs in most circumstances, (2) the ν-SVRs outperform the ε-SVRs when the kernels are the same, (3) WNN and the SVR with the sigmoid kernel are not suitable for tackling this problem, because of their relatively poor performance compared with the others, and (4) the best average result comes from the ν-SVR with the RBF kernel, so the ν-SVR with the RBF kernel is best suited for this problem.

Table 3 Performance results in terms of MAE

Model  Activation/kernel function  Iteration 1  Iteration 2  Iteration 3  Iteration 4  Iteration 5  Average
NN  Hyperbolic tangent  58,010.78  60,887.08  56,901.86  66,254.51  56,909.93  59,792.83
NN  RBF  64,166.28  87,371.29  110,593.98  99,175.90  76,442.86  87,550.06
NN  Wavelet  145,274.03  124,797.69  200,542.05  161,490.04  175,488.29  161,518.42
ε-SVR  Linear  41,401.69  44,171.93  41,412.31  41,702.17  44,037.61  42,545.14
ε-SVR  Poly  42,718.27  41,770.96  41,214.89  44,187.66  44,018.46  42,782.05
ε-SVR  RBF  53,664.54  43,167.90  43,324.22  43,446.21  42,147.97  45,150.17
ε-SVR  Sigmoid  79,626.90  205,544.91  93,198.05  94,959.17  93,480.45  113,361.90
ν-SVR  Linear  38,918.37  39,442.63  40,536.70  41,618.78  38,218.97  39,747.09
ν-SVR  Poly  38,353.66  39,649.21  39,951.25  40,081.89  37,886.79  39,184.56
ν-SVR  RBF  38,638.26  39,689.48  36,305.30  36,687.78  37,753.21  37,814.81
ν-SVR  Sigmoid  93,814.35  88,568.73  78,217.22  84,028.40  77,205.05  84,366.75


Table 4 Performance results in terms of MAPE

Model  Activation/kernel function  Iteration 1  Iteration 2  Iteration 3  Iteration 4  Iteration 5  Average
NN  Hyperbolic tangent  14.82  15.84  14.30  16.95  14.24  15.23
NN  RBF  16.32  21.26  28.21  24.46  18.36  21.72
NN  Wavelet  43.78  35.20  60.92  49.14  53.58  48.53
ε-SVR  Linear  10.96  11.55  10.82  10.77  11.61  11.14
ε-SVR  Poly  11.34  11.14  11.00  11.31  11.96  11.35
ε-SVR  RBF  14.68  11.53  11.50  11.40  11.04  12.03
ε-SVR  Sigmoid  21.17  50.59  26.36  23.96  25.69  29.56
ν-SVR  Linear  9.91  10.22  10.18  10.72  9.76  10.16
ν-SVR  Poly  9.74  10.63  10.70  10.04  10.06  10.24
ν-SVR  RBF  9.89  10.10  9.18  9.29  9.46  9.58
ν-SVR  Sigmoid  23.86  24.95  17.84  20.63  19.17  21.29

4.4 Prediction and further discussion

According to the result analyses above, the ν-SVR model with the RBF kernel in iteration 3 is chosen as the model for the final prediction. The ν-SVR model with the polynomial kernel in iteration 5, which performs best in terms of RMSE, is not chosen because (1) in terms of MAE and MAPE, the ν-SVR model with the RBF kernel in iteration 3 performs better, and (2) even in terms of RMSE, the ν-SVR model with the RBF kernel in iteration 3 performs only slightly worse (50,505.49 versus 50,330.03). The details of the parameters of this model and the features selected are listed in Table 5, and the numbers with the corresponding key word features are displayed in the "Appendix".

Table 5 Details of the model selected

Parameters: γ = 0.12503, ν = 0.56357, C = 2.3622, e = 0.10823
Selected features: No. 5, 8, 12, 13, 16, 19, 22, 24, 25, 29, 30, 31, 32, 35, 36, 38, 39, 41, 44, 45, 50, 51, 52, 53, 59, 60, 61, 62, 67, 69, 70, 73, 75, 76, 77, 78, 80, 81, 82, 85, 87, 88, 89, 91, 93, 95, 97, 99, and 100
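With the values in Table 5, the final model can be instantiated roughly as below. This is our sketch, with scikit-learn's NuSVR standing in for the paper's ν-SVR and the stopping tolerance e mapped to the library's tol argument; it is not the authors' code.

from sklearn.svm import NuSVR

# parameters reported in Table 5 for the selected GA-optimized nu-SVR with RBF kernel
final_model = NuSVR(kernel="rbf", gamma=0.12503, nu=0.56357, C=2.3622, tol=0.10823)

# selected_features: the column indices listed in Table 5 (numbered as in the Appendix)
# X_train, y_train: query-volume features and weekly UIC targets prepared as in Sect. 4.1
# final_model.fit(X_train[:, selected_features], y_train)
# uic_forecast = final_model.predict(X_test[:, selected_features])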
When the selected model is applied to predict the real value of the unemployment rate, its performance is not as good as in the experiments aforementioned. This may be due to overfitting of the model in the training process. The prediction result of the selected model and the real unemployment rate are compared visually in Fig. 5, and it is reasonable to conclude that the predicted value generally follows the trend of the real unemployment rate, as shown in Fig. 5. The RMSE, MAE, and MAPE are 68,182.55, 54,241.10, and 12.54, respectively. The worse performance may be caused by the outliers that occurred between 2010-12-26 and 2011-01-22.

Fig. 5 Prediction result with real unemployment rate value

5 Conclusions

This paper presents a novel data mining framework for unemployment rate prediction using search engine query data. Under the framework, GA-based data mining methods are proposed to forecast the unemployment rate. In the proposed method, the proper feature subset and the optimal parameters are selected. In terms of the evaluation criteria, the empirical results show the efficiency and effectiveness of the proposed framework and also reveal that, among these data mining tools, the GA-based ν-SVR with the RBF kernel shows dominant advantages for unemployment rate prediction. This indicates that the proposed framework can be used as a potential alternative for analyzing the unemployment trend. Besides, the timely search engine query data could generate simultaneous prediction results, which could help government and scholars deal with the unemployment trend without delay.

In addition, this study also raises some research questions for further studies. Firstly, under our proposed framework, other data mining tools, such as ensemble methods, can be used to forecast the unemployment trend for a more stable solution.


Secondly, some other Web information, including Web content information and Web link information, can be used to improve the forecast performance. Thirdly, in this paper, the primary data set of search engine queries is relatively large, and thus an efficient feature group, which is small and reasonable, should be built to forecast the unemployment rate. Fourthly, an online unemployment analysis and forecast system (UAFS) can be developed to assist governments and organizations with early warning and decision support. Finally, the proposed methodology can also be applied to other research fields, especially to societal hot spots, such as the real estate market, the crude oil market, and the foreign exchange market.

Acknowledgments This research work was partly supported by the 973 Project (Grant No. 2012CB316205), the National Natural Science Foundation of China (Grant No. 71001103) and the Beijing Natural Science Foundation (No. 9122013).
Appendix: The top 100 search engine query data

No. Key words    No. Key words
1 filing unemployment    51 ohio unemployment rate
2 unemployment filing for    52 unemployment ny
3 unemployment office    53 unemployment compensation
4 file for unemployment    54 unemployment in az
5 unemployment file for    55 to apply for unemployment
6 unemployment state    56 unemployment insurance claim
7 state of unemployment    57 unemployment department of labor
8 insurance unemployment    58 department of labor unemployment
9 washington unemployment    59 labor department unemployment
10 unemployment file    60 unemployment check
11 unemployment insurance    61 unemployment for mn
12 unemployment apply online    62 unemployment in indiana
13 department of unemployment    63 unemployment in california
14 unemployment website    64 snag a job
15 unemployment application    65 unemployment grants
16 unemployment new york    66 unemployment in pennsylvania
17 washington state unemployment    67 unemployment benefit insurance
18 Wisconsin unemployment benefits    68 claim unemployment benefit
19 insurance for unemployment    69 part time unemployment
20 apply for unemployment    70 security jobs
21 unemployment claims    71 new york unemployment benefit
22 unemployment apply for    72 unemployment insurance benefit
23 apply for unemployment    73 unemployment dol
24 unemployment ca    74 unemployment info
25 unemployment services    75 unemployment commission
26 unemployment security    76 michigan unemployment benefits
27 unemployment    77 weekly unemployment insurance
28 to file unemployment    78 weekly unemployment benefits
29 unemployment benefits    79 nyc unemployment benefits
30 file for unemployment online    80 green jobs
31 ohio unemployment    81 how to claim unemployment benefits
32 unemployment file claims    82 unemployment rate
33 to file for unemployment    83 unemployment insurance benefits
34 unemployment benefits pa    84 unemployment weekly benefits
35 unemployment benefit application    85 online unemployment
36 nys dept labor    86 unemployment rate ny
37 state unemployment    87 jobs in usa
38 connecticut unemployment benefits    88 new york unemployment benefits
39 dept of unemployment    89 benefits for unemployment
40 nys dept of labor    90 police jobs
41 for unemployment benefits    91 dc unemployment
42 uimn.org    92 unemployment in kansas
43 unemployment in michigan    93 mass unemployment benefits
44 unemployment benefit claim    94 unemployment online
45 unemployment payment    95 unemployment in florida
46 unemployment in colorado    96 eligible for unemployment
47 apply for unemployment insurance    97 benefits of unemployment
48 unemployment benefits insurance    98 unemployment eligibility
49 application for unemployment    99 construction jobs
50 benefits unemployment insurance    100 unemployment rate recession

References

1. Askitas N, Zimmermann KF (2009) Google econometrics and unemployment forecasting. Appl Econom Q 55(2):107–120


2. Blasco N, Corredor P, Del Rio C, Santamaria R (2005) Bad news and Dow Jones make the Spanish stocks go round. Eur J Oper Res 163(1):253–275
3. Chen CI (2008) Application of the novel nonlinear grey Bernoulli model for forecasting unemployment rate. Chaos Solitons Fractals 37(1):278–287
4. Choi H, Varian H (2009) Predicting initial claims for unemployment benefits. Google technical report
5. Choi H, Varian H (2009) Predicting the present with Google trends. Google technical report
6. D'Amuri F (2009) Predicting unemployment in short samples with internet job search query data. MPRA paper no. 18403:1–17
7. D'Amuri F, Marcucci J (2009) Google it! Forecasting the US unemployment rate with a Google job search index. MPRA paper no. 18248:1–52
8. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS (2009) Detecting influenza epidemics using search engine query data. Nature 457(19):1012–1014
9. Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182
10. Harvill JL, Ray BK (2005) A note on multi-step forecasting with functional coefficient autoregressive models. Int J Forecast 21(4):717–727
11. Keilis-Borok VI, Soloviev AA, Allegre CB, Sobolevskii AN (2005) Patterns of macroeconomic indicators preceding the unemployment rise in Western Europe and the USA. Pattern Recogn 38(3):423–435
12. Krolzig HM, Marcellino M (2002) A Markov-switching vector equilibrium correction model of the UK labour market. Empir Econ 27:233–254
13. Lahiani A, Scaillet O (2009) Testing for threshold effect in ARFIMA models: application to US unemployment rate data. Int J Forecast 25(2):418–428
14. Lan KC, Ho KS, Luk RWP, Yeung DS (2005) FNDS: a dialogue-based system for accessing digested financial news. J Syst Softw 78(2):180–193
15. Milas C, Rothman P (2008) Out-of-sample forecasting of unemployment rates with pooled STVECM forecasts. Int J Forecast 24(1):101–121
16. Proietti T (2003) Forecasting the US unemployment rate. Comput Stat Data Anal 42(3):451–476
17. Schanne N, Wapler R (2010) Regional unemployment forecasts with spatial interdependencies. Int J Forecast 26(4):908–926
18. Schumaker RP, Chen H (2009) A quantitative stock prediction system based on financial news. Inform Process Manag 45(5):571–583
19. Suhoy T (2009) Query indices and a 2008 downturn: Israeli data. Bank of Israel discussion paper
20. Tashman LJ (2000) Out-of-sample tests of forecasting accuracy: an analysis and review. Int J Forecast 16(4):437–450
21. Terui N, van Dijk HK (2002) Combined forecasts from linear and nonlinear time series models. Int J Forecast 18(3):421–438
22. Vijverberg CPC (2009) A time deformation model and its time-varying autocorrelation: an application to US unemployment data. Int J Forecast 25(1):128–145
23. Xu W, Han ZW, Ma J (2010) A neural network based approach to detect influenza epidemics using search engine query data. In: Proceedings of the ninth international conference on machine learning and cybernetics, Qingdao, China, pp 1408–1412
24. Xu W, Zheng T, Li Z (2011) A neural network based forecasting method for the unemployment rate prediction using the search engine query data. In: Proceedings of the eighth IEEE international conference on e-business engineering, Beijing, China, pp 9–15
25. Xu W, Li Z, Chen Q (2012) Forecasting the unemployment rate by neural networks using search engine query data. In: Proceedings of the 45th Hawaii international conference on system sciences, Hawaii, US, pp 3591–3599
