
Expert Systems with Applications 36 (2009) 5445–5449


Customer churn prediction using improved balanced random forests
Yaya Xie a, Xiu Li a,*, E.W.T. Ngai b, Weiyun Ying c
a Department of Automation, Tsinghua University, Beijing, PR China
b Department of Management and Marketing, The Hong Kong Polytechnic University, Hong Kong, PR China
c School of Management, Xi’an Jiaotong University, Xi’an, PR China

Abstract

Churn prediction is becoming a major focus of banks in China who wish to retain customers by satisfying their needs under resource constraints. In churn prediction, an important yet challenging problem is the imbalance in the data distribution. In this paper, we propose a novel learning method, called improved balanced random forests (IBRF), and demonstrate its application to churn prediction. We investigate the effectiveness of the standard random forests approach in predicting customer churn, while also integrating sampling techniques and cost-sensitive learning into the approach to achieve a better performance than most existing algorithms. The nature of IBRF is that the best features are iteratively learned by altering the class distribution and by putting higher penalties on misclassification of the minority class. We apply the method to a real bank customer churn data set. It is found to improve prediction accuracy significantly compared with other algorithms, such as artificial neural networks, decision trees, and class-weighted core support vector machines (CWC-SVM). Moreover, IBRF also produces better prediction results than other random forests algorithms such as balanced random forests and weighted random forests.

Keywords: Churn prediction; Random forests; Imbalanced data

© 2008 Elsevier Ltd. All rights reserved.

1. Introduction

Customer churn, which is defined as the propensity of customers to cease doing business with a company in a given time period, has become a significant problem and is one of the prime challenges many companies worldwide are having to face (Chandar, Laha, & Krishna, 2006).

In order to survive in an increasingly competitive marketplace, many companies are turning to data mining techniques for churn analysis. A number of studies using various algorithms, such as sequential patterns (Chiang, Wang, Lee, & Lin, 2003), genetic modeling (Eiben, Koudijs, & Slisser, 1998), classification trees (Lemmens & Croux, 2003), neural networks (Mozer, Wolniewicz, Grimes, Johnson, & Kaushansky, 2000), and SVM (Zhao, Li, Li, Liu, & Ren, 2005), have been conducted to explore customer churn and to demonstrate the potential of data mining through experiments and case studies.

However, the existing algorithms for churn analysis still have some limitations because of the specific nature of the churn prediction problem. This has three major characteristics: (1) the data is usually imbalanced; that is, the number of churn customers constitutes only a very small minority of the data (usually 2% of the total samples) (Zhao et al., 2005); (2) large learning applications will inevitably have some type of noise in the data (Shah, 1996); and (3) the task of predicting churn requires the ranking of subscribers according to their likelihood to churn (Au, Chan, & Yao, 2003).

Several approaches have been proposed to address this problem. Decision-tree-based algorithms can be extended to determine the ranking, but it is possible that some leaves in a decision tree have similar class probabilities, and the approach is vulnerable to noise. The neural network algorithm does not explicitly express the uncovered patterns in a symbolic, easily understandable way. Genetic algorithms can produce accurate predictive models, but they cannot determine the likelihood associated with their predictions. These problems prevent the above techniques from being applicable to the churn prediction problem (Au et al., 2003). Some other methods, such as the Bayesian multi-net classifier (Luo & Mu, 2004), SVM, sequential patterns, and survival analysis (Lariviere & Van den Poel, 2004), have made good attempts to predict churn, but the error rates are still unsatisfactory.

In response to these limitations of existing algorithms, we present an improved balanced random forests (IBRF) method in this study. To the best of our knowledge, only a few implementations of random forests (Breiman, 2001) in a customer churn environment have been published (Buckinx & Van den Poel, 2005; Burez & Van den Poel, 2007; Coussement & Van den Poel, 2008; Lariviere & Van den Poel, 2005). Our study contributes to the existing literature not only by investigating the effectiveness of the random forests approach in predicting customer churn but also by integrating sampling techniques and cost-sensitive learning into random forests to achieve better performance than existing algorithms.

* Corresponding author. Tel.: +86 10 62771152.
E-mail address: lixiu@tsinghua.edu.cn (X. Li).

doi:10.1016/j.eswa.2008.06.121

The remainder of this paper is structured as follows. In Section 2, we present the methodological underpinnings of the technique and the evaluation criteria we use to analyze the performance of the method. The dataset preparation and the various experiments using IBRF and their results are presented in Sections 3 and 4. Some concluding remarks and ideas for future work are given in Section 5.

2. Methodology

The proposed method incorporates both sampling techniques and cost-sensitive learning, which are two common approaches to tackle the problem of imbalanced data. In this study, we use the IBRF technique to predict customers’ churn behavior. To test our proposed method, we apply it to predict churn in the banking industry. Banks are thought to be appropriate because extensive customer behavior data are available which enable the prediction of future customer behavior; in addition, the data can be easily collected. Although our study is limited to a specific bank, the method can be applied to many other service industries as well as to various engineering applications. In this section, we explain the methodological underpinnings of random forests.

2.1. Random forests and their extensions

The random forests method, introduced by Breiman (2001), adds an additional layer of randomness to bootstrap aggregating ("bagging") and is found to perform very well compared to many other classifiers. It is robust against overfitting and very user-friendly (Liaw & Wiener, 2002). Because each tree depends on the values of an independently sampled random vector, standard random forests are a combination of tree predictors with the same distribution for all trees in the forest (Breiman, 2001). The strategy of random forests is to select randomly subsets of descriptors to grow trees, each tree being grown on a bootstrap sample of the training set. At each node, rather than searching through all descriptors for the optimal split, only mtry randomly sampled descriptors are considered and the best split among them is chosen. Here mtry is the number of input descriptors used to split on at each node; it is much smaller than the total number of descriptors available for analysis (Lariviere & Van den Poel, 2004). However, the standard random forests do not work well on datasets where the data is extremely unbalanced, such as the data set of customer churn prediction.

Chen et al. (2004) proposed two ways to handle the imbalanced data classification problem of random forests: balanced random forests and weighted random forests. The first is based on a sampling technique; the second is based on cost-sensitive learning. Balanced random forests artificially make class priors equal by over-sampling the minority class in learning extremely imbalanced data: balanced random forests draw the same number of samples from both the majority and minority class so that the classes are represented equally in each tree. Weighted random forests assign a weight to each class, and the minority class is given a larger weight; thus, they penalize misclassifications of the minority class more heavily. In summary, these two approaches alter the class distribution and put heavier penalties on misclassification of the minority class. Both methods improve the prediction accuracy of the minority class and perform better than the existing algorithms (Chen et al., 2004). However, according to Chen et al.’s study (2004), there is no clear winner between weighted random forests and balanced random forests.

2.2. Improved balanced random forests

We propose improved balanced random forests by combining balanced random forests and weighted random forests. On one hand, the sampling technique which is employed in balanced random forests is computationally more efficient with large imbalanced data and more noise tolerant. By contrast, weighted random forests are computationally less efficient with large imbalanced data, since they need to use the entire training set; in addition, assigning a weight to the minority class may make the method more vulnerable to noise (mislabeled class) (Chen et al., 2004). On the other hand, changing the balance of negative and positive training samples has little effect on the classifiers produced by decision-tree learning methods, whereas the cost-sensitive learning used in weighted random forests has more effect on the classifiers produced by decision-tree learning methods (Elkan, 2001). Combining the two allows ineffective and unstable weak classifiers to learn based on both an appropriate discriminant measure and a more balanced dataset, which results in a higher noise tolerance. It therefore can achieve a more precise prediction.

To combine these two methods, we introduce two "interval variables" m and d, where m is the middle point of an interval and d is the length of the interval. A distribution variable a is randomly generated between m − d/2 and m + d/2; it directly determines the distribution of samples from different classes for one iteration. The interval variables determine the distribution of samples in different iterations so as to maintain the randomicity of the sample selection. The main reason for introducing these variables is to maintain the random distribution of different classes for each iteration.

The algorithm takes as input a training set D = {(X1, Y1), ..., (Xn, Yn)}, where Xi, i = 1, ..., n, is a vector of descriptors and Yi is the corresponding class label. The training set is then split into two subsets D+ and D−, the first of which consists of all positive training samples and the second of all negative samples. Let h_t: X → R denote a weak hypothesis. The steps of the IBRF algorithm are:

- Input: training examples {(X1, Y1), ..., (Xn, Yn)}, the number of trees to grow ntree, and the interval variables m and d.
- For t = 1, ..., ntree:
  1. Randomly generate a variable a within the interval between m − d/2 and m + d/2.
  2. Randomly draw na samples with replacement from the negative training dataset D− and n(1 − a) samples with replacement from the positive training dataset D+.
  3. Assign w1 to the negative class and w2 to the positive class, where w1 = 1 − a and w2 = a.
  4. Grow an unpruned classification tree with the following modification: at each node, rather than searching through all descriptors for the optimal split, randomly sample only mtry of the descriptors and choose the best split among them.
- Order all test samples by the negative score of each sample.
- Output the final ranking.

The negative score of each sample can be considered to be the total number of trees which predict the sample to be negative: the more trees predict the sample to be negative, the higher the negative score the sample gets. Thus, the normalized vote for class j at X_i equals

\frac{\sum_k I(h_k(X_i) = j)\, w_k}{\sum_k w_k}    (1)

The error of this method is

\varepsilon = \Pr_{i \sim \mathrm{tree}}\left[ h_t(X_i) \neq Y_i \right]    (2)
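To make the procedure above concrete, the following is a minimal Python sketch of the IBRF training loop and the negative-score ranking; it is not the authors' original implementation. It uses scikit-learn's DecisionTreeClassifier as the unpruned base learner (its max_features option plays the role of mtry), and the function names, the +1/−1 label coding, and the clipping of a away from 0 and 1 are our own illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def ibrf_fit(X, y, ntree=50, m=0.5, d=0.5, mtry="sqrt", random_state=0):
    """Train an improved balanced random forest on X (numpy array of descriptors)
    and y (labels coded +1 = churner / positive class, -1 = non-churner / negative class)."""
    rng = np.random.RandomState(random_state)
    pos_idx = np.where(y == 1)[0]    # D+ : all positive (churn) training samples
    neg_idx = np.where(y == -1)[0]   # D- : all negative training samples
    n = len(y)
    forest = []
    for _ in range(ntree):
        # Step 1: draw the distribution variable a uniformly from (m - d/2, m + d/2),
        # clipped away from 0 and 1 so that both classes keep at least one sample.
        a = float(np.clip(rng.uniform(m - d / 2.0, m + d / 2.0), 0.01, 0.99))
        # Step 2: draw n*a negatives and n*(1 - a) positives, both with replacement.
        boot = np.concatenate([
            rng.choice(neg_idx, max(1, int(round(n * a))), replace=True),
            rng.choice(pos_idx, max(1, int(round(n * (1.0 - a)))), replace=True),
        ])
        # Steps 3-4: grow an unpruned tree with class weights w1 = 1 - a (negative class)
        # and w2 = a (positive class); max_features draws a random descriptor subset per split.
        tree = DecisionTreeClassifier(max_features=mtry,
                                      class_weight={-1: 1.0 - a, 1: a},
                                      random_state=rng.randint(2 ** 31 - 1))
        tree.fit(X[boot], y[boot])
        forest.append(tree)
    return forest

def ibrf_negative_score(forest, X_test):
    """Negative score of each test sample = number of trees that vote 'non-churn' (-1)."""
    votes = np.array([tree.predict(X_test) for tree in forest])
    return (votes == -1).sum(axis=0)

def ibrf_rank(forest, X_test):
    """Return test-sample indices ordered from most to least churn-prone
    (fewest negative votes first, as in the final ranking step)."""
    return np.argsort(ibrf_negative_score(forest, X_test))
```

In this sketch the customers at the head of the returned ranking (the fewest "non-churn" votes) would be the ones a retention campaign contacts first.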

The training inputs of the ranking problem are samples from D+ and D− provided with the information that the negative samples should be ranked higher than positive ones. Let Si denote the testing sample; samples which are most prone to churn are ranked higher in output. The framework of IBRF is shown in Fig. 1.

Fig. 1. Framework of IBRF.

An important question is how to determine the values of the "interval variables" m and d. Variable a, which determines the balance of negative and positive training samples in each iteration, directly depends on m and d. The larger a is, the greater the number of negative samples in the training set, which means the larger the weight we assign to the negative class. In our experiment with IBRF, we set the initial value of m to be 0.1 and d to be 0.1, and then searched the values of m and d for the best performance. We first held the value of m and searched for the optimal value of d using a fixed step length of 0.1. Then we changed the value of m using a fixed step length of 0.1 and searched d again, continuing until d = 1 and m = 1. In this study, we find that the results are not sensitive to the values of m and d.

2.3. Evaluation criteria

To evaluate the performance of the proposed method, we use the lift curve and top-decile lift to quantify the accuracy of the predictive model. The lift curve is preferred for evaluating and comparing model performance in customer churn analysis. It is related to the ROC (relative operating characteristic) curve of signal detection theory and the precision-recall curve in the information retrieval literature (Mozer, Wolniewicz, & Grimes, 2000). For a given churn probability threshold, the lift curve plots the fraction of all subscribers above the threshold against the fraction of all churners above the threshold. The lift curve indicates the fraction of all churners who could be included if a certain fraction of all subscribers were to be contacted (Zhao et al., 2005). It demonstrates the model's power to beat the average performance or a random model.

Another measurement we use is top-decile lift. This is the percentage of the 10% of customers predicted to be most likely to churn who actually churned, divided by the baseline churn rate. The higher the lift, the more accurate the model is and, intuitively, the more profitable a targeted proactive churn management program will be (Neslin, Gupta, Kamakura, Lu, & Mason, 2004).
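As an illustration of the two evaluation criteria just described, the short sketch below computes the lift curve and the top-decile lift from a vector of churn scores. It is a generic implementation under the stated definitions, with hypothetical function names, not code taken from the paper.

```python
import numpy as np

def top_decile_lift(y_true, churn_score):
    """Top-decile lift: churn rate within the 10% of customers with the highest churn
    score, divided by the overall (baseline) churn rate.  y_true: 1 = churner, 0 = not."""
    y_true = np.asarray(y_true, dtype=float)
    order = np.argsort(-np.asarray(churn_score))   # most churn-prone customers first
    top = order[: max(1, len(order) // 10)]        # top decile of customers
    return y_true[top].mean() / y_true.mean()

def lift_curve(y_true, churn_score):
    """Lift-curve points: fraction of all churners captured (y) when the top x fraction
    of customers, ranked by churn score, is contacted."""
    y_true = np.asarray(y_true, dtype=float)
    order = np.argsort(-np.asarray(churn_score))
    captured = np.cumsum(y_true[order]) / y_true.sum()
    contacted = np.arange(1, len(y_true) + 1) / len(y_true)
    return contacted, captured
```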
3. Empirical study

To test our proposed method, we apply it to a real-world database. A major Chinese bank provided the database for this study, which included records of more than 20,000 customers. The data set, as extracted from the bank's data warehouse, consists of 20,000 customers described by 27 variables. A total of 1524 samples (762 examples for the training dataset and 762 examples for the testing dataset) are randomly selected from the dataset. There are a total of 73 potential churners in the selected samples. The distribution of the data used in training and testing is shown in Table 1.

3.1. Descriptors

We explore three major descriptor categories that encompass our input potential explanatory descriptors. The three categories are personal demographics, account level, and customer behavior. In more detail, they are identified as follows:

(1) Personal demographics is the geographic and population data of a given customer or, more generally, information about a group living in a particular area. In this study, the descriptors that we consider in the personal demographics category include age, education, marital status, employment type, size of disposable income, and number of dependants.
(2) Account level is the billing system, including contract charges, sales charges, and mortality and expense risk charges. The account level category includes account type, credit status, guarantee type, loan data and loan amount, length of maturity of loan, service grade, and the number of times the terms of an agreement have been broken.
(3) Customer behavior is any behavior related to a customer's bank account. In this study, the customer behavior category includes account status.

We remove descriptors that obviously have nothing to do with the prediction, such as identification card number. Descriptors with too many missing values (more than 30% missing) are also removed. Fifteen descriptors remain after these two operations.

Table 1
Distribution of the data used in the training and the test in the simulation

Training dataset: number of normal examples 724; number of churn examples 38
Testing dataset: number of normal examples 727; number of churn examples 35

4. Findings

We apply IBRF to a set of churn data in a bank as described above. We construct all variables in the same way for each dataset. As mentioned in Section 2, one would expect to have to determine the values of the "interval variables" m and d that give the best performance. However, the results turn out to be insensitive to the values of these two variables. Fig. 4 shows the results when we vary the value of m and set d = 0.1. Note that when d = 0.1 and m = 0, which means that almost all the training samples are selected from the positive training dataset, the performance of the proposed IBRF drops sharply. The same happens when d = 0.1 and m = 1, meaning that almost all the training samples are selected from the negative training dataset. Fig. 5 shows the results when we vary the value of d and set m = 0.5; the performance of IBRF drops with increasing d, especially when the value of d is close to 1.

To test the performance of our proposed method, we run several comparative experiments. Although many authors emphasize the need for a balanced training sample in order to differentiate reliably between churners and nonchurners (Dekimpe & Degraeve, 1997; Rust & Metters, 1996), we use a training dataset which contains a proportion of churners that is representative of the true population to approximate the predictive performance in a real-life situation.

A comparison of results from IBRF and other standard methods, namely artificial neural network (ANN), decision tree (DT), and CWC-SVM (Scholkopf, Platt, Shawe-Taylor, Smola, & Williamson, 1999), is shown in Table 2 and Fig. 2. One can observe a significantly better performance for IBRF in Table 2 and Fig. 2.

Table 2
Experimental results of different algorithms

Algorithm:      ANN     DT      CWC-SVM   IBRF
Accuracy rate:  78.1%   62.0%   87.2%     93.2%

To evaluate the performance of the novel approach further, we also compare our method with other random forests algorithms, namely balanced random forests and weighted random forests. Fig. 3 is the cumulative lift gain chart for identifying the customers who churned when ntree = 50. The chart shows that the top-decile lift captures about 88% of churners, while the top-four-decile lift captures 100% of churners. As Fig. 3 indicates, discriminability is clearly higher for IBRF than for the other algorithms. We conclude that IBRF achieves better performance than the other algorithms because the distribution of the different classes in the training dataset for each iteration is designed to be relatively balanced.

Fig. 2. Lift curve of different algorithms.
Fig. 3. Lift curve of different random forests algorithms.
Fig. 4. Experiment result when d equals 0.1.
Fig. 5. Experiment result when m equals 0.5.
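The kind of sweep over m and d used in these experiments can be illustrated with the sketch below. It assumes the ibrf_fit, ibrf_negative_score, and top_decile_lift helpers from the earlier sketches, and it is a simplified exhaustive grid scored by top-decile lift on a held-out split, not the authors' exact hold-one-and-search procedure.

```python
import numpy as np

def sweep_interval_variables(X_tr, y_tr, X_va, y_va, step=0.1, ntree=50):
    """Exhaustive sweep over the interval variables m and d with step length 0.1,
    scoring each setting by top-decile lift on a held-out validation split.
    Relies on ibrf_fit / ibrf_negative_score / top_decile_lift sketched earlier."""
    best_m, best_d, best_lift = None, None, -np.inf
    for m in np.arange(step, 1.0 + 1e-9, step):
        for d in np.arange(step, 1.0 + 1e-9, step):
            forest = ibrf_fit(X_tr, y_tr, ntree=ntree, m=m, d=d)
            neg_votes = ibrf_negative_score(forest, X_va)
            # fewer 'non-churn' votes means more churn-prone, so negate the score
            lift = top_decile_lift((y_va == 1).astype(int), -neg_votes)
            if lift > best_lift:
                best_m, best_d, best_lift = m, d, lift
    return best_m, best_d, best_lift
```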

5. Conclusion and future work

In this paper, we propose a novel method called IBRF to predict churn in the banking industry. IBRF has advantages in that it combines sampling techniques with cost-sensitive learning to both alter the class distribution and penalize more heavily the misclassification of the minority class. The best features are iteratively learned by artificially making class priors equal, based on which the best weak classifiers are derived. IBRF employs interval variables to determine the distribution of samples. Experimental results on bank databases have shown that our method produces higher accuracy than other random forests algorithms such as balanced random forests and weighted random forests. Moreover, the top-decile lift of IBRF is better than that of ANN, DT, and CWC-SVM. IBRF also offers great potential compared to traditional approaches due to its scalability, faster training and running speeds, and generalization ability.

Continuing research should aim at improving the effectiveness of this method. Although the results are found to be insensitive to the values of the interval variables, imposing some limitations on them in future experiments may enhance the predictive effectiveness of the method. Moreover, there is further research potential in the inquiry into the cost-effectiveness of this method, considering the large potential number of time-varying variables and the computation time and memory requirements. Experimenting with some other weak learners in random forests represents an interesting direction for further research; however, inappropriately chosen weak learners would not be cost-effective. In addition, churning is not restricted to the banking industry but is also of great concern to other industries, suggesting interesting directions for future research with IBRF.

Acknowledgements

This research was partly supported by the National Natural Science Foundation of China (NSFC, Project No. 70671059) and The Hong Kong Polytechnic University (Project No. G-YF02).

References

Au, W., Chan, C., & Yao, X. (2003). A novel evolutionary data mining algorithm with applications to churn prediction. IEEE Transactions on Evolutionary Computation, 7(6), 532–545.
Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
Buckinx, W., & Van den Poel, D. (2005). Customer base analysis: Partial defection of behaviorally-loyal clients in a non-contractual FMCG retail setting. European Journal of Operational Research, 164(1), 252–268.
Burez, J., & Van den Poel, D. (2007). CRM at a pay-TV company: Using analytical models to reduce customer attrition by targeted marketing for subscription services. Expert Systems with Applications, 32(2), 277–288.
Chandar, M., Laha, A., & Krishna, P. (2006). Modeling churn behavior of bank customers using predictive data mining techniques. In National conference on soft computing techniques for engineering applications (SCT-2006), March 24–26.
Chen, C., Liaw, A., & Breiman, L. (2004). Using random forests to learn imbalanced data. Technical Report 666, Statistics Department, University of California at Berkeley.
Chiang, D., Wang, Y., Lee, S., & Lin, C. (2003). Goal-oriented sequential pattern for network banking churn analysis. Expert Systems with Applications, 25(3), 293–302.
Coussement, K., & Van den Poel, D. (2008). Churn prediction in subscription services: An application of support vector machines while comparing two parameter-selection techniques. Expert Systems with Applications, 34(1), 313–327.
Dekimpe, M., & Degraeve, Z. (1997). The attrition of volunteers. European Journal of Operational Research, 98(1), 37–51.
Eiben, A. E., Koudijs, A. E., & Slisser, F. (1998). Genetic modelling of customer retention. Lecture Notes in Computer Science, 1391, 178–186.
Elkan, C. (2001). The foundations of cost-sensitive learning. In Proceedings of the 17th international joint conference on artificial intelligence (pp. 973–978).
Lariviere, B., & Van den Poel, D. (2004). Investigating the role of product features in preventing customer churn by using survival analysis and choice modeling: The case of financial services. Expert Systems with Applications, 27(2), 277–285.
Lariviere, B., & Van den Poel, D. (2005). Predicting customer retention and profitability by using random forests and regression forests techniques. Expert Systems with Applications, 29(2), 472–484.
Lemmens, A., & Croux, C. (2003). Bagging and boosting classification trees to predict churn. DTEW Research Report 0361.
Liaw, A., & Wiener, M. (2002). Classification and regression by random forest. R News: The Newsletter of the R Project, 2(3), 18–22.
Luo, B., & Mu, Z. (2004). Bayesian network classifier and its application in CRM. Computer Application, 24(3).
Mozer, M., Wolniewicz, R., & Grimes, D. B. (2000). Churn reduction in the wireless industry. Advances in Neural Information Processing Systems, 12, 935–941.
Mozer, M., Wolniewicz, R., Grimes, D. B., Johnson, E., & Kaushansky, H. (2000). Predicting subscriber dissatisfaction and improving retention in the wireless telecommunication industry. IEEE Transactions on Neural Networks, 11(3), 690–696.
Neslin, S., Gupta, S., Kamakura, W., Lu, J., & Mason, C. (2004). Defection detection: Improving predictive accuracy of customer churn models. Working Paper, Teradata Center at Duke University.
Rust, R., & Metters, R. (1996). Mathematical models of service. European Journal of Operational Research, 91(3), 427–439.
Scholkopf, B., Platt, J., Shawe-Taylor, J., Smola, A., & Williamson, R. (1999). Estimating the support of a high-dimensional distribution. Technical Report MSR-TR-99-87, Microsoft Research.
Shah (1996). Putting a quality edge to digital wireless networks. Cellular Business, 13, 82–90.
Zhao, Y., Li, B., Li, X., Liu, W., & Ren, S. (2005). Customer churn prediction using improved one-class support vector machine. Lecture Notes in Computer Science, 3584, 300–306.