Abstract—Large scale mobile data are generated continuously by multiple mobile devices in daily communications. Classification of such data is highly significant for analyzing the behaviors of mobile customers. However, the natural properties of mobile data present three non-trivial challenges: the large data scale makes it difficult to maintain both efficiency and accuracy; similar data increase the system load; and noise in the data set strongly influences the processing result. To resolve these problems, this paper integrates the conventional backpropagation neural network into the cloud computing environment. A MapReduce-based method called MBNN (MapReduce-based Backpropagation Neural Network) is proposed to perform classification on large-scale mobile data. It utilizes a diversity-based algorithm to decrease the computational load. Moreover, the Adaboosting mechanism is introduced to further improve classification performance. Extensive experiments on gigabytes of realistic mobile data were performed on a cloud computing platform, and the results show that MBNN is characterized by superior efficiency, good scalability and noise resistance.

Keywords-MapReduce; large-scale data; Backpropagation neural network; Adaboosting

I. INTRODUCTION

Classification of the mobile data generated by communication customers in the mobile community is very useful for mobile operators, because the results can be used to analyze customer behaviors. However, large scale mobile data are widely produced in the mobile community, especially with the recent increase of customers and applications. Thus, analyzing large-scale mobile data with scalable conventional data mining methods has become an important research issue for mobile operators as well as the database community.

Conventional data mining algorithms for classification (e.g. the backpropagation neural network) are not suitable for the cloud computing platform of the China Mobile Research Institute (CMRI), which is based on the MapReduce framework [1]. Moreover, when the data scale is large (usually gigabytes), such an algorithm can be very slow and sometimes cannot produce a result at all. This paper therefore studies a widely used conventional data mining algorithm, the backpropagation neural network, and proposes a MapReduce-based framework on a cloud computing platform to improve efficiency and scalability over large scale mobile data. This method is called MBNN (MapReduce-based Backpropagation Neural Network).

Google's MapReduce framework is well known as a parallel programming model for cloud computing. It is built on top of a distributed file system and supports the parallelization of data processing on large clusters. However, the question of how to design a neural network on the MapReduce framework has rarely been touched, especially over large scale mobile data.

Although MBNN can process large scale data, the efficiency of the method and the precision of the network model are not guaranteed when the data scale is very large. During our study, several natural properties of the data were observed that can affect the training process of a neural network. Accordingly, the diversity-based and Adaboosting methods are introduced in this paper to improve the efficiency of the method and the precision of classification with respect to these natural properties.

Based on our observation, the data have the following natural properties: large scale, similarity and noisiness. Firstly, the mobile data is large-scale because China Mobile has over one hundred million customers. The information related to these customers is also large scale, usually terabytes in size, and these customers generate data every day as long as they are active. Secondly, much of the data is similar and noisy, probably because many customers have similar behaviors. When such similar data participate in the classification procedure of the backpropagation neural network, they contribute little to the precision of classification and may disturb the training process.

Since the training sets of the neural network are enormous and noisy, the precision of the model is not high enough. Eventually, the precision of the model over large scale mobile data sets is usually about 50%, which meets the definition of weak learning [2] and poses a challenging problem for large scale mobile data. Hence, in this paper, we introduce the Adaboosting [3] method to improve the precision of the neural network.

† Hongyan Li is the corresponding author.
* This work is supported by the Natural Science Foundation of China (NSFC) under grant numbers 60973002 and 60673113.
4. Emit intermediate key/value pair <w, Δw>

The complexity of the whole backpropagation training method is O(anN/m) + O(anN/r).
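The emit step above is the core of the MapReduce formulation: each map task runs backpropagation over its own data shard and emits a <w, Δw> pair per weight, and the reduce phase sums the deltas for each weight before the driver applies them. A minimal sketch in plain Python (the single sigmoid unit, the shard split, and all function names are illustrative stand-ins, not the paper's actual implementation):

```python
import math
from collections import defaultdict

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def map_shard(weights, shard, lr=0.5):
    """Map task: run backprop over one data shard, emit <w, Δw> pairs."""
    pairs = []
    for x, target in shard:
        out = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
        grad = (target - out) * out * (1.0 - out)   # delta rule, sigmoid unit
        for i, xi in enumerate(x):
            pairs.append((i, lr * grad * xi))        # emit <weight-id, Δw>
    return pairs

def reduce_deltas(all_pairs):
    """Reduce task: sum every Δw emitted for the same weight id."""
    totals = defaultdict(float)
    for w_id, dw in all_pairs:
        totals[w_id] += dw
    return totals

# Toy run: learn OR (third input is a constant bias) with two "map tasks".
data = [((0.0, 0.0, 1.0), 0.0), ((0.0, 1.0, 1.0), 1.0),
        ((1.0, 0.0, 1.0), 1.0), ((1.0, 1.0, 1.0), 1.0)]
weights = [0.0, 0.0, 0.0]
for _ in range(5000):
    emitted = []
    for shard in (data[:2], data[2:]):               # two mappers
        emitted.extend(map_shard(weights, shard))
    for w_id, dw in reduce_deltas(emitted).items():  # one reducer
        weights[w_id] += dw
print([round(sigmoid(sum(w * xi for w, xi in zip(weights, x))))
       for x, _ in data])   # → [0, 1, 1, 1]
```

Summing per-key deltas in the reducer amounts to one batch gradient-descent step; in a real MapReduce job a combiner could pre-sum deltas on each mapper to cut shuffle traffic.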
The ith dimension of the feature vector of an item can be computed by the following formula (2):

The detailed discussion and procedure of the Adaboosting method for neural networks can be found in [3] and [14].
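The full Adaboosting procedure is given in [3] and [14]; the reweighting loop at its core can be sketched as follows, with 1-D decision stumps standing in as the weak learners in place of the paper's weak neural networks (the toy data and all names here are illustrative assumptions, not the paper's setup):

```python
import math

def train_stump(points, labels, sample_w):
    """Weak learner: pick the 1-D threshold/polarity with least weighted error."""
    best = None
    for thr in sorted(set(points)):
        for sign in (1, -1):
            preds = [sign if p >= thr else -sign for p in points]
            err = sum(w for w, p, y in zip(sample_w, preds, labels) if p != y)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best  # (weighted error, threshold, polarity)

def adaboost(points, labels, rounds=5):
    """AdaBoost with ±1 labels: reweight samples, accumulate weighted stumps."""
    n = len(points)
    sample_w = [1.0 / n] * n                 # start from uniform sample weights
    ensemble = []
    for _ in range(rounds):
        err, thr, sign = train_stump(points, labels, sample_w)
        err = max(err, 1e-10)                # guard against division by zero
        alpha = 0.5 * math.log((1.0 - err) / err)   # vote weight of this learner
        ensemble.append((alpha, thr, sign))
        # Boost the weight of misclassified samples, shrink the rest.
        preds = [sign if p >= thr else -sign for p in points]
        sample_w = [w * math.exp(-alpha * y * h)
                    for w, y, h in zip(sample_w, labels, preds)]
        z = sum(sample_w)
        sample_w = [w / z for w in sample_w]  # renormalize to a distribution
    return ensemble

def predict(ensemble, p):
    score = sum(a * (s if p >= t else -s) for a, t, s in ensemble)
    return 1 if score >= 0 else -1

# Toy 1-D set that no single stump separates:
pts, ys = [1, 2, 3, 4, 5, 6], [-1, -1, 1, -1, 1, 1]
model = adaboost(pts, ys)
print([predict(model, p) for p in pts])   # → [-1, -1, 1, -1, 1, 1]
```

The same loop applies when the weak learner is a weakly trained neural network, as in this paper: only `train_stump` would be replaced by training the network on the weighted sample.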
bit worse than linear speed-up. This proves that MBNN has excellent speed-up performance. Figure 2 shows that the time is approximately inversely proportional to the number of nodes, so the method has excellent scalability. This accords with the scalability analysis in Section 3.

C. Diversity-based sampling test

This test is operated on a cloud cluster of 64 nodes. TABLE Ⅰ gives the results of the diversity-based sampling method applied to training sets of different sizes. The original data scale is the scale of the training set; the data scale after sampling and the sampling rate are shown in the table. When the data scale is decreased, the time of one training loop is also decreased, so training time is saved.

Suppose the original data scale is OS and the data scale after sampling is SS. Then the sampling rate SR can be calculated by the following formula (3):

SR = SS / OS (3)

V. CONCLUSIONS

In this paper, a MapReduce-based backpropagation neural network (MBNN) over large scale mobile data was proposed in the context of a cloud computing platform. To improve the efficiency of the training process as well as its precision, the diversity-based and Adaboosting methods were introduced. Extensive experiments were conducted on gigabyte-class mobile data sets, and the experimental results demonstrated the efficiency, scalability and noise resistance of the proposed method.

In the near future, we plan to design and implement more data mining algorithms on the CMRI cloud computing platform in order to further support mobile operators.

ACKNOWLEDGMENT

We thank the China Mobile Research Institute (CMRI) for providing the cloud computing platform as well as the data sets for the experiments.