
A streaming parallel decision tree algorithm

Yael Ben-Haim and Elad Yom-Tov, IBM Haifa Research Lab, 165 Aba Hushi St., Haifa 31905, Israel, {yaelbh,yomtov}@il.ibm.com

ABSTRACT
A new algorithm for building decision tree classifiers is proposed. The algorithm is executed in a distributed environment and is especially designed for classifying large datasets and streaming data. It is empirically shown to be as accurate as standard decision tree classifiers, while being scalable to infinite streaming data and multiple processors.

1. INTRODUCTION

We propose a new algorithm for building decision tree classifiers for classifying both large datasets and (possibly infinite) streaming data. As recently noted [4], the challenge which distinguishes large-scale learning from small-scale learning is that training time is limited compared to the amount of available data. Thus, in our algorithm both training and testing are executed in a distributed environment. We refer to the new algorithm as the Streaming Parallel Decision Tree (SPDT).

Decision trees are simple yet effective classification algorithms. One of their main advantages is that they provide human-readable classification rules. Decision trees also have several drawbacks, especially when trained on large data, where the need to sort all numerical attributes becomes costly in both running time and memory. The sorting is needed in order to decide where to split a node. The various techniques for handling large data can be roughly grouped into two approaches: performing pre-sorting of the data (SLIQ [12] and its successors SPRINT [17] and ScalParC [11]), or replacing sorting with approximate representations of the data such as sampling and/or histogram building (e.g. BOAT [7], CLOUDS [1], and SPIES [10]). While pre-sorting techniques are more accurate, they cannot accommodate very large datasets or infinite streaming data.

Faced with the challenge of handling large data, a large body of work has been dedicated to parallel decision tree algorithms [17], [11], [13], [10], [19], [18], [8]. There are several ways to parallelize decision trees (described in detail in [2], [19], [13]): In horizontal parallelism, the data is partitioned such that different processors see different examples¹. In vertical parallelism, different processors see different attributes. Task parallelism involves distribution of the tree nodes among the processors. Finally, hybrid parallelism combines horizontal or vertical parallelism in the first stages of tree construction with task parallelism towards the end.

Like their serial counterparts, parallel decision trees overcome the sorting obstacle by applying pre-sorting, distributed sorting, and approximations. Following our interest in infinite streaming data, we focus on approximate algorithms. In streaming algorithms, the dominant approach is to read a limited batch of data and use each such batch to split tree nodes. We refer to the processing of each such batch as an iteration of the algorithm. The SPIES algorithm [10] is designed for streaming data, but requires holding each batch in memory because it may need several passes over each batch. pCLOUDS [18] relies on assumptions about the behavior of the impurity function, which are empirically justified but can be false for a particular dataset. We note that none of the experiments reported in previous works involved both a large number of examples and a large number of attributes.

2. ALGORITHM DESCRIPTION

Our proposed algorithm builds the decision tree in a breadth-first mode, using horizontal parallelism. At the core of our algorithm is an on-line method for building histograms from streaming data at the processors. These histograms are then used for making decisions on new tree nodes at the master processor. We empirically show that our proposed algorithm is as accurate as traditional, single-processor algorithms, while being scalable to infinite streaming data and multiple processors.
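The per-iteration flow just described (workers build local summaries against a shared view of the tree; a master merges the summaries and splits nodes) can be sketched in a single process as follows. This is an illustrative sketch, not the paper's implementation: the helper names `tree.route` and `tree.split_leaves` are hypothetical, and exact Counters stand in for the paper's fixed-size histograms so that the communication pattern stays visible.

```python
from collections import Counter

def train_iteration(workers_data, tree):
    """One SPDT-style iteration, simulated in a single process.

    Each 'worker' sees only its own shard, routes every example to a
    leaf of the shared tree, and builds a local per-(leaf, class,
    feature, value) summary; the master then merges the summaries and
    decides how to split the leaves.
    """
    local_summaries = []
    for shard in workers_data:            # runs on Nw processors in SPDT
        summary = Counter()
        for x, y in shard:
            leaf = tree.route(x)          # view of the tree built so far
            for f, v in enumerate(x):
                summary[(leaf, y, f, v)] += 1
        local_summaries.append(summary)   # communicated to the master
    merged = sum(local_summaries, Counter())   # master merges summaries
    tree.split_leaves(merged)                  # master chooses the splits
```

In the actual algorithm, each local summary is a set of fixed-size histograms (Section 2.2), so the communication cost per worker is bounded regardless of how much data the worker has seen.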

2.1 Tree growing algorithm

We construct a decision tree based on a set of training examples {(x1, y1), ..., (xn, yn)}, where x1, ..., xn ∈ R^d are the feature vectors and y1, ..., yn ∈ {1, ..., c} are the labels. Every internal node in the tree possesses two ordered child nodes and a decision rule of the form x(i) < a, where x(i) is the ith feature and a is a real number. Feature vectors that satisfy the decision rule are directed to the node's left child, and the other vectors are directed to the right child. Every example x thus has a path from the root to one of the leaves, denoted l(x). Every leaf has a label t, so that an example x is assigned the label t(l(x)). The label is accompanied by a real number that represents the confidence in the label's correctness².

¹ We refer to processing nodes as processors, to avoid confusion with tree nodes.

Initially, the tree consists of a single node. The tree is grown iteratively, such that in each iteration a new level of nodes is appended to the tree. We apply a distributed architecture that consists of Nw processors. Each processor observes 1/Nw of the data, but has a view of the complete classification tree built so far. At each iteration, each processor uses the data points it observes to build a histogram for each class, terminal node (leaf), and feature. Each data point is classified to the correct leaf of the current tree and is used to update the relevant histograms. Section 2.2 describes the histogram algorithms.

After observing a predefined number of data points (or, in the case of finite data, after seeing the complete data), the histograms are communicated to a master processor, which integrates them and decides how to split the nodes, using the chosen split criterion (see e.g. [5, 16]): For each bin location in the histogram of each dimension, the (approximate) number of points from each class to the left and to the right of this location is counted. This count is then used to compute the purity of the leaf's child nodes, if this leaf were split at the current dimension and location. The feature i and location a for which the child nodes' purities are maximized constitute the decision rule x(i) < a. The leaf becomes an internal node with the chosen decision rule, and two new nodes (its child nodes) are created.
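As an illustrative sketch of the split selection just described (not the paper's implementation), the following chooses the feature and threshold that minimize the weighted Gini impurity of the two child nodes, reading approximate left/right class counts off merged per-class histograms. The bin centroids serve as the candidate split locations; all names are hypothetical.

```python
def best_split(class_hists):
    """Pick (feature, threshold) minimizing the weighted child impurity.

    `class_hists[c][f]` is a list of (centroid, count) bins for class c
    and feature f - the merged per-class histograms the master receives.
    """
    def gini(counts):
        # Gini impurity of a node with the given per-class counts.
        n = sum(counts)
        return 1.0 - sum((k / n) ** 2 for k in counts) if n else 0.0

    classes = sorted(class_hists)
    n_features = len(class_hists[classes[0]])
    best = (None, None, float("inf"))  # (feature, threshold, score)
    for f in range(n_features):
        # Candidate locations: every bin centroid of every class.
        candidates = sorted({p for c in classes for p, _ in class_hists[c][f]})
        for a in candidates:
            # Approximate class counts to the left/right of location a.
            left = [sum(k for p, k in class_hists[c][f] if p < a) for c in classes]
            right = [sum(k for p, k in class_hists[c][f] if p >= a) for c in classes]
            nl, nr = sum(left), sum(right)
            if nl == 0 or nr == 0:
                continue
            score = (nl * gini(left) + nr * gini(right)) / (nl + nr)
            if score < best[2]:
                best = (f, a, score)
    return best[:2]
```

Because the histograms are approximate, the counts (and hence the chosen split) are approximations of what an exact, fully sorted pass would produce.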
If the node is already pure enough, the splitting is stopped and the node is assigned a label and a confidence level, both determined by the number of examples from each class that reached it. Decision trees are frequently pruned during or after training to obtain smaller trees and better generalization. We adapted the MDL-based pruning algorithm of [12]. This algorithm involves simple calculations during node splitting that reflect the node's purity. In a bottom-up pass over the complete tree, some subtrees are chosen for pruning, based on estimates of the expected error rate before and after pruning.

² Note that since the number of different confidence levels is upper-bounded by the number of leaves, the decision tree does not provide continuous-valued outputs.

2.2 On-line histogram building

A histogram is a set of r pairs (called bins) (p1, m1), ..., (pr, mr), where r is a preset constant integer, p1, ..., pr are real numbers, and m1, ..., mr are integers. The histogram is a compressed and approximate representation of a large set S of real numbers, so that |S| = m1 + ... + mr, and mi is the number of points in S in the surroundings of pi. The histogram data structure supports two procedures, named update and merge. The histogram building algorithm is a slight adaptation of the on-line clustering algorithm developed by Guedalia et al. [9], with the addition of a procedure for merging histograms.

The update procedure: Given a histogram (p1, m1), ..., (pr, mr), with p1 < ... < pr, and a point p, the update procedure adds p to the set S represented by the histogram. If p = pi for some i, then mi is incremented by 1. Otherwise:
1. Add the bin (p, 1) to the histogram, resulting in a histogram of r + 1 bins (q1, k1), ..., (qr+1, kr+1), with q1 < ... < qr+1.
2. Find an index i such that qi+1 - qi is minimal.
3. Replace the bins (qi, ki), (qi+1, ki+1) by the single bin ((qi ki + qi+1 ki+1) / (ki + ki+1), ki + ki+1).

The merge procedure: Given two histograms, the merge procedure creates a new histogram that represents the union S1 ∪ S2 of the sets S1, S2 represented by the input histograms. The algorithm is similar to the update algorithm: in the first step, the two histograms are pooled into a single histogram with many bins; in the second step, the two closest bins are merged into a single bin, and the process repeats until the histogram has r bins.

3. EMPIRICAL RESULTS

We compared the error rate of the SPDT algorithm with the error rate of a standard decision tree on seven medium-sized datasets taken from the UCI repository [3]: Adult, Isolet, Letter recognition, Nursery, Page blocks, Pen digits, and Spambase. The characteristics and error rates of all datasets are summarized in Table 1. Ten-fold cross-validation was applied where there was no natural train/test partition. We used an 8-CPU Power5 machine with 16GB of memory, running Linux. Our algorithm was implemented within the IBM Parallel Machine Learning toolbox [15], which runs using MPICH2.

The comparison shows that the approximations undertaken by the SPDT algorithm do not necessarily have a detrimental effect on its error rate. The F_F statistic combined with Holm's procedure (see [6]), at a 95% confidence level, shows that all classifiers except SPDT with eight processors exhibited performance that could not be detected as statistically significantly different. For relatively small data, using eight processors means that each processor sees little data, and thus the histograms suffer in accuracy. This may explain the degradation in performance when using eight processors.
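A minimal sketch of the update and merge procedures of Section 2.2, assuming a simple list-of-bins representation (the class and method names are illustrative, not the paper's implementation):

```python
class StreamingHistogram:
    """Fixed-size histogram of (centroid, count) bins.

    `max_bins` plays the role of r in the text: when the number of bins
    exceeds it, the two closest bins are merged into their weighted
    average until the bound is restored.
    """

    def __init__(self, max_bins):
        self.max_bins = max_bins
        self.bins = []  # sorted list of [centroid, count]

    def update(self, p):
        # If p coincides with an existing centroid, just bump its count.
        for b in self.bins:
            if b[0] == p:
                b[1] += 1
                return
        # Otherwise add a unit bin and restore the size bound.
        self.bins.append([p, 1])
        self.bins.sort(key=lambda b: b[0])
        self._shrink()

    def merge(self, other):
        # Union of two histograms: pool all bins, then merge closest pairs.
        self.bins = sorted(self.bins + other.bins, key=lambda b: b[0])
        self._shrink()

    def _shrink(self):
        while len(self.bins) > self.max_bins:
            # Find the adjacent pair with minimal gap q_{i+1} - q_i ...
            i = min(range(len(self.bins) - 1),
                    key=lambda j: self.bins[j + 1][0] - self.bins[j][0])
            (q1, k1), (q2, k2) = self.bins[i], self.bins[i + 1]
            # ... and replace it by the count-weighted average bin.
            self.bins[i:i + 2] = [[(q1 * k1 + q2 * k2) / (k1 + k2), k1 + k2]]
```

Both procedures take time polynomial in the (constant) number of bins per call, which is what makes the per-worker memory and communication cost independent of the stream length.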

Dataset        Examples        Features   Standard   SPDT, by number of processors
                                          tree       1       2       4       8
Adult          32561 (16281)   105        17.67      15.75   15.58*  16.16   16.50
Isolet         6238 (1559)     617        18.70      14.56*  17.90   19.69   19.31
Letter         20000           16         7.48*      8.65    9.28    10.13   10.07
Nursery        12960           25         1.01*      2.58    2.67    2.82    3.16
Page blocks    5473            10         3.13       3.07*   3.18    3.51    3.44
Pen digits     7494 (3498)     16         4.6*       5.37    5.43    5.20    5.83
Spambase       4601            57         8.37*      10.52   11.11   11.29   11.61

Table 1: Error rates for medium-sized datasets. The number of examples in parentheses is the number of test examples (if a train/test partition exists). The lowest error rate for each dataset is marked with an asterisk.

Dataset        Error rate       Tree size        Error rate      Tree size
               before pruning   before pruning   after pruning   after pruning
Adult          16.50            1645             14.34           409
Isolet         19.31            221              17.77           141
Letter         10.07            135              9.26            67
Nursery        3.16             178              3.21            167
Page blocks    3.44             55               3.44            36
Pen digits     5.83             89               5.83            81
Spambase       11.61            572              11.45           445

Table 2: Error rates and tree sizes (number of nodes) before and after pruning, with eight processors.

It is also interesting to study the effect of pruning on the error rate and tree size. Using the procedure described above, we pruned the trees obtained by SPDT. Table 2 shows that pruning usually improves the error rate (though not to a statistically significant threshold (sign test)), while reducing the tree size by 80% on average.

We tested SPDT for speedup and scalability on the alpha and beta datasets from the Pascal Large Scale Learning Challenge [14]. Both datasets have 500000 examples and 500 dimensions, out of which we extracted datasets of sizes 100, 1000, 10000, 100000, and 500000. Figure 1 shows the speedup for different sized datasets. We further tested speedup on five more datasets taken from the Pascal challenge: delta, epsilon, fd, ocr, and dna. Defining dataset size as the number of examples multiplied by the number of dimensions, we found that dataset size and speedup are highly correlated (Spearman correlation of 0.919). This fits the theoretical complexity analysis of the algorithm, which is dominated by the histogram building process.

For scalability, we examined the running time as a function of the dataset size. On a logarithmic scale, we obtain approximate regression curves (average R² = 0.9982) with slopes improving from 1.1 for a single processor to 0.8 for eight processors. Thus, our proposed algorithm is especially suited for cases where large data is available and processing can be shared among many processors.
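The scalability measurement above (the slope of running time versus dataset size on a logarithmic scale) amounts to an ordinary least-squares fit in log-log space. A minimal sketch, with an illustrative function name:

```python
import math

def scalability_slope(sizes, times):
    """Least-squares slope of log(time) against log(size).

    A slope near 1 means running time grows roughly linearly with
    dataset size; slopes below 1 indicate sublinear growth, as reported
    for SPDT with eight processors.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in times]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den
```

For example, running times exactly proportional to dataset size give a slope of 1.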

4. REFERENCES

[1] K. Alsabti, S. Ranka, and V. Singh. CLOUDS: Classication for large or out-of-core datasets. In Conference on Knowledge Discovery and Data Mining,

August 1998. [2] N. Amado, J. Gama, and F. Silva. Parallel implementation of decision tree learning algorithms. In The 10th Portuguese Conference on Articial Intelligence on Progress in Articial Intelligence, Knowledge Extraction, Multi-agent Systems, Logic Programming and Constraint Solving, pages 613, December 2001. [3] C. L. Blake, E. J. Keogh, and C. J. Merz. UCI repository of machine learning databases, 1998. [4] L. Bottou and O. Bousquet. The tradeos of large scale learning. In Advances in Neural Information Processing Systems, volume 20. MIT Press, Cambridge, MA, 2008. to appear. [5] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classication and Regression Trees. Wadsworth, Monterrey, CA, 1984. [6] J. Dem sar. Statistical comparisons of classiers over multiple data sets. Journal of Machine Learning Research, 7:130, 2006. [7] J. Gehrke, V. Ganti, R. Ramakrishnan, and W. Y. Loh. BOAT optimistic decision tree construction. In ACM SIGMOD International Conference on Management of Data, pages 169180, June 1999. [8] S. Goil and A. Choudhary. Ecient parallel classication using dimensional aggregates. In Workshop on Large-Scale Parallel KDD Systems, SIGKDD, pages 197210, August 1999. [9] I. D. Guedalia, M. London, , and M. Werman. An on-line agglomerative clustering method for nonstationary data. Neural Comp., 11(2):521540, 1999. [10] R. Jin and G. Agrawal. Communication and memory ecient parallel decision tree construction. In The 3rd SIAM International Conference on Data Mining, May

6 100 examples 1000 examples 10000 examples 100000 examples 500000 examples 6 100 examples 1000 examples 10000 examples 100000 examples 500000 examples

Speedup

4 3

Speedup
2 3 4 5 6 7 8

0 1

0 1

Number of processors

Figure 1: Speedup of the SPDT algorithm for the alpha (left) and beta (right) datasets.

2003. [11] M. V. Joshi, G. Karypis, and V. Kumar. ScalParC: A new scalable and ecient parallel classication algorithm for mining large datasets. In The 12th International Parallel Processing Symposium, pages 573579, March 1998. [12] M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classier for data mining. In The 5th International Conference on Extending Database Technology, pages 1832, 1996. [13] G. J. Narlikar. A parallel, multithreaded decision tree builder. In Technical Report CMU-CS-98-184, Carnegie Mellon University, 1998. [14] Pascal large scale learning challenge, 2008. [15] Ibm parallel machine learning toolbox. [16] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993. [17] J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classier for data mining. In The 22nd International Conference on Very Large Databases, pages 544555, September 1996. [18] M. K. Sreevinas, K. Alsabti, and S. Ranka. Parallel out-of-core divide-and-conquer techniques with applications to classication trees. In The 13th International Symposium on Parallel Processing and the 10th Symposium on Parallel and Distributed Processing, pages 555562, 1999. [19] A. Srivastava, E.-H. Han, V. Kumar, and V. Singh. Parallel formulations of decision-tree classication algorithms. Data Mining and Knowledge Discovery, 3(3):237261, September 1999.