2009 Sixth Web Information Systems and Applications Conference

Improved C4.5 Algorithm for the Analysis of Sales
Rong Cao, Lizhen Xu

School of Computer Science and Engineering
Southeast University
Nanjing 211189, China
Email: caorong_1986@126.com, lzxu@seu.edu.cn

Abstract—A decision tree is an important means of data mining and inductive learning, usually used to build classifiers and prediction models. C4.5 is one of the classic classification algorithms in data mining, but its efficiency is very low when it is used in mass calculations. In this paper, the splitting criterion of C4.5 is improved by applying L'Hospital's Rule, which simplifies the calculation process and improves the efficiency of the decision-making algorithm. When calculating the rate of information gain, an approximation based on the same principle is used, which speeds up the algorithm considerably. The application at the end of the paper shows that the improved algorithm is efficient and better suited to large amounts of data, and that its efficiency is greatly improved in line with practical application.


Keywords-decision tree, algorithm C4.5, the rate of information gain, large data sets

I. INTRODUCTION

With the rapid development of information technology, the amount of data grows at an amazing rate. People in various fields urgently need to extract useful information from these large data sets. Classification [1] [2] is one of the most important and most widely used data mining tasks; its purpose is to generate an accurate classifier by analyzing the characteristics of a training data set, so as to decide the category of unknown data samples.
In this paper, we analyze several decision tree classification algorithms currently in use, including the ID3 [3] and C4.5 [4] algorithms as well as some of their later improvements [5] [6] [7]. When these classification algorithms are used to process commodity sales data, we find that their efficiency is very low and that they consume excessive memory. On this basis, working with a large quantity of goods sales data, we put forward an improvement to the efficiency of the C4.5 algorithm that uses L'Hospital's Rule to simplify the calculation process by approximation. Although accuracy is slightly reduced, the application to commodity sales analysis (taking tobacco sales analysis as an example) indicates that the improved C4.5 algorithm is efficient. The improved algorithm has no essential impact on the outcome of decision-making, but it greatly improves efficiency and reduces memory use, so it is better suited to processing large data collections.
II. RELATED RESEARCH SUMMARIES

Quinlan put forward the ID3 decision tree algorithm, based on information gain, and later an improved algorithm, C4.5, in 1993. In the following years, many scholars made various improvements to the decision tree algorithm. The problem is that these algorithms need to scan and sort the data collection several times during the construction of the decision tree, so the processing speed drops greatly when the data set is too large to fit in memory.
At present, there is little literature on improving the efficiency of decision tree classification algorithms, and the existing work makes only simple improvements. For example, Wei Zhao and Jianming Su [8] proposed an improvement to the ID3 algorithm which simplifies the information gain by using Taylor's formula. But this improvement is more suitable for small amounts of data, so it is not particularly effective on large data sets.
Since large data sets must be handled, a variety of decision tree classification algorithms were considered. The advantages of the C4.5 algorithm are significant, so it was chosen; but its efficiency must be improved to meet the dramatic increase in the amount of data to be processed.

III. THE IMPROVEMENT OF THE C4.5 ALGORITHM

A. The improvement
The C4.5 algorithm [3] [4] generates a decision tree by learning from a training set, in which each example is structured as attribute-value pairs. The attribute with the maximum rate of information gain is selected as the current node, and the root node of the decision tree is obtained in this way. Studying the algorithm carefully, we find that selecting the test attribute for each node requires logarithmic calculations, and that these calculations are repeated every time. This hurts the efficiency of decision tree generation when the data set is large. We also find, after studying the calculation process carefully, that the antilogarithm in these logarithmic calculations is usually small, so the process can be simplified by using L'Hospital's Rule, as follows.
If f(x) and g(x) satisfy:

(1) $\lim_{x \to x_0} f(x)$ and $\lim_{x \to x_0} g(x)$ are both zero or both infinite;

(2) in a deleted neighborhood of the point $x_0$, both $f'(x)$ and $g'(x)$ exist and $g'(x) \neq 0$;

(3) $\lim_{x \to x_0} \frac{f'(x)}{g'(x)}$ exists or is $\infty$;

then $\lim_{x \to x_0} \frac{f(x)}{g(x)}$ also exists or is $\infty$, and

$$\lim_{x \to x_0} \frac{f(x)}{g(x)} = \lim_{x \to x_0} \frac{f'(x)}{g'(x)}.$$

Applying the rule to $f(x) = \ln(1-x)$ and $g(x) = -x$:

$$\lim_{x \to 0} \frac{\ln(1-x)}{-x} = \lim_{x \to 0} \frac{[\ln(1-x)]'}{(-x)'} = \lim_{x \to 0} \frac{-\frac{1}{1-x}}{-1} = \lim_{x \to 0} \frac{1}{1-x} = 1, \quad (1)$$

viz.

$$\ln(1-x) \approx -x \quad \text{(when } x \text{ is quite small).} \quad (2)$$
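The quality of this first-order approximation is easy to check numerically. The short Python sketch below (an illustration added for concreteness, not part of the original algorithm) compares ln(1 − x) with −x; the relative error shrinks as x approaches 0, which matches the observation above that the antilogarithms involved are usually small.

```python
# Numerical check of the approximation ln(1 - x) ~ -x from equation (2).
import math

for x in [0.5, 0.1, 0.01, 0.001]:
    exact = math.log(1 - x)      # natural logarithm ln(1 - x)
    approx = -x                  # first-order approximation
    rel_err = abs(exact - approx) / abs(exact)
    print(f"x = {x:<6}  ln(1-x) = {exact:+.6f}  -x = {approx:+.6f}  rel. err = {rel_err:.2%}")
```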


To facilitate the improvement of the calculation, suppose c = 2; that is, there are only two categories, as in the basic definition of the C4.5 algorithm. Suppose that in the sample set S the number of positive examples is p and the number of negative examples is n, and let N = p + n. Then

$$I(p,n) = -\frac{p}{N}\log_2\frac{p}{N} - \frac{n}{N}\log_2\frac{n}{N},$$

$$E(S,A) = \sum_{j=1}^{v}\frac{p_j + n_j}{p + n}\, I(p_j, n_j),$$

in which $p_j$ and $n_j$ are respectively the numbers of positive and negative examples in the j-th subset of the sample set. Each candidate attribute's information gain ratio is calculated, and the attribute with the largest value is selected as the root:

$$\text{Gain-Ratio}(A) = \frac{\text{Gain}(A)}{I(A)} = \frac{I(p,n) - E(S,A)}{I(A)}.$$

For a two-valued attribute A, define:

S1: the number of examples in which A is positive;
S2: the number of examples in which A is negative;
S11: the number of examples in which A is positive and the class value is positive;
S12: the number of examples in which A is positive and the class value is negative;
S21: the number of examples in which A is negative and the class value is positive;
S22: the number of examples in which A is negative and the class value is negative.

So we can get:

$$E(S,A) = \frac{S_1}{N} I(S_{11}, S_{12}) + \frac{S_2}{N} I(S_{21}, S_{22}).$$

Going on with the simplification, we can get:

$$\text{Gain-Ratio}(S,A) = \left\{\frac{p}{N}\log_2\frac{p}{N} + \frac{n}{N}\log_2\frac{n}{N} - \frac{S_1}{N}\left[\frac{S_{11}}{S_1}\log_2\frac{S_{11}}{S_1} + \frac{S_{12}}{S_1}\log_2\frac{S_{12}}{S_1}\right] - \frac{S_2}{N}\left[\frac{S_{21}}{S_2}\log_2\frac{S_{21}}{S_2} + \frac{S_{22}}{S_2}\log_2\frac{S_{22}}{S_2}\right]\right\} \bigg/ \left\{\frac{S_1}{N}\log_2\frac{S_1}{N} + \frac{S_2}{N}\log_2\frac{S_2}{N}\right\}.$$

In the equation above, each term in both the numerator and the denominator contains a logarithmic calculation and the factor 1/N. Dividing the numerator and the denominator by $\log_2 e$ and multiplying both by N simultaneously, we get:

$$\text{Gain-Ratio}(S,A) = \left\{p\ln\frac{p}{N} + n\ln\frac{n}{N} - \left[S_{11}\ln\frac{S_{11}}{S_1} + S_{12}\ln\frac{S_{12}}{S_1} + S_{21}\ln\frac{S_{21}}{S_2} + S_{22}\ln\frac{S_{22}}{S_2}\right]\right\} \bigg/ \left\{S_1\ln\frac{S_1}{N} + S_2\ln\frac{S_2}{N}\right\}.$$

Because N = p + n, we have $\frac{p}{N} + \frac{n}{N} = 1$, so $\frac{p}{N}$ can be replaced by $1 - \frac{n}{N}$, and similarly for the other ratios:

$$\text{Gain-Ratio}(S,A) = \left\{p\ln\!\left(1-\frac{n}{N}\right) + n\ln\!\left(1-\frac{p}{N}\right) - \left[S_{11}\ln\!\left(1-\frac{S_{12}}{S_1}\right) + S_{12}\ln\!\left(1-\frac{S_{11}}{S_1}\right) + S_{21}\ln\!\left(1-\frac{S_{22}}{S_2}\right) + S_{22}\ln\!\left(1-\frac{S_{21}}{S_2}\right)\right]\right\} \bigg/ \left\{S_1\ln\!\left(1-\frac{S_2}{N}\right) + S_2\ln\!\left(1-\frac{S_1}{N}\right)\right\}.$$

Because we already have equation (2), each logarithm $\ln(1-x)$ can be replaced by $-x$, and the common factor $-2$ in the numerator and the denominator cancels, so we get:

$$\text{Gain-Ratio}(S,A) = \frac{\dfrac{pn}{N} - \left[\dfrac{S_{11}S_{12}}{S_1} + \dfrac{S_{21}S_{22}}{S_2}\right]}{\dfrac{S_1 S_2}{N}}.$$

The expression above involves only addition, subtraction, multiplication and division, and no logarithmic calculation, so the computing time is much shorter than for the original expression.

B. Reasonable arguments for the improvement
In the improvement of C4.5 above, no term is added or removed; only an approximate calculation is introduced when the information gain ratio is computed, and the approximation is guaranteed by L'Hospital's Rule, so the improvement is reasonable. Moreover, the antilogarithm of every logarithm involved is a probability, which is less than 1. In this article there are only two categories, so each probability is somewhat larger than in the multi-class case; the probability becomes smaller as the number of categories grows, which makes the approximation even easier to justify, so the simplification can be extended to the multi-class case.

C. Comparison of the complexity
The C4.5 algorithm's computation is mainly concentrated in I(p, n) and E(S, A). To calculate Gain-Ratio(S, A), each probability value in E(S, A) must be calculated first, which needs O(n) time; each logarithmic term is then evaluated, multiplied and accumulated, which needs O((log2 n)^2) time per term, so the total complexity of Gain-Ratio(S, A) is O(n(log2 n)^2). The improved Gain-Ratio(S, A) involves only the original counts and only addition, subtraction, multiplication and division, so one scan suffices to obtain the totals, followed by a few simple calculations; its total complexity is O(n).
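To make the comparison in subsection C concrete, the following Python sketch implements both criteria for the two-class, two-valued-attribute setting defined above. It is a minimal illustration written for this presentation (the function names are ours, and every attribute value and class is assumed to occur at least once), not production code.

```python
# Minimal sketch of the original and simplified gain-ratio criteria for the
# two-class case. S11/S12: positive/negative examples where attribute A is
# positive; S21/S22: positive/negative examples where A is negative.
import math

def entropy(a, b):
    """I(a, b): two-class entropy, in bits, of a node with counts a and b."""
    total = a + b
    return -sum(c / total * math.log2(c / total) for c in (a, b) if c > 0)

def gain_ratio_exact(S11, S12, S21, S22):
    """Standard C4.5 gain ratio, with logarithms."""
    S1, S2 = S11 + S12, S21 + S22      # examples per attribute value
    p, n = S11 + S21, S12 + S22        # class totals
    N = p + n
    gain = entropy(p, n) - (S1 / N) * entropy(S11, S12) - (S2 / N) * entropy(S21, S22)
    return gain / entropy(S1, S2)      # divide by the split information I(A)

def gain_ratio_simplified(S11, S12, S21, S22):
    """The paper's approximation: only +, -, *, / on the raw counts."""
    S1, S2 = S11 + S12, S21 + S22
    p, n = S11 + S21, S12 + S22
    N = p + n
    return (p * n / N - S11 * S12 / S1 - S21 * S22 / S2) / (S1 * S2 / N)

# A quick comparison on a split that separates the two classes fairly well:
print(gain_ratio_exact(90, 10, 10, 90))       # ~0.53
print(gain_ratio_simplified(90, 10, 10, 90))  # ~0.64, coarser but rank-consistent here
```

Note that the simplified value need not match the exact one; what matters for tree construction is which attribute attains the largest ratio.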

IV. EXPERIMENTS

The improved algorithm is aimed at large amounts of data, so we implemented it on the real sales data of a tobacco company in a certain year. Each attribute's information gain ratio is calculated by both the improved and the original C4.5 algorithm, the results are compared, and the contrast ratio (the difference between the two) is calculated as well. The concrete implementation is as follows.

A. Data Preparation
According to the requirements of the analysis of cigarette sales, an analysis data table is obtained by joining the cigarette table and the cigarette sales table with one connection and query statement. Three further steps are applied to obtain the data set of targets relevant to cigarette sales: data cleaning (to reduce noisy values and deal with null values), correlation analysis (feature selection) and data conversion (to generalize or normalize the data). The resulting data set has four attributes: price category (A), cigarette packing specification (B), cigarette specification (C) and cigarette production place (D). The examples are divided into high sales volume (positive examples) and low sales volume (negative examples).

TABLE 1: THE SALE OF CIGARETTES

sale (E)          price (A)         cigarette packing (B)    cigarette (C)         production place (D)
High sales (95)   Low-grade (82)    Hardboard box (65)       Cured tobacco (6)     Outside of the province (68)
                  High-grade (13)   No hardboard box (30)    General cigar (89)    Inside of the province (27)
Low sales (75)    Low-grade (43)    Hardboard box (62)       Cured tobacco (17)    Outside of the province (67)
                  High-grade (32)   No hardboard box (13)    General cigar (58)    Inside of the province (8)

Each attribute's information gain ratio is calculated in both the improved and the original C4.5 algorithm, and the attribute with the largest information gain ratio is selected as the root to grow the decision tree. The difference (contrast ratio) between the two algorithms is calculated for each attribute.

Figure 1. The calculation of the contrast between C4.5 and the improved C4.5 (information gain ratios of attributes A-D and their contrast ratios).

Figure 2. The efficiency comparison between C4.5 and the improved C4.5.

Figure 3. The decision tree generated by C4.5.

Figure 4. The decision tree generated by the improved C4.5.

B. Discussion
We can see that although the information gain ratio changes a little and the precision is worse than that of C4.5, the contrast ratio is not very large, so the approximation has no profound influence on the decision tree while the efficiency is improved a lot. Time is saved because the complexity drops from O(n(log2 n)^2) to O(n); what is more, the improved C4.5 does not need to scan the data several times, so memory is saved too.
Comparing the rules extracted by the two decision tree algorithms, there are small differences between the trees generated by C4.5 and the improved C4.5, and also some differences between their classification rules. The improved algorithm reduces the influence of unimportant attributes on commodity sales and increases the influence of important attributes, because the important attributes move closer to the root. Thus we can get faster and more effective results without a change in the final decision.
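For a concrete sense of the contrast ratio discussed above, the gain ratio of attribute A can be recomputed from the Table 1 counts with both criteria. The script below is an illustrative sketch for this paper's two-class setting; the counts are transcribed from the reconstructed Table 1, and the printed values are ours, not the figures reported in the experiments.

```python
# Illustrative recomputation for attribute A (price) from Table 1:
# positive class = high sales, negative class = low sales.
import math

def entropy(a, b):  # two-class entropy in bits
    total = a + b
    return -sum(c / total * math.log2(c / total) for c in (a, b) if c > 0)

S11, S12 = 82, 43   # low-grade cigarettes: high sales / low sales
S21, S22 = 13, 32   # high-grade cigarettes: high sales / low sales
S1, S2 = S11 + S12, S21 + S22
p, n = S11 + S21, S12 + S22
N = p + n

exact = (entropy(p, n) - (S1 / N) * entropy(S11, S12)
         - (S2 / N) * entropy(S21, S22)) / entropy(S1, S2)
approx = (p * n / N - S11 * S12 / S1 - S21 * S22 / S2) / (S1 * S2 / N)
print(f"exact gain ratio:      {exact:.3f}")   # ~0.093
print(f"simplified gain ratio: {approx:.3f}")  # ~0.135
```

The absolute values differ, but as the discussion notes, it is the relative ordering of the attributes, not the absolute gain ratios, that determines the structure of the tree.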

V. CONCLUSION

In this paper, based on research into the ID3 decision tree algorithm and its improvements, the C4.5 algorithm was improved, and the improvement was verified by the analysis of tobacco sales. With the improved algorithm the efficiency of classification is greatly increased, and the disadvantages of low efficiency and high memory consumption that C4.5 shows when dealing with large amounts of data are overcome. Although approximate calculation is used to compute Gain-Ratio(S, A), the experiment shows that it has minimal impact on the classification accuracy while the efficiency increases a lot. We can not only speed up the growing of the decision tree, but also get a better-structured decision tree.
It should be pointed out that in this paper all attributes are divided into two classes in order to keep the formulas concise. In practice, attributes can be divided into more than two classes according to the requirements; the more classes are used, the more precise the result is and the better the rules that can be excavated, and the improvement remains applicable. If the amount of data is not very large, the original C4.5 is recommended because of its higher accuracy. For large data sets, the improved algorithm is not only concise and practical but also more efficient: it speeds up the processing phase especially when used on large data sets.

REFERENCES

[1] Liu, B., Hsu, W., Ma, Y. Integrating classification and association rule mining. In: Agrawal, R., et al., eds. Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining. New York: AAAI Press, 1998. 80~86.
[2] Wang, M., Iyer, B., Vitter, J.S. Scalable mining for classification rules in relational databases. In: Eaglestone, B., Desai, B.C., Shao, J., eds. Proceedings of the 1998 International Database Engineering and Applications Symposium. Wales: IEEE Computer Society, 1998. 58~67.
[3] Quinlan, J.R. Induction of decision trees [J]. Machine Learning, 1986.
[4] Quinlan, J.R. C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann, 1993.
[5] Quinlan, J.R. Improved use of continuous attributes in C4.5 [J]. Journal of Artificial Intelligence Research, 1996, 4: 77~90.
[6] Mehta, M., Agrawal, R., Rissanen, J. SLIQ: a fast scalable classifier for data mining. In: Apers, P., Bouzeghoub, M., Gardarin, G., eds. Proceedings of the 5th International Conference on Extending Database Technology. Berlin: Springer-Verlag, 1996. 18~32.
[7] UCI Repository of machine learning databases. Department of Information and Computer Science, University of California. http://www.ics.uci.edu/~mlearn/MLRepository.html
[8] Wei Zhao, Jianming Su. The improvement of the ID3 algorithm [J]. The Computer Application and Software, 2003.