
**Improved C4.5 Algorithm for the Analysis of Sales**

Rong Cao
School of Computer Science and Engineering
Southeast University
Nanjing 211189, China
Email: caorong_1986@126.com

Lizhen Xu
School of Computer Science and Engineering
Southeast University
Nanjing 211189, China
Email: lzxu@seu.edu.cn

**Abstract**—A decision tree is an important means of data mining and inductive learning, usually used to build classifiers and prediction models. C4.5 is one of the classic classification algorithms in data mining, but its efficiency is very low when it is applied to mass calculations. In this paper, the rule of C4.5 is improved by applying L'Hospital's Rule, which simplifies the calculation process and improves the efficiency of the decision-making algorithm. The same principle is applied when calculating the rate of information gain, which improves the algorithm considerably. The application at the end of the paper shows that the improved algorithm is efficient, better suited to large amounts of data, and that its efficiency is greatly improved in practical use.

II. RELATED RESEARCH SUMMARIES

Quinlan put forward the ID3 decision tree algorithm, based on information gain, in 1986, and later an improved algorithm, C4.5, in 1993. In the following years many scholars proposed various improvements to the decision tree algorithm. The problem is that these algorithms scan and sort the data collection several times while constructing the tree, so processing speed drops sharply when the data set is too large to fit in memory.

At present there is little literature on improving the efficiency of decision tree classification algorithms, and what exists makes only simple improvements. For example, Wei Zhao and Jianming Su [8] proposed an improvement to the ID3 algorithm that simplifies the information gain using Taylor's formula; but that improvement suits small amounts of data and is not particularly effective on large data sets.

Because large data sets must be handled, a variety of decision tree classification algorithms have been considered. The advantages of the C4.5 algorithm are significant, so it is chosen here; its efficiency, however, must be improved to meet the dramatic growth in the amount of data.

**Keywords**—decision tree; C4.5; the rate of information gain; large data sets

I. INTRODUCTION

With the rapid development of information technology, the amount of data grows at an amazing rate, and people in various fields urgently need to extract useful information from these large data sets. Classification [1] [2] is one of the most important and widely used techniques; its purpose is to generate an accurate classifier by analyzing the characteristics of a training data set, so as to decide the category of unknown data samples.

In this paper we analyze several decision tree classification algorithms currently in use, including ID3 [3] and C4.5 [4] as well as some of their later improvements [5] [6] [7]. When these algorithms are used to process commodity sales data, their efficiency is very low and they consume excessive memory. On this basis, working with a large collection of goods sales data, we improve the efficiency of the C4.5 algorithm, using L'Hospital's Rule to simplify the calculation process by an approximation. Although accuracy is slightly reduced, an application to commodity sales analysis (tobacco sales are used as the running example) shows that the improved C4.5 algorithm is efficient: it has no essential impact on the decision-making outcome, yet greatly improves efficiency and reduces memory use, so it is better suited to processing large data collections.

III. THE IMPROVEMENT OF THE C4.5 ALGORITHM

A. The improvement

The C4.5 algorithm [3] [4] generates a decision tree by learning from a training set in which each example is described by attribute-value pairs. At each node, the attribute with the maximum rate of information gain is selected as the test attribute, and the root of the decision tree is obtained in this way. Studying the procedure carefully, we find that selecting the test attribute at every node requires logarithmic calculations, and that the same kind of calculation is repeated again and again, which hurts the efficiency of decision tree generation when the data set is large. We also observe that the antilogarithm in each logarithmic calculation is usually small, so the process can be simplified with L'Hospital's Rule, stated as follows.

If f(x) and g(x) satisfy:

(1) lim_{x→x0} f(x) and lim_{x→x0} g(x) are both zero or both ∞;

(2) in a deleted neighborhood of the point x0, f'(x) and g'(x) both exist and g'(x) ≠ 0;

978-0-7695-3874-7/09 $25.00 © 2009 IEEE
DOI 10.1109/WISA.2009.36

(3) lim_{x→x0} f'(x)/g'(x) exists or is ∞;

then

lim_{x→x0} f(x)/g(x) = lim_{x→x0} f'(x)/g'(x).

Applying the rule to ln(1 − x) as x approaches 0:

lim_{x→0} ln(1 − x)/(−x) = lim_{x→0} [ln(1 − x)]'/(−x)' = lim_{x→0} [−1/(1 − x)]/(−1) = lim_{x→0} 1/(1 − x) = 1,

so that

ln(1 − x) = −x (as x approaches 0)   (1)
ln(1 − x) ≈ −x (when x is quite small)   (2)

In order to facilitate the improvement of the calculation, suppose c = 2, i.e. there are only two categories, as in the basic definition of C4.5. In the sample set S let the number of positive examples be p and the number of negative examples n, with N = p + n. For a two-valued attribute A define:

S1: the number of examples for which A is positive; S2: the number for which A is negative;
S11: the number of examples for which A is positive and the class is positive;
S12: the number for which A is positive and the class is negative;
S21: the number for which A is negative and the class is positive;
S22: the number for which A is negative and the class is negative.

Each candidate attribute's rate of information gain is calculated, and the one with the largest value is selected as the root:

Gain-Ratio(A) = Gain(A)/I(A) = [I(p, n) − E(S, A)] / I(S1, S2),

where E(S, A) = Σ_j (p_j + n_j)/(p + n) · I(S1j, S2j), with p_j and n_j the numbers of positive and negative examples in the j-th subset. For a two-valued attribute this is

E(S, A) = (S1/N) I(S11, S12) + (S2/N) I(S21, S22).

Writing out the logarithms (the common minus signs of numerator and denominator cancel):

Gain-Ratio(S, A) = { (p/N) log2(p/N) + (n/N) log2(n/N) − { (S1/N)[(S11/S1) log2(S11/S1) + (S12/S1) log2(S12/S1)] + (S2/N)[(S21/S2) log2(S21/S2) + (S22/S2) log2(S22/S2)] } } / { (S1/N) log2(S1/N) + (S2/N) log2(S2/N) }

Every term in both numerator and denominator carries a logarithmic calculation and a factor 1/N. Dividing numerator and denominator by log2(e) and multiplying both by N simultaneously gives

Gain-Ratio(S, A) = { p ln(p/N) + n ln(n/N) − {[S11 ln(S11/S1) + S12 ln(S12/S1)] + [S21 ln(S21/S2) + S22 ln(S22/S2)]} } / { S1 ln(S1/N) + S2 ln(S2/N) }

Because N = p + n, we can replace p/N with 1 − n/N and n/N with 1 − p/N, and likewise S11/S1 = 1 − S12/S1, and so on:

Gain-Ratio(S, A) = { p ln(1 − n/N) + n ln(1 − p/N) − {[S11 ln(1 − S12/S1) + S12 ln(1 − S11/S1)] + [S21 ln(1 − S22/S2) + S22 ln(1 − S21/S2)]} } / { S1 ln(1 − S2/N) + S2 ln(1 − S1/N) }

Every antilogarithm is now of the form 1 − x with x a probability less than 1, so equation (2) applies to each term; after the approximation every product appears twice and the common factor 2 cancels, giving

Gain-Ratio(S, A) = { pn/N − [S11·S12/S1 + S21·S22/S2] } / (S1·S2/N)

The final expression involves only the original counts with addition, subtraction, multiplication and division, and no logarithmic calculation, so the computing time is much shorter than for the original expression.

B. Reasonable arguments for the improvement

In the improvement of C4.5 above, no term is added or removed; only an approximate calculation is used when computing the rate of information gain, and the approximation is guaranteed by L'Hospital's Rule, so the improvement is reasonable. Furthermore, every antilogarithm in the logarithmic calculation is a probability less than 1. This paper considers only two categories, where the probabilities are somewhat larger than in the multi-class case; the simplification can be extended to multiple classes, and since the probabilities become smaller as the number of categories grows, the approximation is then even easier to justify.

C. Comparison of the complexity

To calculate Gain-Ratio(S, A) with the original formula, each probability value in E(S, A) must be computed first, which takes O(n) time; each value is then passed through a logarithm, multiplied and accumulated, which takes O(log2 n) time, so the total complexity of Gain-Ratio(S, A) is O(n(log2 n)^2). The improved formula involves only the original counts and the four basic operations: one scan of the data suffices to obtain the totals, followed by a few simple calculations, so the total complexity is O(n).
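The simplified expression can be checked numerically against the exact definition. The sketch below (Python, illustrative; the function names are ours, not from the paper) computes the gain ratio of a two-valued attribute both with the original logarithmic formula and with the simplified log-free formula, and shows that the two agree on which attribute scores higher even though their absolute values differ.

```python
import math

def gain_ratio_exact(branch_pos, branch_neg):
    """Original C4.5 gain ratio for a two-valued attribute, two classes.

    branch_pos = (S11, S21): positive-class counts when A is positive/negative.
    branch_neg = (S12, S22): negative-class counts when A is positive/negative.
    """
    def entropy(counts):
        total = sum(counts)
        return -sum(c / total * math.log2(c / total) for c in counts if c)

    S11, S21 = branch_pos
    S12, S22 = branch_neg
    S1, S2 = S11 + S12, S21 + S22          # branch sizes
    N = S1 + S2
    p, n = S11 + S21, S12 + S22            # class sizes
    gain = entropy((p, n)) - (S1 / N) * entropy((S11, S12)) \
                           - (S2 / N) * entropy((S21, S22))
    return gain / entropy((S1, S2))        # divide by the split information

def gain_ratio_simplified(branch_pos, branch_neg):
    """The paper's approximation:
    (p*n/N - S11*S12/S1 - S21*S22/S2) / (S1*S2/N)."""
    S11, S21 = branch_pos
    S12, S22 = branch_neg
    S1, S2 = S11 + S12, S21 + S22
    N = S1 + S2
    p, n = S11 + S21, S12 + S22
    return (p * n / N - S11 * S12 / S1 - S21 * S22 / S2) / (S1 * S2 / N)
```

For example, an attribute that splits 50/50 examples into branches of (40, 10) and (10, 40) gets a higher score than one splitting them (30, 20) and (20, 30) under both formulas, which is all the tree-building step needs.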

IV. EXPERIMENTS

To verify the improved algorithm, we implemented it on the real sales data of a tobacco company for a certain year and compared the result with the original algorithm; the contrast ratio between the two is calculated as well.

A. Data Preparation

According to the requirements of the cigarette sales analysis, the following three steps are taken to obtain a data set of the targets that affect cigarette sales: data cleaning (to reduce noisy values and deal with null values), correlation analysis (feature selection) and data conversion (to generalize or normalize the data). An analysis table is obtained from the cigarette table and the cigarette sales table, connected with one join and one query statement. The data set has four attributes: price category (A), cigarette packing specification (B), cigarette specification (C) and cigarette production place (D). The class attribute sale (E) is divided into high sales volume (positive examples) and low sales volume (negative examples).

TABLE 1: THE SALE OF CIGARETTE

sale (E) | price (A) | cigarette packing (B) | cigarette (C) | production place (D)
Low sales (75) | Low-grade (43), High-grade (32) | Hardboard box (62), No hardboard box (13) | cured tobacco (17), General Cigar (58) | Outside of the province (67), Inside of the province (8)
High sales (95) | Low-grade (82), High-grade (13) | Hardboard box (65), No hardboard box (30) | cured tobacco (6), General Cigar (89) | Outside of the province (68), Inside of the province (27)

B. Results and Discussion

Each attribute's information gain ratio is calculated with both the improved and the original C4.5, and the attribute with the largest information gain ratio is selected as the root to build the decision tree.

[Figure 1. The contrast between the gain ratios of C4.5 and the improved C4.5 for attributes A–D]
[Figure 2. The efficiency comparison between C4.5 and the improved C4.5]
[Figure 3. The decision tree generated by C4.5]
[Figure 4. The decision tree generated by the improved C4.5]

We find that although the information gain ratio changes a little and the precision is slightly worse than C4.5, the contrast ratio is not large, so the improvement has no profound influence on the decision tree while efficiency improves a lot. Time is saved because the complexity drops from O(n(log2 n)^2) to O(n), and memory is saved too, since the improved C4.5 does not need to scan the data several times. The improved algorithm is therefore well suited to large amounts of data.
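As an illustration, the simplified gain ratio can be computed directly from the counts in Table 1 and used to pick the root attribute. The sketch below is ours; the (high-sales, low-sales) pairing of the counts follows the reconstruction of the flattened table above.

```python
# (high-sales, low-sales) counts per attribute value, read from Table 1.
TABLE = {
    "A: price":            {"Low-grade": (82, 43), "High-grade": (13, 32)},
    "B: packing":          {"Hardboard box": (65, 62), "No hardboard box": (30, 13)},
    "C: cigarette":        {"cured tobacco": (6, 17), "General Cigar": (89, 58)},
    "D: production place": {"Outside province": (68, 67), "Inside province": (27, 8)},
}

def simplified_gain_ratio(branches):
    """Improved C4.5: (p*n/N - S11*S12/S1 - S21*S22/S2) / (S1*S2/N)."""
    (S11, S12), (S21, S22) = branches.values()  # two attribute values
    S1, S2 = S11 + S12, S21 + S22               # examples per attribute value
    N = S1 + S2
    p = S11 + S21                               # high-sales (positive) examples
    n = S12 + S22                               # low-sales (negative) examples
    return (p * n / N - S11 * S12 / S1 - S21 * S22 / S2) / (S1 * S2 / N)

ratios = {attr: simplified_gain_ratio(b) for attr, b in TABLE.items()}
root = max(ratios, key=ratios.get)              # attribute chosen as the root
```

With these counts the price category (A) yields the largest simplified gain ratio (about 0.135), consistent with the contrast values surviving in the Figure 1 residue, so A becomes the root of the decision tree.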

V. CONCLUSION

In this paper, based on research into the ID3 decision tree algorithm and its improvements, the C4.5 algorithm was improved, and the improved algorithm was verified by the analysis of tobacco sales. With the improved algorithm the efficiency of classification is greatly improved, and the disadvantages of low efficiency and heavy memory consumption that C4.5 shows when dealing with large amounts of data are overcome. Although an approximate calculation is used for Gain-Ratio(S, A), the experiment shows it has minimal impact on classification accuracy while efficiency increases a lot. We can not only speed up the growing of the decision tree but also obtain a better-structured tree: the improved algorithm reduces the influence of unimportant attributes on commodity sales and increases that of important attributes, because the important attributes move closer to the root.

It should be pointed out that in this paper all attributes are divided into two classes to keep the formulas concise. In practice, attributes can be divided into more than two classes as required, so that better rules can be excavated; the more classes are divided, the more precise the result, and the improvement remains applicable. Comparing the rules extracted by the two algorithms, there are small differences between the decision trees of C4.5 and the improved C4.5, and some differences between their classification rules. If the amount of data is not very large, the original C4.5 is recommended because of its higher accuracy; on large data sets the improved C4.5 speeds up the processing phase and improves efficiency, giving faster and more effective results without changing the final decision.

REFERENCES

[1] Liu, B., Hsu, W., Ma, Y. Integrating classification and association rule mining. In: Agrawal, R., ed. Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining. New York: AAAI Press, 1998. 80–86.
[2] Wang, M., Iyer, B., Vitter, J.S. Scalable mining for classification rules in relational databases. In: Eaglestone, B., Desai, B.C., Shao, J., eds. Proceedings of the 1998 International Database Engineering and Applications Symposium. Cardiff, Wales: IEEE Computer Society, 1998. 58–67.
[3] Quinlan, J.R. Induction of decision trees. Machine Learning, 1986, 1: 81–106.
[4] Quinlan, J.R. C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann, 1993.
[5] Mehta, M., Agrawal, R., Rissanen, J. SLIQ: a fast scalable classifier for data mining. In: Apers, P., Bouzeghoub, M., Gardarin, G., eds. Proceedings of the 5th International Conference on Extending Database Technology. Berlin: Springer-Verlag, 1996. 18–32.
[6] Quinlan, J.R. Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research, 1996, 4: 77–90.
[7] UCI Repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html
[8] Wei Zhao, Jianming Su. The Computer Application and Software, 2003.
