
2012 4th Conference on Data Mining and Optimization (DMO)

02-04 September 2012, Langkawi, Malaysia

Text Associative Classification Approach for Mining Arabic Data Set
Abdullah S. Ghareb, Abdul Razak Hamdan, Azuraliza Abu Bakar
Data Mining & Optimization Research Group, Center for Artificial Intelligence Technology,
Faculty of Information Science and Technology,
Universiti Kebangsaan Malaysia,
43600 UKM Bangi, Selangor, Malaysia.
aghurieb@yahoo.com, {arh, aab}@ftsm.ukm.my

Abstract—The text classification problem has received a great deal of research based on machine learning, statistical, and information retrieval techniques. In the last decade, associative classification (AC) algorithms, which depend on pure data mining techniques, have emerged as an effective classification method. In this paper, we examine the associative classification approach on the Arabic language to mine knowledge from an Arabic text data set. Two AC classification methods are applied in this study: single rule prediction and multiple rule prediction. The experimental results on different classes of the Arabic data set show that the multiple rule prediction method outperforms the single rule prediction method with regard to accuracy. In general, the associative classification approach is a suitable method for classifying Arabic text data sets, and it achieves good classification performance in terms of both classification time and classification accuracy.

Keywords-associative classification; class association rule; Arabic text.

I. INTRODUCTION

Associative classification (AC) is a subfield of data mining that utilizes two major tasks of the data mining process, association rule discovery and classification. AC merges the two to construct a rule based classification model for prediction. The use of associative classification for the text mining (classification) problem has appeared in the last few years [1]. Text classification is the process of automatically predicting the valid categories of text documents based on the knowledge discovered during the construction phase of the classification model. Currently, text classification has a wide range of applications, such as classification of news stories, e-mail routing, spam filtering, news monitoring, document indexing, searching for interesting information on the Internet, web page classification, and so on [2].

During the building and testing of a text associative classification model, we conduct several processes, such as data pre-processing, feature weighting and selection, class association rule generation, and definition of the prediction methods. Given a text training data set labelled with many categories, our approach considers each category individually. Assume that D = {D1, D2, D3, …, Dn} is the set of documents in the training data set, and T = {T1, T2, T3, …, Tn} is the set of terms that represent the documents after data preprocessing and feature selection. Two values are used to determine the importance of class association rules: minimum support and minimum confidence. These values are used as thresholds. Under this restriction, a set of frequent terms is generated, where frequent terms are those that exceed the minimum support threshold. From these frequent terms, the classification rules are discovered based on rule confidence and on the restriction that the right hand side of each rule must be a class label; for this reason, AC is considered a special case of association rule mining [3]. Formally, T → C is called a class association rule, where T is a set of frequent terms and C must be the class label associated with the rule. The rule is strong if the support of the co-occurrence T → C satisfies minimum support and its confidence satisfies minimum confidence [4].

Most classification methods have been applied to English and European languages. In this paper we investigate the use of AC to classify Arabic text documents. Arabic and English are two different languages. For instance, they have different alphabets: the English alphabet has 26 characters, while the Arabic alphabet has 28. The writing style is also different: Arabic text is written and read from right to left, while English text is written and read from left to right. In addition, the shape of an Arabic character is context-sensitive, depending on its location within the word. One of the important properties of Arabic is that it is a derivational and highly inflectional language, so morphological analysis plays a significant role when dealing with computerized Arabic text systems.

The rest of this paper is organized as follows: in Section II we highlight and discuss related work. Section III describes our methodology and the architecture of the AC approach, and addresses issues such as feature selection, associative classifier construction, and testing. In Section IV, we explain the experimental results. The conclusion and future work are given in Section V.

II. RELATED WORKS

Many previous efforts have been devoted to the development of classification methods. The main motivation of research on the classification problem is to obtain a classification model with acceptable accuracy. In the last decade of data mining research, AC was presented as a new method for text classification, and it has been applied with various

978-1-4673-2718-3/12/$31.00 ©2012 IEEE

algorithms in many studies. In particular, Su et al. [4] introduce a new text associative classifier based on generic rules with regard to closed item sets, and they use rough set theory to reduce the feature space. Abu Bakar et al. [5] use the AC approach to mine knowledge and support strategic decisions in insurance companies; in this study, two types of decision rules are processed using a heuristic approach that enhances CBA: heuristically processed rules generated from correctly classified data, and rules discovered from uncertainly classified data. Thabtah et al. [3] present a rule pruning method called high precedence for text associative classification, in which a rule is added to the classifier if its terms partially cover the terms in the training data set. The high precedence pruning method achieves higher accuracy with the MACAR algorithm than the lazy and database coverage pruning methods, and it produces a reasonable set of classification rules.

Antonie & Zaiane [6] and Srividhya & Anitha [7] describe and use an association rule based classifier built per category, which generates rules for each category of a training data set divided into N subsets by category. Thaicharoen [8] introduces a method called text association mining with cross-sentence inference; this method is considered a general approach that can be applied to any application domain. Chiang et al. [9] apply association rule mining with a manual category priority table to classify Chinese text. Abu Bakar et al. [10] employ the AC method to mine association rules and build a knowledge model from a diet nutrition data set; the results show that the model can be built from an unsupervised data set using the knowledge discovered in the association rule mining phase.

The AC method is considered accurate and outperforms traditional classification methods [1]. However, most classification methods have been applied to English and European languages, and only a few recent works have dealt with the Arabic text classification problem. For instance, Al-Radaideh et al. [11] use association rule mining as the learning method to build their Arabic text classifier; in this study the Apriori algorithm is used to generate the Arabic text classification association rules. The ordered decision list, weighted rules, and majority voting methods for predicting a test document's category are tested and compared with regard to classification accuracy. The experimental results on a collection of Arabic text documents show that the majority voting method outperforms the other prediction methods, and that association rule mining is a good method for building an Arabic text categorizer. Al-Harbi et al. [12] evaluate two classification methods (SVM and the C5.0 decision tree) on seven different Arabic corpora. The reported classification accuracy over the seven corpora is 68.65% for SVM and 78.42% for C5.0; in their study, the C5.0 algorithm outperforms the SVM algorithm on all seven Arabic corpora. El-Halees [13] presents an Arabic text classifier based on maximum entropy; the results show that text preprocessing techniques increase the average F-measure by about 12.28%, from 68.13% to 80.41%. Thabtah et al. [14] apply k-nearest neighbour with three variations of the vector space model (Cosine, Jaccard, and Dice coefficients) on an Arabic newspapers data set. The F1 results show that the Jaccard and Dice coefficients outperform the Cosine coefficient method; Dice and Jaccard, which are based on TF-IDF, reach 94.91%, the highest average F1 measure.

Al-Kabi & Al-Sinjilawi [15], Jbara [16], and Harrag & Al-Qawasmah [17] apply text classification methods to mine and discover knowledge from the Al-Hadith data set (sayings of the prophet Mohammed, peace and blessings of Allah be upon him). Al-Kabi compares different variations of the vector space model with a Naïve Bayesian classifier, which outperforms the other methods. Jbara uses a stem expansion classification method together with a similarity based method; this achieves better classification than the two other compared methods (the Al-Kabi 2007 and word based classification methods). Harrag uses a neural network (NN) classifier with singular value decomposition (SVD) as the feature selection method; the results show that SVD is effective in reducing the feature space and that the overall precision, recall, and F measures are more stable with SVD and NN, although the overall averages are not very high.

Omer & Shilong [18] use a stemming algorithm with keyword matching to classify 400 Arabic text documents distributed over four classes. The average recall for the four categories ranged from 88% to 99%, while the average precision ranged from 86% to 100%. Noaman et al. [19] apply Naïve Bayes to classify 300 Arabic text documents belonging to 10 categories; the reported accuracy is about 62%. Mesleh & Kanaan [20] use ant colony optimization (ACO) based feature subset selection with a support vector machine (SVM) to classify Arabic news articles. The experimental results on an online Arabic newspaper corpus containing 1445 articles show that ACO achieves better performance with SVM than six other compared FS methods (Chi-square, NGL, Odds Ratio, Information Gain, GSS, and Mutual Information).

III. ASSOCIATIVE CLASSIFICATION APPROACH

The overall steps of our associative classifier are shown in Figure 1. The data collection is divided into two data sets: a training data set used for model construction, and a testing data set used to validate the associative classifier. The following subsections describe the construction and validation processes.

A. Arabic Text Preprocessing

In this phase the Arabic text documents pass through preprocessing steps, which include removal of non-Arabic letters, digits, punctuation marks, and Arabic stop words. In addition, the stemming process is also conducted in this phase.

B. Feature Selection Process

Feature selection (FS) means selecting the optimal or most relevant subset of features from the whole feature set. It deals with data reduction: the process reduces a high dimensional feature space to a small dimension that represents the best features of the data. FS is an important preprocess in building text classification systems; a good choice of features enhances classification accuracy and minimizes classification errors. In this paper, we use a TF-IDF based feature selection method as described in [16]. In our approach, each term in a given document is

assigned a weight value based on the term frequency-inverse document frequency (TF-IDF) weighting method, and the documents are represented as vectors of weighted terms with their class labels. When all terms in a given document have been assigned weights, these values are used to select the important features of the documents; those features are considered discriminative features (terms or words) between the different classes, and they are passed to the association rule generation process.

C. Associative Classifier Construction

The class association rules are generated in this phase of our methodology. We employ an Apriori algorithm similar to that used in [6] to discover, from the preprocessed supervised training data set, all frequent terms that form rule bodies. Two measures are used to determine frequent items and frequent class association rules: support and confidence. The discovered rules are ordered and pruned in this phase, and then the associative classification model is constructed; these steps are called post-mining of class association rules [5].
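The TF-IDF weighting used in the feature selection step above can be sketched as follows. This is a minimal illustration with our own function and variable names; the exact weighting variant and selection thresholds of [16] may differ:

```python
import math
from collections import Counter

def tfidf_weights(documents):
    """Return one {term: weight} dict per document.

    documents: list of token lists (the output of preprocessing/stemming).
    Weight = normalized term frequency * log of inverse document frequency.
    """
    n = len(documents)
    df = Counter()                      # in how many documents each term appears
    for doc in documents:
        df.update(set(doc))
    weighted = []
    for doc in documents:
        tf = Counter(doc)
        weighted.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return weighted

docs = [["goal", "match", "goal"], ["vote", "poll"], ["match", "vote"]]
weights = tfidf_weights(docs)
# "goal" occurs in fewer documents than "match", so it is weighted higher in docs[0]
```

The highest-weighted terms of each document would then be retained as the discriminative features passed on to rule generation.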

[Figure 1 diagram: training documents pass through text pre-processing (tokenization, stop word removal, stemming) and TF-IDF based feature selection; class association rule discovery (frequent term generation, then class association rule generation) is followed by rule ordering and pruning to form the associative classifier, a rule list CR1: T1→c1, CR2: T2→c2, …, CRn: Tn→cn; preprocessed testing documents are then assigned a class by the classifier.]

Figure 1. Associative classifier construction and validation steps.
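The frequent term generation box of Figure 1 (detailed in the next section) can be sketched as a level-wise Apriori search. This is an illustrative sketch under our own naming; per-class mining and the exact candidate pruning details are simplified:

```python
from itertools import combinations

def frequent_termsets(docs, min_support):
    """Level-wise (Apriori) search for frequent term sets.

    docs: list of term sets, one per training document of the observed class.
    min_support: minimum fraction of documents a term set must appear in.
    """
    def support(ts):
        return sum(1 for d in docs if ts <= d) / len(docs)

    terms = sorted({t for d in docs for t in d})
    current = [frozenset([t]) for t in terms if support(frozenset([t])) >= min_support]
    frequent = list(current)
    k = 2
    while current:
        # join frequent (k-1)-sets into k-set candidates, keep those still frequent
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == k}
        current = [c for c in candidates if support(c) >= min_support]
        frequent.extend(current)
        k += 1
    return frequent

docs = [{"goal", "match"}, {"goal", "match", "team"}, {"vote", "poll"}]
freq = frequent_termsets(docs, 0.5)   # {goal}, {match}, and {goal, match} survive
```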

1) Frequent Class Terms Generation: The frequent terms for each class are generated during the frequent term generation process. The term sets selected in the feature selection step, which represent the training data set together with its predefined classes, are used as input to the Apriori algorithm. The frequent class terms are measured using the term set support value, which is the number of co-occurrences of the terms in the observed data [1]. An iterative search over the training data set is conducted to discover the associated terms that distinguish each category from the others; the Apriori property is used to remove infrequent terms and reduce the search space.

Assume that T = [t1, t2, t3, …, tm] is a set of terms. The first candidate term sets are generated directly, and the frequent 1-term sets that pass minimum support are retained and used to generate the candidate 2-term sets. The frequent 2-term sets are then discovered according to minimum support. The generation of subsequent candidate and frequent m-term sets continues until no further frequent term set can be generated from the training data set [6], [7].

2) Class Association Rule Generation: A class association rule is restricted and constrained in its rule head and rule body: the rules that form the classifier are only those that indicate a category label. These rules take the form Tj → Ci, where Tj is a set of frequent terms that represent documents in the training data set, and Ci is the class associated with these documents. Formally, a class association rule CR is of the form Tj → Ci, where Tj is a set of frequent terms of the form [t1 & t2 & … & tm], called the rule terms, and Ci is the category of the rule, called the rule head. Each rule has a support and a confidence; the rule CR is called a strong and frequent rule if it passes the minimum confidence and minimum support threshold values [1], [6], [7].

The support of a rule CR: Tj → Ci is the percentage of documents containing the rule terms Tj among all documents of the observed class Ci [1]; (1) presents this ratio:

Support(Tj ⇒ Ci) = sup(Tj ⇒ Ci) / N        (1)

where sup(Tj ⇒ Ci) is the number of documents in the data set that match the terms of the rule and are associated with its class, and N is the total number of documents in the class data set.

The expected rule accuracy [21] is used as the rule confidence; it is defined as the conditional probability that the rule head is valid given the rule body. The classes are already associated with each rule and always occur with the rule body. Therefore, the rule accuracy as in (2) is used in this paper:

Rule_accuracy = (Dtot(R) + 1) / (Dtot(R) + Nc)        (2)

where Dtot(R) is the number of documents in the training data set that contain the rule body and Nc is the number of classes.

3) Class Association Rule Ordering and Pruning: The class association rules discovered in the previous step are passed through a rule sorting procedure. In this paper, we order the rules according to their confidence, support, and the number of terms in the rule body, as presented in Fig. 2 [1]. After the rule ordering process, a list of ordered rules is retained. These rules are in fact pruned during the search in the rule mining process according to rule confidence: rules whose confidence is lower than the minimum confidence threshold are eliminated. In addition, rules are pruned by a rule redundancy method when one rule contains another; in other words, if a rule body fully matches, or is a subset of, another rule body and has lower confidence, that rule is eliminated from the list. The two main reasons for pruning are to create an associative classification model from a set of rules with discriminative power for distinguishing categories, and to build the model from a reasonable number of class association rules so that future prediction does not take a long time [3], [6].

D. Prediction of New Document Class

Several methods can be used for classification with a set of class association rules. These methods belong to two categories: single rule prediction and multiple rule prediction [1]. In this paper, we focus on the ordered decision list (single rule prediction) and majority voting (multiple rule prediction) methods [22]. In the ordered decision list method, the first rule that covers a new test document is used for prediction: the new test document is assigned to the class associated with this rule. The majority voting method, by contrast, classifies the new document based on the total number of rules that cover it; all rules that cover the new document (rules whose body partially or fully matches the terms of the new document) are weighted equally. Majority voting is composed of four steps, as follows:

• Search through the rules list.
• Find all rules that cover the test document to be classified.

Procedure: Rule Ordering.
For two given class association rules CR1 and CR2, CR1 is ranked higher than CR2 if:
(A) The confidence of CR1 is higher than the confidence of CR2.
(B) The confidences are equal, but the support of CR1 exceeds the support of CR2.
(C) Both confidences and supports are equal, but CR1 has more terms in its body than CR2.

Figure 2. Procedure of the association rule ordering.
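The rule measures of (1) and (2), the ordering procedure of Fig. 2, and the two prediction methods of Section III.D can be sketched together. This is an illustrative sketch with our own rule representation; for simplicity a rule is taken to cover a document only when its whole body appears in it, whereas the paper's majority voting also counts partial matches:

```python
from collections import Counter

def rule_support(docs, terms, label):
    """Eq. (1): fraction of documents of class `label` whose terms contain the rule body."""
    class_docs = [d for d, c in docs if c == label]
    return sum(1 for d in class_docs if terms <= d) / len(class_docs)

def rule_accuracy(docs, terms, n_classes):
    """Eq. (2): expected accuracy (Dtot(R) + 1) / (Dtot(R) + Nc), used as rule confidence."""
    d_tot = sum(1 for d, _ in docs if terms <= d)   # documents containing the rule body
    return (d_tot + 1) / (d_tot + n_classes)

def order_rules(rules):
    """Fig. 2 ordering: confidence desc, then support desc, then longer rule body first."""
    return sorted(rules, key=lambda r: (-r[2], -r[3], -len(r[0])))

def predict_single(rules, doc_terms):
    """Ordered decision list: the first covering rule assigns the class."""
    for terms, label, _, _ in order_rules(rules):
        if terms <= doc_terms:
            return label
    return None

def predict_majority(rules, doc_terms):
    """Majority voting: all covering rules vote with equal weight."""
    votes = Counter(label for terms, label, _, _ in rules if terms <= doc_terms)
    return votes.most_common(1)[0][0] if votes else None

docs = [({"goal", "match"}, "sport"), ({"goal", "team"}, "sport"),
        ({"vote", "poll"}, "politics"), ({"vote", "match"}, "politics")]
rules = [({"goal"}, "sport", 0.9, 0.2), ({"vote"}, "politics", 0.8, 0.3),
         ({"match"}, "sport", 0.7, 0.1)]   # (body, head, confidence, support)
```

For a test document {goal, match}, the ordered decision list fires the highest-confidence covering rule ({goal} → sport), while majority voting counts the votes of every covering rule.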

• If all retained rules have an identical class, assign the document to that class.
• Otherwise, assign the document to the class favoured by the majority of the retained rules.

IV. EXPERIMENT AND RESULTS

A. Data Set

Arabic news article data sets are used in our experiment. The data set is collected from many sources: most of the data are taken from online Arabic corpora [23], and the rest is collected from the websites of Arabic news channels, such as the Aljazeera website, and from other websites. The data set consists of 5640 documents (about 16.5 MB) of different sizes belonging to seven categories. The data set categories and the number of documents in each category are presented in Table I.

TABLE I. CATEGORIES AND NUMBER OF DOCUMENTS

Category Name            Number of Documents
Culture                  1450
Economy                  1054
Politic                  1098
Sport                    1574
Education                 144
Information technology    144
Health                    176
Total                    5640

B. Results and Evaluation

Classification accuracy is used in this paper as the base measure of our experimental results. It is calculated as in (3) by dividing the number of correctly classified documents by the total number of documents in the testing data set:

Accuracy = TrueC / Total        (3)

where TrueC is the number of test documents that are classified correctly (true classifications) and Total is the total number of test documents.

In our experiments on the Arabic data set described in Section IV.A, about 70% of the data is used for training and the rest for model testing. The data is processed and the important terms that represent the labelled documents are selected as discussed in Section III, and the Apriori algorithm is used to discover association rules for each class. The minimum support and minimum confidence threshold values are selected according to the literature and based on experimental results: we set min-supp to 10% in the first experiment and 5% in the second, with 50%, 70%, and 80% for min-conf. The generated class association rules are ordered and pruned using the steps described in Section III.C, and the remaining rules form the Arabic text associative classifier, represented as a set of class association rules. Finally, the ordered decision list and majority voting methods (Section III.D) are used to predict the classes of the test documents in the testing data set.

Table II shows the number of class association rules generated from our data set, the classification accuracy of the two prediction methods used in this paper, and the training and testing times of our associative classifier. The results were obtained with a minimum support of 10% and varying rule confidences. Analyzing Table II, we find that the classification accuracy increases for both prediction methods when the rule confidence is high. Majority voting and the ordered decision list reach peak accuracies of 84% and 81% respectively when the rule confidence is 80% with 50 prediction rules. The majority voting method outperforms the ordered decision list method in most experiments. The combined execution time for training and testing does not exceed 6 minutes in any experiment, and most of this time is taken by model training.

Additional experiments were conducted to test the performance of the associative classifier with a large number of Arabic class association rules. Table III presents the results when the minimum support is 5%. We find that the number of class association rules increases to roughly six times the number obtained with a minimum support of 10%. The large number of prediction rules affects the classification accuracy, which indicates that an associative classifier with a reasonable number of rules performs better than one with a large number of rules.

Figure 3 depicts the classification accuracy reported in Table II and Table III for both prediction methods (majority voting and ordered decision list). According to this figure, the classification accuracies of the two prediction methods are not far from each other. The ordered decision list method produces better accuracy than majority voting when the minimum confidence is 50%, but majority voting is superior to the ordered decision list at rule confidences of 70% and 80%. In general, the results are good and satisfactory for Arabic text of the size used in this study. To the best of our knowledge, most of the previous work on Arabic text classification used a small Arabic

TABLE II. RESULTS OF ASSOCIATIVE CLASSIFIER FOR THE ARABIC TEXT DATA SET (MINIMUM SUPPORT = 10%)

Confidence  No. of Rules  Training Time (m)  Accuracy MV  Accuracy ODL  Testing Time MV (m)  Testing Time ODL (m)
50%         95            3.26               0.771        0.787         1.39                 1.22
70%         82            3.19               0.803        0.792         1.27                 1.38
80%         50            3.17               0.846        0.812         1.32                 1.35

(MV = majority voting; ODL = ordered decision list)

TABLE III. RESULTS OF ASSOCIATIVE CLASSIFIER WITH LARGE NUMBER OF CLASSIFICATION RULES (MINIMUM SUPPORT = 5%)

Confidence  No. of Rules  Training Time (m)  Accuracy MV  Accuracy ODL  Testing Time MV (m)  Testing Time ODL (m)
50%         633           3.54               0.719        0.750         1.40                 1.43
70%         359           3.50               0.801        0.733         1.22                 1.38
80%         345           4.00               0.797        0.749         1.35                 1.32

(MV = majority voting; ODL = ordered decision list)

corpus for experiments. In our experiment, by contrast, the number of test documents used for AC model testing exceeds the number used for both training and testing in some previous studies.

Figure 4 shows the number of classification rules derived from our Arabic data set under different association rule discovery parameters, i.e. minimum confidence and minimum support. As shown in the figure, the number of classification rules at 5% minimum support decreases from about 633 rules at 50% rule confidence to 359 rules at 70% rule confidence. At 10% minimum support there is little difference between 50% and 70% rule confidence: only 13 rules. The number of classification rules is the same at 50% and 70% rule confidence when the minimum support is 15% or 20%, which produce 33 and 13 rules respectively. However, at 15% and 20% minimum support the culture category appears without any rule, which means this category does not have many correlated terms.

Figure 4. Number of classification rules with 50% and 70% rule confidence, minimum support varying from 5% to 20%.

As shown in the experimental results, the AC model achieves an acceptable accuracy and classifies test documents in reasonable time. This indicates that our model is constructed from association rules that are strongly related to their predefined classes, and that most of these rules have the discriminative power to distinguish each class from the others. In addition, the number of class association rules that form the AC model is not large and the prediction methods are simple, with few computations; therefore, the search space for prediction is limited and the prediction methods do not need a long time to classify test documents.

V. CONCLUSION AND FUTURE WORKS

In this paper, we investigate the use of the associative classification approach for the Arabic text classification problem. The associative classification method integrates association rule discovery with the classification task to build a rule based classification model for prediction. The results obtained in this paper on an Arabic text data set show that AC is a good classification model which can classify a text data set with a reasonable number of understandable classification rules, and the model produces an acceptable accuracy when classifying the Arabic text data set. In future work, we intend to enhance the Arabic text associative classifier by examining other feature selection methods, and we plan to balance the classes used in this paper. In addition, we intend to compare the method presented in this paper with other classification methods.

Figure 3. Associative classification accuracy, minimum confidence varying from 50% to 80%.

REFERENCES
[1] F. Thabtah, "A Review of associative classification mining," Knowledge Engineering Review, 22(1), 2007, pp. 37-65, ISSN 0269-8889.
[2] L. Khreisat, "A Machine learning approach for Arabic text classification using N-Gram frequency statistics," Journal of Informetrics, 3, 2009, pp. 72-77.
[3] F. Thabtah, W. Hadi, H. Abu-Mansour, and L. McCluskey, "A New rule pruning text categorization method," 7th International Multi-Conference on Systems, Signals and Devices, 2010.
[4] Z. Su, W. Song, D. Meng, and J. Li, "A New associative classifier for text categorization," Proceedings of the 3rd International Conference on Intelligent System and Knowledge Engineering, 2008.

[5] A. Abu Bakar, Z. Othman, S. Nizam, and R. Ismail, "Development of knowledge model for insurance product decision using the associative classification approach," The International Conference on Intelligent Systems Design and Applications (ISDA10), Cairo, Egypt, 29 Nov.-1 Dec. 2010, pp. 1481-1486.
[6] M. Antonie and O. Zaiane, "Text document categorization by term association," IEEE International Conference on Data Mining (ICDM'02), Maebashi City, Japan, 2002, pp. 19-26.
[7] V. Srividhya and R. Anitha, "Text documents classification by associating terms with text categories," Applications of Soft Computing, AISC, vol. 58, 2009, pp. 223-231.
[8] S. Thaicharoen, “Text association mining with Cross-Sentence
inference,Structure-based document model and multi-relational
text mining,” Ph.D thesis, University of Colorado Denver, 2009.
[9] D. Chiang, H. Keh, H. Huang, and D. Chyr, “The Chinese text
categorization system with association rule and category priority”,
Expert Systems with Applications 35, 2008, pp. 102–110.
[10] A. Abu Bakar, Z. Othman, and L. Fai, “Development of
Knowledge model for Nutrition Diet planning decision using
associative classification approach,” Proceeding of 2nd Malaysian
joint conference on Artificial Intelligence. Malaysia, 27-29. July
2010, pp. 85-93.
[11] Q. Al_Radaideh, E. Al_Shawakfeh, A. Ghareb, and H. Abu Salem,
“An Approach for Arabic text categorization using association rule
mining,” International Journal of Computer processing of
Languages (IJCPOL), 2011, pp. 81-106.
[12] S. Al-Harbi, A. Almuhareb, A. Al-Thubaity, M. Khorsheed, and A.
Al-Rajeh, “Automatic Arabic text classification,” JADT: 9es
Journées Internationales d’Analyse Statistique des Données
Textuelles. 2008, pp. 77-83.
[13] A. El-Halees, "Arabic text classification using maximum entropy,"
The Islamic University Journal (Series of natural studies and
engineering), vol. 15, no.1, 2007, pp. 157-167.
[14] F. Thabtah, W. Hadi, and G. Al-shammare, “VSMs with K-
Nearest Neighbour to categorise Arabic text data,” Proceeding of
the world congress on engineering and computer science, San
Francisco, USA, October 2008, pp.778-781.
[15] M. Al-Kabi and S. Al-Sinjilawi, “A Comparative study of the
efficiency of different measures to classify Arabic text,”
University of Sharjah Journal of Pure and Applied Sciences, 4(2),
2007, pp. 13-26.
[16] K. Jbara, “Knowledge discovery in Al-Hadith using text
classification algorithm,” Journal of American Science, 2010 (11),
pp. 409-419.
[17] F. Harrag, and E. Al-Qawasmah, “Improving Arabic text
categorization using neural network with SVD,” Journal of Digital
Information Management, Vol. 8, no. 4, August 2010 , pp. 233-
239.
[18] M. Omer, and M. Shilong, “Stemming algorithm to classify Arabic
documents,” Journal of Communication and Computer, ISSN
1548-7709, USA, vol. 7, no.9 (serial no.70), 2010.
[19] H. Noaman, S. Elmougy, A. Ghoneim, and T. Hamza, “Naive
Bayes classifier based Arabic document categorization,” In
Proceedings of the 7th international conference in Informatics and
Systems (INFOS 2010), Cairo, Egypt, March. 28-30, 2010.
[20] A. Mesleh, and G. Kanaan, “Support Vector Machine text
classification system: using Ant Colony Optimization based
feature subset selection,” In Proceeding of the 2008 International
conference on Computer Engineering and Systems (ICCES’08),
Ain Shams University, Cairo, Egypt, Nov. 25-27, 2008, pp. 143-
148.
[21] X. Yin, and J. Han, “CPAR: Classification based on predictive
association rules,” Proceeding of the SIAM international
conference on data mining, San Francisco, CA: SIAM Press,2003,
pp. 331-335.
[22] S. Mutter, “Classification using Association Rules,” Master thesis,
Department of Computer Science, University of Freiburg,
Germany, 2004.
[23] Online Arabic corpora: http://sourceforge.net/projects/arabiccorpus/ and http://www.comp.leeds.ac.uk/eric/latifa/research.htm

