2nd Reading

November 1, 2018 9:10:26am WSPC/188-JIKM 1850043 ISSN: 0219-6492

Journal of Information & Knowledge Management


Vol. 17, No. 4 (2018) 1850043 (26 pages)
© World Scientific Publishing Co.
DOI: 10.1142/S0219649218500430

A New Associative Classification Algorithm for Predicting Groundwater Locations

Faisal Aburub
Department of Management Information Systems
Faculty of Administrative and Financial Sciences
University of Petra, P. O. Box 961343
Amman, Jordan
faburub@uop.edu.jo
J. Info. Know. Mgmt. Downloaded from www.worldscientific.com

Wa'el Hadi*
Department of Computer Information Systems
Faculty of Information Technology
University of Petra, P. O. Box 961343
Amman, Jordan
whadi@uop.edu.jo

Published 2 November 2018

Abstract. In this paper, we study the problem of predicting new locations of groundwater in Jordan
through the application of a proposed new method, Groundwater Prediction using Associative
Classification (GwPAC). We identify features that differentiate locations of groundwater wells according to
whether or not they contain water. In addition, we survey intelligent-based methods related to
groundwater exploration and management. Three experimental analyses were conducted to
evaluate the capability of data mining algorithms using real groundwater data from the Ministry of Water
and Irrigation. In the first experiment, we compared the performance of GwPAC against three
well-known associative classification algorithms, namely CBA, CMAR and FACA. In the second experiment,
three rule-based algorithms (C4.5, Random Forest and PBC4cip) were investigated. In the third
experiment, so as to generalise the capability of using data mining for solving the groundwater detection
problem, four benchmark algorithms (SVMs, NB, KNN and ANNs) were evaluated.
Across all the experiments, the results indicated that all considered data mining algorithms
predict locations of groundwater with an acceptable classification rate (all classification accuracies > 79%),
and can be useful methods when seeking to address the problem of exploring new groundwater locations.

Keywords: Data mining; groundwater detection; associative classification; classification.

1. Introduction
Jordan is recognised as one of the poorest countries in the world in terms of water
resources. According to the World Health Organization (WHO) (2010), "Jordan has
one of the lowest levels of water resource availability, per capita, in the world". This
poses a serious challenge, which threatens all sectors that depend on the availability

* Corresponding author.

of water, such as industry, agriculture and households. Furthermore, climate change
and the shortage of rainfall have affected water resources in Jordan, which are
already considered scarce in relation to the continuous high demand for water.
Forced Syrian immigration has also contributed to the increasing water demands
in Jordan. These factors have led to a per-capita reduction of water resources
for Jordanians from 3,600 m³ in 1946 to less than 123 m³ in 2014. This means that,
per capita, water resources for Jordanians are less than 12% of the international
water poverty level (MWI, 2015).
The main water resources in Jordan are surface water and groundwater. Surface
water resources are found in 15 main basins, whilst groundwater resources are spread
across 12 primary basins. The groundwater basins are Azraq, Hammad, North Jordan
Valley, Sirhan, Dead Sea, Jafer, North Wadi Araba, Yarmouk, Disi, Jordan Valley,
South Wadi Araba and Zarqa. According to the Jordanian Ministry of Water and
Irrigation (JMWI), the number of working wells is approximately 3,000; there are
also many illegal wells that exploit groundwater to the maximum level.
Rainfall is the only water supply resource for groundwater aquifers in Jordan.
According to the Jordanian Ministry of Water and Irrigation, the quantity of
renewable water resources for different purposes is around 750 MCM (million cubic metres).
One approach that can decrease the cost of extracting groundwater is to use data
mining techniques to predict groundwater areas. This paper aims to develop a new
approach based on data mining methods in order to predict new groundwater sites,
using the areas of Azraq, Zarqa and Mafraq in Jordan as a case study.
Classification using association, also referred to as Associative Classification
(AC), is a study area in data mining that integrates both association rule
discovery (unsupervised learning) and classification (supervised learning) tasks. In fact,
AC uses association rule discovery tasks to find the knowledge and then chooses a
subset on which to build the classifier (Thabtah et al., 2011). The main goal for AC is
to construct a classifier, also known as a model, that consists of a number of
rules (knowledge) learned from labelled input data, referred to as the training dataset, in
order to predict the class value for a test data instance as accurately as possible
(Hadi, 2015). In this paper, the labelled input data consists of features related to
groundwater wells already dug in different regions of Jordan. In Sec. 5 of this paper,
we provide details of the data collected, which relates to 900 groundwater wells
already dug in three different governorates of Jordan, namely Zarqa, Al-Mafraq and
Jerash.
In the last few years, a number of AC algorithms have been developed, such as
FACA (Hadi et al., 2016), ECAR (Hadi, 2015), MCAC (Abdelhamid et al., 2014),
CBC (Deng et al., 2014) and LCA (Thabtah et al., 2010). These studies have shown
that the AC approach usually produces more accurate classifiers than the classic
classification data mining approaches, such as probabilistic, statistical and decision
tree methods (Quinlan, 1993; Joachims, 1998). However, AC algorithms normally
suffer from the exponential growth of rules; in other words, they derive large
numbers of rules, which makes the resulting models outsized and, consequently, means
that decision makers face difficulties in understanding and manipulating them. To address this,
numerous attempts have been made to reduce the size of classifiers through
the development of rule-pruning procedures that do not affect the accuracy of
classification (García-Borroto et al., 2010; Thabtah et al., 2011). More specifically,
García-Borroto et al. (2010) introduced a rule-pruning algorithm, known as the
filtering strategy, which aims at reducing the number of rules whilst improving
accuracy.
Most of the previous works relating to groundwater have concentrated on the
problems of water quality and resource management, rather than predicting new
groundwater locations. Furthermore, the approaches used in such research are
mainly related to Geographic Information System (GIS) statistical and spatial
analysis tools (Israil et al., 2006; Rahmati et al., 2016) or complex mathematical
models from machine learning, data mining or statistics, such as regression
(Mishra and Dwibedy, 2015), Support Vector Machines (SVMs) (Xu and Valocchi,
2015) and Artificial Neural Networks (ANNs) (Zhu et al., 2011). According to Hadi
et al. (2017, 2018) and Abdelhamid et al. (2012), such approaches gain high
classification accuracy. However, they require extensive knowledge of mathematical
modelling and spatial analysis related to the different types of maps in GIS, and
often produce black-box, complex models that are difficult for end-users to
understand and interpret. Therefore, the AC mining approach, which has been
shown by many scholars to be accurate in prediction, as well as producing
easy-to-interpret models, has the potential to be successful in applications such as
groundwater prediction (Thabtah et al., 2011; Abdelhamid et al., 2014; Alazaidah
et al., 2015; Hadi, 2015). To the best of our knowledge, the AC mining approach has not
previously been explored in the field of groundwater, and we believe that it can be useful in
predicting new groundwater locations.
In this paper, we propose a new AC algorithm, namely Groundwater Prediction
using Associative Classification (GwPAC). First, GwPAC discovers hidden correlations
(rules) between groundwater location features and then cuts down the number of
rules generated during the model-building step by pruning useless and redundant
rules in order to derive moderate-size classifiers. This results in a controllable
number of rules that the end-user can better understand and manipulate. Second,
unlike other AC mining algorithms, such as LCA (Thabtah et al., 2010), which is
known to use a single rule for prediction, the GwPAC algorithm employs a new
prediction method that is designed so that only high-quality rules are used to predict
test instances. This prediction procedure is based on multiple rules rather than a
single rule, so several rules play a role in predicting the
class value for unseen groundwater locations. This may improve the classification
accuracy rate of the resulting models in predicting new locations for groundwater.
Overall, the benefits of this new AC method are twofold. The first benefit concerns
developing a new spatial AC approach to the difficult problem of predicting new
locations of groundwater by examining features obtained from old groundwater
wells. In order to achieve this first goal, we must extend and enhance the performance
of a currently known AC algorithm in three of its main steps (learning, classifier
building and prediction), as well as adding a pre-processing step of feature extraction
from aerial and satellite images. This leads to the second benefit, which is the
development of a new algorithm based on AC that enhances: (1) the rule-learning
procedure; (2) the classifier-building procedure, by developing a new heuristic that
reduces the number of rules by preventing redundant and unnecessary rules from
playing any part in the class assignment process of test cases; and (3) the class
assignment method in the prediction step, by utilising a group of rules for prediction
rather than a single rule, as in LCA.
Through our analysis, we answer the following critical questions:
(1) Is the new spatial AC method appropriate to the problem of exploration of
groundwater locations in Jordan?
(2) What are the most significant features for the prediction of groundwater
location?
(3) Will the new model-building procedure of the spatial AC method reduce the
number of rules discovered without negatively impacting the classification
accuracy of test cases, so that the end-user can easily understand and manipulate
the results?
This paper makes the following contributions:
• Identifies a set of features that will discover new groundwater locations.
• Investigates commonly used AC algorithms on real groundwater data collected
from the Jordanian Ministry of Water and Irrigation. No previous research
has tackled the problem of discovering new groundwater locations using AC
algorithms, and the current study is therefore a pioneering work in this subject.
• Assesses popular data mining algorithms on the same dataset with the aim of
exploring the applicability of their use in predicting groundwater locations.
• Proposes the GwPAC algorithm to enhance the performance of well-known AC
algorithms, especially for predicting new groundwater locations.
• Conducts an extensive experimental study on groundwater data.

Section 2 of this paper sets out the problem statements. Section 3 surveys the recent
AC algorithms, data mining techniques that are commonly used to address the
groundwater problem, and related works in the study area. Section 4 presents the
steps of the proposed algorithm. Section 5 provides an overview of the experiments, a
description of the groundwater data, the evaluation measures, the compared
algorithms, the results and the discussion of the results. Finally, Sec. 6 presents the
conclusion.

2. Problem Statements
An AC algorithm generally operates in three main phases. During the first phase, it
extracts the hidden relationships between the feature values and the class feature
values in the input data and represents them as "IF–THEN" rules (Hadi, 2015).
Once all rules have been extracted, the ranking and pruning procedures (Phase 2)
begin. The ranking procedure ranks rules in line with their confidence values or,
where two or more rules have the same confidence value, by their support values. In
the pruning procedure, useless and conflicting rules are removed, and the
rules that remain represent the AC model. Finally, in the third phase, the AC model is evaluated on
new data, referred to as testing data, to investigate its performance in predicting the
class of new test instances. The result of the final phase is the classification accuracy
rate of the AC algorithm.
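
The ranking procedure of Phase 2 can be illustrated with a minimal Python sketch. The `Rule` container, the `rank_rules` helper and the toy support and confidence values are hypothetical and introduced only for illustration; they are not taken from any of the cited implementations.

```python
from typing import FrozenSet, List, NamedTuple


class Rule(NamedTuple):
    """Hypothetical container for one IF-THEN rule (illustration only)."""
    body: FrozenSet[str]   # feature values in the IF part
    label: str             # class value in the THEN part
    support: float
    confidence: float


def rank_rules(rules: List[Rule]) -> List[Rule]:
    """Order rules by confidence (descending), breaking ties by support (descending)."""
    return sorted(rules, key=lambda r: (-r.confidence, -r.support))


# Toy rules with made-up support and confidence values.
toy_rules = [
    Rule(frozenset({"Faults=No"}), "No", 0.20, 0.80),
    Rule(frozenset({"Faults=Yes"}), "Yes", 0.30, 0.90),
    Rule(frozenset({"Rainfall=Semi-Dry"}), "Yes", 0.40, 0.90),
]

ranked = rank_rules(toy_rules)
for r in ranked:
    print(sorted(r.body), "->", r.label, r.confidence, r.support)
```

In this toy case, the two rules with confidence 0.90 are ordered by their support, so the rule with support 0.40 is ranked first and the rule with confidence 0.80 is ranked last.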
Our proposed algorithm assumes that the input groundwater data is a normal
relational table containing 900 instances described by seven distinct features,
excluding the class feature. Each of these 900 instances belongs to one of two known
classes ("Yes" and "No").
Formally, our proposed algorithm is defined as follows: Let $D$ be the input data;
let $I$ be the set of all feature values in $D$ and $Y$ be the set of class values (labels). An
instance $d \in D$ is said to contain $X$, a set of feature values in $I$, if $X \subseteq d$. The following
are the main terms that we study in this paper.
Term 1 (Support). The percentage of instances in $D$ that contain the feature
value.
Term 2 (Frequent features). A feature value $f \in I$ is said to be frequent if
its support is greater than or equal to a specified minimum support
constraint.
Term 3 (K-itemset). An itemset that contains $K$ items (features).
Term 4 (A class association rule (CAR)). An association of the form $X \rightarrow y$,
where $X \subseteq I$ and $y \in Y$.
Term 5 (Rule support). The support of rule $X \rightarrow y$ is the percentage of instances
in $D$ in which $X$ and $y$ hold together, $X \cup y$.
Term 6 (Rule confidence). The confidence of the rule $X \rightarrow y$ is the ratio of the number of
instances in $D$ that contain both $X$ and $y$ to the number that contain $X$, and is defined as follows:

$$\mathrm{Confidence}(X \rightarrow y) = \frac{\mathrm{Support}(X \cup y)}{\mathrm{Support}(X)}. \tag{1}$$

Our aims are as follows: (1) to extract all CARs that have supports and confidences
greater than or equal to specified minimum support and minimum confidence
constraints, respectively; and (2) to build a model for predicting new groundwater
locations from CARs.
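
As a small worked illustration of Terms 1, 5 and 6, the sketch below computes support, rule support and confidence on a toy table. The five instances and the helper names (`support`, `rule_support`, `confidence`) are hypothetical; the feature values are borrowed from the example used later in the paper, and the data do not come from the real groundwater dataset.

```python
from typing import FrozenSet, List, Tuple

Instance = Tuple[FrozenSet[str], str]  # (feature values, class label)

# Hypothetical toy data: five wells, two features, class "Yes"/"No".
D: List[Instance] = [
    (frozenset({"Rainfall=Semi-Dry", "Faults=Yes"}), "Yes"),
    (frozenset({"Rainfall=Semi-Dry", "Faults=Yes"}), "Yes"),
    (frozenset({"Rainfall=Semi-Dry", "Faults=No"}), "No"),
    (frozenset({"Rainfall=Dry", "Faults=Yes"}), "Yes"),
    (frozenset({"Rainfall=Dry", "Faults=No"}), "No"),
]


def support(data: List[Instance], body: FrozenSet[str]) -> float:
    """Fraction of instances whose feature values contain every item in body."""
    return sum(1 for feats, _ in data if body <= feats) / len(data)


def rule_support(data: List[Instance], body: FrozenSet[str], label: str) -> float:
    """Fraction of instances that contain body and carry the class label."""
    return sum(1 for feats, y in data if body <= feats and y == label) / len(data)


def confidence(data: List[Instance], body: FrozenSet[str], label: str) -> float:
    """Eq. (1): rule support divided by body support (0.0 if body never occurs)."""
    s = support(data, body)
    return rule_support(data, body, label) / s if s > 0 else 0.0


X = frozenset({"Rainfall=Semi-Dry"})
print(support(D, X), rule_support(D, X, "Yes"), confidence(D, X, "Yes"))
```

Here the body occurs in three of the five instances (support 0.6), two of which are labelled "Yes" (rule support 0.4), so the confidence of the rule is 0.4 / 0.6, or about 0.67.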

3. Related Works
In the following subsections, we present a more detailed overview of the related
works from the two domains with which our research is concerned: groundwater
prediction and AC.

3.1. Study area


The purpose of this study is to develop a new approach to predicting groundwater
wells using the Jerash, Zarqa and Mafraq governorates as case studies. Figure 1
illustrates the 900 groundwater wells already dug in the governorates that are
investigated in this study.
Zarqa governorate is located approximately 25 km east of Amman, the capital
of Jordan. It is the third largest governorate in terms of population. The main cities
of the governorate are the capital, Zarqa, and Alrusseifa. The governorate's borders
are as follows: Amman governorate in the south and southwest; Mafraq governorate
in the north; and Balqa and Jerash governorates in the west.
Mafraq governorate is the second largest governorate, by area, in Jordan.
It is located northeast of Amman, and its capital is Mafraq. The climate in the
governorate is dry for most of the year. Mafraq borders Saudi Arabia in the
south, Syria in the north, Iraq in the east and Jerash governorate in the west.
Jerash governorate is located to the north of Amman, and is the smallest
governorate in Jordan in terms of area. Its capital is Jerash. The governorate's lands
are hilly and fertile. Its borders are with Irbid in the north, Ajloun governorate in the
west, Mafraq governorate in the east and Zarqa governorate in the south.
The main basins that cross these governorates are Azraq Basin and Zarqa Basin.
Azraq Basin is considered the main supplier of drinking water for the city of Amman.
This basin is located in the northeast of Jordan; more specifically, it is the area that
lies between 205 and 400 East and between 55 and 230 North, according to the
Palestine Grid. The town of Azraq is located in the middle of the basin,
approximately 100 km east of Amman. Azraq Basin is the largest desert basin in Jordan,
with an area of about 12,710 km² (El-Naqa, 2010). The area of Zarqa Basin is around
4,120 km², with 95% of its area within Jordan and 5% within Syria.

Fig. 1. Distribution of groundwater wells in the study area.

As there is a water shortage in Jordan, many studies have been carried out
to discuss this problem and seek a feasible solution. Hadadin et al. (2010) considered
the Jordanian government's claim that the average annual water share of the
Jordanian people is very small in comparison with those of their neighbours, such as the
people of Egypt and Turkey. Therefore, they suggested a set of sustainable solutions,
namely the desalination of seawater and/or brackish water, importation of water from
Turkey, processing of wastewater and its reuse, particularly in agriculture, and the
reduction of water demand by using technology and awareness programmes/initiatives.
Mohsen (2007) stated that groundwater is a major water resource in Jordan, and
therefore suggested several strategies to protect this important resource. These
strategies focussed on water harvesting, wastewater recycling, water imports, sup-
plying the Dead Sea with ocean water, the desalination of seawater and brackish
water, and the reallocation of water resources based on sector and use.
Abdulla et al. (2000) developed a model based on MODFLOW to simulate the
level change in complex aquifer systems, using Azraq Basin as a case study.
MODFLOW is a three-dimensional groundwater flow model that aims at predicting
the level of aquifer water. The results show that, if water pumping continues at its
current rate, the level of water in the well-field areas will fall by approximately 25 m by
2025. In the worst-case scenario, in which water pumping increases to 1.5 times its current rate,
the level of water in the well-field areas will fall by approximately 39 m by 2025.
Salahat et al. (2014) investigated the factors that control the quality of
groundwater in a semi-arid area. They used advanced statistical methods and
hierarchical clustering combined with GIS to predict quality based on three factors: land
use or land cover, aquifer type and soil texture. The results showed that the most
effective factor for predicting pollution of groundwater is land use or land cover,
followed by aquifer type and soil texture.
El-Naqa and Al-Shayeb (2009) indicated that groundwater resources are very
important in Jordan, and that protection plans and management are required in
order to save the groundwater from over-exploitation, which would, in turn, lead to a
decline in the water level. Al-Zyoud et al. (2015) used satellite data to estimate
groundwater over-exploitation in the Amman Zarqa Basin.
A new approach to detecting groundwater sites in Jordan will help to locate
groundwater for a reasonable cost, not only for Jordan itself but for all countries
that suffer from a lack of water.

3.2. Groundwater applications using machine learning


A new data mining approach that can assess the relationships of groundwater
pollution sensitivity was proposed by Yoo et al. (2016). The authors collected
hydrogeological and pollution sensitivity data from the Woosan Industrial Complex
in Korea and identified seven hydrogeological features: net recharge, depth to water,
aquifer media, topography, soil media, hydraulic conductivity and vadose zone
media. The experimental results with four commonly used data mining
algorithms (decision tree, Artificial Neural Network, multinomial logistic regression
and case-based reasoning) showed that the decision tree algorithm produced
higher classification accuracy than the other algorithms. The authors also utilised
the ordinal pairwise partitioning algorithm with the decision tree to increase the
classification accuracy. The proposed model results showed that the soil media, net
recharge and aquifer media were the main hydrogeological features affecting
groundwater sensitivity. Furthermore, the results indicated that the proposed
new algorithm gave more accurate and more consistent estimates of groundwater
pollution than other algorithms.
Rahmati et al. (2016) investigated two popular data mining techniques, namely
Random Forest and Maximum Entropy, on 163 groundwater wells in Mehran
Region, Iran, using data collected from the Iranian Department of Water Resources
Management. The authors identified 10 features that affect the storage of
groundwater: slope aspect, slope percent, altitude, plan curvature, distance from rivers,
drainage density, topographic wetness index, lithology, soil texture and land use.
The area under the receiver operating characteristic (ROC) curve (AUC) was used
to compare the performance of Random Forest with that of Maximum Entropy. The
experimental results indicated that the AUCs for the success rates of Random Forest
and Maximum Entropy were 86.5% and 91%, respectively, while the AUCs for the
prediction rates of Random Forest and Maximum Entropy were 83.1% and 87.7%,
respectively. Thus, the authors concluded that the data mining algorithms were
effective for detecting new groundwater locations.
Karthik and Vijayarekha (2014) introduced Principal Component Analysis
(PCA) with a supervised data mining algorithm called JRIP, which was
implemented to predict the groundwater in various locations of Thanjavur, Ariyalur and
Nagapattinam, the Cauvery Delta Regions of Tamil Nadu in India. The study
aimed to check whether or not the groundwater from these locations is potable. The
experimental results showed that machine learning techniques can be used for faster
classification of water potability on datasets containing 11 chemical and physical
features: electrical conductivity (EC), pH, alkalinity, total alkalinity, TDS and the
levels of calcium, magnesium, sodium, potassium, chloride and sulphates.
Meganathan and Sivaramakrishnan (2013) presented an association rule-mining
algorithm called predictive apriori for generating rules; these rules were tested using
the K* classifier for predicting rain in Cuddalore station on the East Coast of India.
The experimental results indicated that the data mining techniques produce
satisfactory classification accuracy for rain prediction 48 h before the actual occurrence
of the rain. Furthermore, data mining techniques can discover hidden relationships
between various atmospheric features, such as temperature, dew point, wind speed,
visibility and rainfall. The dataset investigated contained 3,039 instances belonging
to two classes: "yes" for the occurrence of rainfall and "no" for the non-occurrence of
rainfall.
A total of 45 groundwater samples were collected by Kolli and Seshadri (2013) at
Tadepalli mandal in India in the period of September–November 2012. The authors
identified a number of parameters to assess the water quality: pH, chlorides,
electrical conductivity, nitrates, sulphates, fluorides, total hardness, alkalinity, total
dissolved solids, potassium, phosphates and sodium. The results indicated that the
association rule-mining approach is useful for managing and monitoring
groundwater pollution in the study area in terms of water quality.

4. The Proposed AC Method


The GwPAC algorithm operates through four procedures: the required
pre-processing procedure, followed by the three procedures of our proposed associative
classifier, namely extracting frequent itemsets, producing classification rules and
predicting a new groundwater location (Fig. 2).

Fig. 2. The methodology of GwPAC.

The details of the GwPAC algorithm procedures are as follows:


• Pre-processing. This step is required if the input data are satellite or aerial images, which
must be processed in order to extract the seven identified feature values used in the mining process.
The details of this step are given in Sec. 5.1.
• Extract frequent itemsets. Let $F$ be the set of all frequent itemsets,
$F = \{ f \mid f \subseteq I \text{ and } \mathrm{support}(f) \geq \text{minimum support} \}$. Since we are interested in
building a classifier, only itemsets that have a class value are considered. GwPAC
employs a fast vertical mining method, named the Diffsets method, which was developed by
Zaki and Gouda (2003). A Diffset is the difference between the transaction set of a
candidate K-itemset and that of its prefix (K − 1)-itemset.
Consider, for example, the single frequent itemsets ⟨"Rainfall", "Semi-Dry"⟩
and ⟨"Faults", "Yes"⟩, and assume that these itemsets occur in the input
data instances (1, 2, 4, 5, 8, 10) and (1, 3, 5, 8, 10), respectively. The Diffset for
⟨"Rainfall", "Semi-Dry"⟩ is {3, 6, 7, 9} and the one for ⟨"Faults", "Yes"⟩ is {2, 4,
6, 7, 9}. These single-itemsets can be utilised to produce the two-itemset
⟨⟨"Rainfall", "Semi-Dry"⟩, ⟨"Faults", "Yes"⟩⟩ by taking the difference of their Diffsets, i.e.
removing the elements of {3, 6, 7, 9} from {2, 4, 6, 7, 9}. The result of this difference is the set {2, 4}.
The support for a candidate K-itemset is calculated by subtracting the
cardinality of its Diffset from the support of its prefix (K − 1)-itemset. The support
for ⟨⟨"Rainfall", "Semi-Dry"⟩, ⟨"Faults", "Yes"⟩⟩ is 4, because the support for
⟨"Rainfall", "Semi-Dry"⟩ is 6 and the cardinality of the Diffset of ⟨⟨"Rainfall", "Semi-Dry"⟩,
⟨"Faults", "Yes"⟩⟩ is 2, thus 6 − 2 = 4. In other words, if we have two
single-itemsets A and B, then Diffset(AB) = Diffset(B) − Diffset(A),
and support(AB) = support(A) − |Diffset(AB)|. Now, if the two-itemset
support produced (4) passes the minimum support, the itemset becomes frequent.
• Produce classification rules from mined itemsets. From the set of frequent itemsets
($F$), find all rules such that the head of the rule is a class value ("Yes" or "No").
Let $R$ be the set of all classification rules,
$R = \{ r \mid r \text{ is of the form } A \rightarrow B, \text{ where } B \text{ is a class value and } \mathrm{confidence}(r) \geq \text{minimum confidence} \}$.
Let us consider that CARs are the rule sets generated from frequent itemsets. Rank
the classification rules in CARs by confidence, then by support and then by generality
(a rule with a smaller number of feature values in its body is ranked
higher). GwPAC then prunes the redundant and useless rules according to the
following process:
- The GwPAC algorithm begins with the first ranked rule and checks it on the
input data. If the rule matches at least one instance from the input data, it is
inserted into the GwPAC model, and all input data instances that match the rule body
and its class value are removed.
Otherwise, the rule is pruned. This procedure is repeated on the remaining
rules until no more instances remain in the input data, or all rules are checked.
The rule-pruning procedure in the GwPAC algorithm is designed to select a
small representative subset of rules that cover the input data, so that only
high-quality rules are inserted into the model of the algorithm, which may increase
the classification accuracy rate.
• Classify a new groundwater location using the set of classification rules (CARs).
The classification judgment is made according to a new scoring method. The
proposed scoring method we consider is as follows:
- Classify a new groundwater location based on the geometric mean of the
confidence values. Let us assume that n rules match a new groundwater
location. Let G be the set of these n rules. Split G into subsets by class value: S1 and
S2 (groundwater data contains two class values). Let k be the number of rules in
Si. For each subset Si, calculate the kth root of the product of its k confidence
values; this is the score that is related to class i. The new location is assigned to
the class with the maximum score. We named this method GmR (i.e.
geometric mean of the confidence values of the rules that match). The
GwPAC algorithm makes the prediction judgment using multiple rules, which
is considered by previous studies on AC approaches to be a benefit, since
multiple rules enhance and improve the prediction judgment (Thabtah et al.,
2011). Finally, in situations when no rules in the GwPAC model are matched
to the test instance (new groundwater location), the default class (majority
class in the input data) will be given to that instance.
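
The rule-pruning and GmR prediction steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Rule` container, the helper names (`build_model`, `gmr_scores`, `predict`, `majority_class`), the toy training instances, the feature value "Soil=Clay" and all support and confidence values are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, FrozenSet, List, NamedTuple, Tuple

Instance = Tuple[FrozenSet[str], str]  # (feature values, class label)


class Rule(NamedTuple):
    body: FrozenSet[str]
    label: str
    support: float
    confidence: float


def build_model(rules: List[Rule], training: List[Instance]) -> List[Rule]:
    """Keep a ranked rule only if it correctly covers a not-yet-covered instance."""
    # Rank by confidence, then support, then generality (shorter body first).
    ranked = sorted(rules, key=lambda r: (-r.confidence, -r.support, len(r.body)))
    remaining = list(training)
    model: List[Rule] = []
    for rule in ranked:
        if not remaining:
            break  # every training instance is already covered
        uncovered = [(f, y) for f, y in remaining
                     if not (rule.body <= f and rule.label == y)]
        if len(uncovered) < len(remaining):  # the rule covered at least one instance
            model.append(rule)
            remaining = uncovered
    return model


def gmr_scores(model: List[Rule], features: FrozenSet[str]) -> Dict[str, float]:
    """Per class, the geometric mean of the confidences of the matching rules."""
    by_class = defaultdict(list)
    for rule in model:
        if rule.body <= features:
            by_class[rule.label].append(rule.confidence)
    return {lbl: math.prod(c) ** (1.0 / len(c)) for lbl, c in by_class.items()}


def predict(model: List[Rule], features: FrozenSet[str], default_label: str) -> str:
    scores = gmr_scores(model, features)
    if not scores:
        return default_label  # no rule matches: fall back to the majority class
    return max(scores, key=scores.get)


def majority_class(training: List[Instance]) -> str:
    return Counter(y for _, y in training).most_common(1)[0][0]


# Toy training data and candidate rules (all values are made up).
training = [
    (frozenset({"Rainfall=Semi-Dry", "Faults=Yes"}), "Yes"),
    (frozenset({"Rainfall=Semi-Dry", "Faults=Yes"}), "Yes"),
    (frozenset({"Rainfall=Semi-Dry", "Faults=No"}), "No"),
    (frozenset({"Rainfall=Dry", "Faults=No"}), "No"),
    (frozenset({"Rainfall=Dry", "Faults=Yes"}), "Yes"),
]
candidates = [
    Rule(frozenset({"Rainfall=Dry"}), "No", 0.2, 0.5),
    Rule(frozenset({"Rainfall=Semi-Dry"}), "Yes", 0.4, 2 / 3),
    Rule(frozenset({"Faults=No"}), "No", 0.4, 1.0),
    Rule(frozenset({"Faults=Yes"}), "Yes", 0.6, 1.0),
]
model = build_model(candidates, training)
default = majority_class(training)

# A hand-made model in which several rules match the same new location.
demo_model = [
    Rule(frozenset({"Rainfall=Semi-Dry"}), "Yes", 0.3, 0.9),
    Rule(frozenset({"Faults=Yes"}), "Yes", 0.2, 0.4),
    Rule(frozenset({"Soil=Clay"}), "No", 0.2, 0.7),
]
new_location = frozenset({"Rainfall=Semi-Dry", "Faults=Yes", "Soil=Clay"})
demo_scores = gmr_scores(demo_model, new_location)
demo_prediction = predict(demo_model, new_location, default)
```

In the toy run, the two rules with confidence 1.0 cover all five training instances, so the two remaining candidates are pruned. In the hand-made model, the "Yes" score is the square root of 0.9 × 0.4, i.e. 0.6, which is lower than the "No" score of 0.7, so the location is assigned to "No" even though the single most confident matching rule predicts "Yes". This shows how a multiple-rule score can differ from a single-rule prediction.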
In summary, our proposed algorithm has several advantages over normal AC
algorithms. These advantages are as follows:
(1) The GwPAC algorithm uses multiple rules to predict test cases. This may
enhance the classification accuracy of the resulting models in predicting new test
cases. In contrast, most of the current AC algorithms use the single rule
with the highest confidence to predict test cases. The FACA algorithm (Hadi et al.,
2016) predicts the class with the highest number of rules matching a test case, so the FACA
prediction method is sensitive to the majority class.
(2) The GwPAC algorithm proposes a GmR prediction procedure that uses both
confidence and support constraints to evaluate the rules, unlike other AC
algorithms, which use only the confidence constraint to evaluate the rules.
(3) The GwPAC algorithm extracts new hidden rules that current associative classification
algorithms are unable to extract. These rules might play a substantial role in
a decision-making process, especially in real-life applications such as medical
diagnosis, weather forecasting and groundwater detection.

5. Experiments and Discussion


In this section, three sets of experiments are used to explore the applicability of
data mining classification algorithms to establishing groundwater locations. In
the first experiment, three AC algorithms, namely CBA (Liu et al., 1998), CMAR
(Li et al., 2001) and FACA (Hadi et al., 2016), are used to investigate the performance
of the GwPAC algorithm on predicting groundwater locations. These algorithms
were selected because they use similar learning methodologies, which allows
a fair investigation, and because their implementations are publicly
available.
To discover AC rules, we first pre-process the features, as mentioned in Sec. 4,
and then capture the rules with minimum support = 0.05 and minimum confidence =
0.50. We also investigate the sensitivity to the minimum support and confidence.
When evaluating the performance of GwPAC, we use the standard classification
accuracy, number of rules, learning time and F1 measures. Classification accuracy is
calculated by dividing the number of correctly predicted groundwater locations
by the total number of groundwater locations in the testing data. The F1 measure is
the harmonic mean (weighted average) of recall and precision, and is computed as
follows:
$$F_1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{recall} + \text{precision}}, \tag{2}$$
where recall for a class is the ratio of the correct predictions for that class to the total
number of locations that actually belong to it, and precision is the ratio of the correct
predictions for that class to the total number of locations that the system assigned to it.
Let us illustrate the performance measures using an example in which a classification
system is used to predict groundwater locations. The sample has 18
groundwater locations, where 10 locations are labelled "Yes" and eight are labelled
"No".
For the 10 locations labelled "Yes", the classification system predicted seven as
"Yes" and three as "No", and for the eight labelled "No", the system predicted six as
"No" and two as "Yes". Precision for class "Yes" = 7/9, recall for class "Yes" = 7/10,
F1 for class "Yes" = 0.737; precision for class "No" = 6/9, recall for class "No" = 6/8,
F1 for class "No" = 0.706.
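The worked example above can be reproduced with a short Python sketch. The function name `precision_recall_f1` and its arguments (true positives, false positives and false negatives for one class) are illustrative choices made here, not part of the original study.

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall and F1 from confusion counts.

    tp: locations of the class that were predicted as the class
    fp: locations of the other class that were predicted as the class
    fn: locations of the class that were predicted as the other class
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (recall + precision)
    return precision, recall, f1


# Class "Yes": 7 correct, 2 "No" wells predicted "Yes", 3 "Yes" wells predicted "No"
p_yes, r_yes, f1_yes = precision_recall_f1(tp=7, fp=2, fn=3)
# Class "No": 6 correct, 3 "Yes" wells predicted "No", 2 "No" wells predicted "Yes"
p_no, r_no, f1_no = precision_recall_f1(tp=6, fp=3, fn=2)
print(round(f1_yes, 3), round(f1_no, 3))  # 0.737 0.706
```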
In the second experiment, we applied C4.5 (Quinlan, 1993), Random Forest (Breiman,
2001) and PBC4cip (Loyola-Gonzalez et al., 2017) to our groundwater
dataset. C4.5 and Random Forest were selected because they are
two well-known, easy-to-understand classifiers (rule-based algorithms) that
exhibit excellent performance in many application contexts. Moreover, we
evaluate the PBC4cip classifier, which is suitable for this investigation for three
reasons:
(1) It is a rule-based algorithm.
(2) It was evaluated on more than 90 datasets, where it significantly outperformed
11 other algorithms for class imbalance problems.
(3) The source code is publicly available.1

1 https://sites.google.com/site/octavioloyola/papers/PBC4cip.


In the last experiment, in an effort to generalise the applicability of data
mining algorithms to predicting groundwater locations, four well-known benchmark
classification algorithms are investigated on the same dataset. These algorithms are
SVMs, ANN, NB and KNN.
The Waikato Environment for Knowledge Analysis (WEKA) tool (Hall et al.,
2009) was used to implement the algorithms considered in our experiments. WEKA
is known as a landmark system in data mining and machine learning. It has achieved
widespread acceptance within academia and business circles, and has become a
widely used tool for data mining research (Hall et al., 2009).
Here, we use 10-fold cross-validation to evaluate the algorithms considered in our
experiments. Experiments are performed on an Intel i7 machine with a 3-GHz processor
and 16 GB of main memory in a Windows 8 environment.
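As a rough illustration of how 10-fold cross-validation partitions the data, the following Python sketch generates the train/test index splits. It is a generic, unstratified version written for this illustration; the function name `k_fold_indices` and the fixed random seed are assumptions, and the sketch is not a reproduction of the WEKA procedure used in the experiments.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every kth shuffled index, starting at position i
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Each instance is used for testing exactly once and for training k - 1 times
for train, test in k_fold_indices(n_samples=900, k=10):
    assert len(train) == 810 and len(test) == 90
```

The reported accuracy of an algorithm is then the average over the k test folds.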

5.1. Groundwater dataset


We use a dataset of 900 groundwater wells already dug in three governorates of
Jordan (Zarqa, Al-Mafraq and Jerash) in our experiments (Aburub and Hadi,
2016).2 The dataset was collected from JMWI, which had already divided it
into two classes: "Yes" for the existence of water (683 instances) and "No"
for the non-existence of water (217 instances). There are numerous features that
distinguish groundwater wells. Based on previous groundwater studies such as those
by Sahoo and Jha (2013), Sujay Raghavendra and Deka (2015) and Johnson et al.
(2017), and after conducting intensive meetings with geologists and groundwater
experts from JMWI, seven features were identified to distinguish groundwater
wells. These are height above sea level (elevation), faults on the Earth's surface,
average rainfall, slope of the Earth's surface, valleys (Wadis), the annual long-term
average of temperatures and geological outcrop. More detailed descriptions of the
seven selected features are given in the following sub-subsection.

5.1.1. The selected features


Here, we explain the features that are used in the experiments and their corresponding
rules.

2 https://drive.google.com/drive/folders/0B0g0LP5sLwQhdnJxLUt2S24xOWM.

Fig. 3. Groundwater features: (a) average rainfall, (b) average temperature, (c) elevation, (d) slope,
(e) faults, (f) valleys and (g) outcrop.

(1) Average rainfall. The long-term average in depth (over space and time) of
annual precipitation in the country. Precipitation is defined as any kind of water
that falls from clouds as a liquid or a solid (Nation Master, 2015). For instance,
the process of extracting a long-term rainfall average for a certain region requires
reading precipitation maps, but preparing such maps requires hydrological experience.
Accordingly, raster maps are used to help us in calculating the numerical
values for each study area. Raster maps can subsequently be used to
provide us with numerical values for each well location, first by creating buffer
polygons around the input features to a specified distance. An optional dissolve
can be performed in the case of overlapping buffer zones by converting polygon
features to a raster dataset. Average rainfall feature values are numerical, as
shown in Fig. 3(a). These values are replaced according to the following rules.

Rule: If Average Rainfall between 0 and 250 → Dry
Else if Average Rainfall between 251 and 300 → Semi-Dry
Else → Wet
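The rainfall rule above amounts to a simple threshold discretisation, which might be sketched in Python as follows. The function name `rainfall_category` is hypothetical, and two assumptions are made because the rule does not state them: the upper bounds are treated as inclusive, and a value that falls between 250 and 251 is assigned to Semi-Dry.

```python
def rainfall_category(average_rainfall):
    """Map a numerical average-rainfall value to its categorical label."""
    if average_rainfall <= 250:   # rule range 0 to 250
        return "Dry"
    if average_rainfall <= 300:   # rule range 251 to 300 (assumed to start just above 250)
        return "Semi-Dry"
    return "Wet"                  # everything else


print([rainfall_category(v) for v in (120, 250, 275, 410)])
# ['Dry', 'Dry', 'Semi-Dry', 'Wet']
```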


(2) Average temperature. The long-term average annual temperature in the
country. To calculate the long-term average annual temperature, we apply the
aforementioned steps and then find suitable numeric values that represent
the temperature for each well location, as shown in Fig. 3(b). These values are
replaced according to the following rules.

Rule: If Average Temperature between 0 and 14 → Mountain
Else if Average Temperature between 15 and 25 → Edge of the Valley
Else if Average Temperature between 26 and 35 → Semi-Desert
Else → Desert

(3) Elevation. The height of a geographic location above or below a fixed reference
point, most commonly a reference geoid, a mathematical model of the
Earth's sea level (Encyclopedia.com, 2009). The elevation values for the study
area vary between 343 m and 1,014 m, as displayed in Fig. 3(c). These values are
replaced according to the following rules.

Rule: If elevation level between 343 and 500 → A
Else if elevation level between 501 and 650 → B
Else if elevation level between 651 and 800 → C
Else if elevation level between 801 and 950 → D
Else → E

(4) Faults on the Earth's surface. A planar fracture or discontinuity in a volume
of rock, across which there has been significant displacement as a result of rock mass
movement. A fault trace is also the line commonly plotted on geological maps to
represent a fault (Brodie et al., 2007). The faults feature values are numerical, as
illustrated in Fig. 3(e). These values are replaced according to the following rules.

Rule: If the value of the faults = 9999 → No
Else → Yes

(5) Slope of the Earth's surface. Calculated by finding the ratio of the "vertical
change" to the "horizontal change" between (any) two distinct points on a line.
Sometimes, the ratio is expressed as a quotient ("rise over run"), giving the same
number for every two distinct points on the same line (Wikipedia, 2002). The
slope feature values are numerical, as shown in Fig. 3(d). These values are
replaced according to the following rules.

Rule: If the value of the slope between 0 and 2.5 → Very easy to dig
Else if the value of the slope between 2.6 and 5 → Easy to dig
Else → Hard to dig


(6) Valleys (Wadis). A depression that is longer than it is wide. The terms U-shaped
and V-shaped are descriptive geographical terms to characterise the form
of valleys. Most valleys belong to one of these two main types, or a mixture of
them, (at least) with respect to the cross-section of the slopes or hillsides
(Wikipedia, 2001). The valleys feature values are numerical, as shown in Fig. 3(f).
These values are replaced according to the following rules.

Rule: If Valleys value between 1 and 1.99 → Rank A
Else if Valleys value between 2 and 2.99 → Rank B
Else if Valleys value between 3 and 3.99 → Rank C
Else if Valleys value between 4 and 5.99 → Rank D
Else if Valleys value between 6 and 14.99 → Rank E
Else if Valleys value between 15 and 29.99 → Rank F
Else if Valleys value between 30 and 49.99 → Rank G
Else if Valleys value between 50 and 99.99 → Rank H
Else if Valleys value between 100 and 149.99 → Rank I
Else if Valleys value between 150 and 199.99 → Rank J
Else if Valleys value between 200 and 299.99 → Rank K
Else if Valleys value between 300 and 499.99 → Rank L
Else if Valleys value between 500 and 999.99 → Rank M
Else if Valleys value between 1000 and 9999.99 → Rank N
Else → Rank O
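With this many intervals, a sorted table of lower bounds and a binary search is a compact alternative to a long if/else chain. The Python sketch below uses the standard-library `bisect` module; the name `valley_rank` is hypothetical, and the intervals are assumed to be half-open (for example, Rank A covers values from 1 up to but not including 2), so that values such as 1.995 are not left unassigned. Values below 1 or at 10000 and above fall through to Rank O, as in the final "Else" branch.

```python
from bisect import bisect_right

# Lower bounds of Ranks A to N, copied from the rule above
LOWER_BOUNDS = [1, 2, 3, 4, 6, 15, 30, 50, 100, 150, 200, 300, 500, 1000]
RANKS = "ABCDEFGHIJKLMN"
UPPER_LIMIT = 10000  # assumption: Rank N ends just below this value


def valley_rank(value):
    """Map a numerical valleys value to its rank label."""
    idx = bisect_right(LOWER_BOUNDS, value) - 1
    if idx < 0 or value >= UPPER_LIMIT:
        return "Rank O"
    return "Rank " + RANKS[idx]


print(valley_rank(1.5), valley_rank(6), valley_rank(999.99), valley_rank(12000))
# Rank A Rank E Rank M Rank O
```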

(7) Geological outcrop. The part of a rock-formation that appears above the
surface of the surrounding land (Howell, 1957). Extracted feature numerical
values [Fig. 3(g)] are replaced according to the following rules.

Rule: If outcrop value = 0 → B2
Else if outcrop value = 1 → B5
Else if outcrop value = 2 → AI
Else if outcrop value = 3 → B3
Else if outcrop value = 4 → Ram
Else if outcrop value = 5 → A7
Else if outcrop value = 6 → K
Else if outcrop value = 7 → Kh
Else if outcrop value = 8 → BSCPX
Else if outcrop value = 9 → B4
Else if outcrop value = 10 → BA
Else if outcrop value = 11 → L/S
Else if outcrop value = 12 → A5–6
Else if outcrop value = 13 → A1–2
Else if outcrop value = 14 → A4
Else if outcrop value = 15 → A3
Else if outcrop value = 16 → Z
Else if outcrop value = 17 → AL
Else if outcrop value = 18 → ALL
Else → SEA
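Because the outcrop rule maps exact integer codes to labels, it can be expressed as a lookup table with a default. The following Python sketch is one possible encoding; the names `OUTCROP_LABELS` and `outcrop_label` are hypothetical, and the labels are copied as they appear in the rule above.

```python
# Integer outcrop code -> label, copied from the rule above
OUTCROP_LABELS = {
    0: "B2", 1: "B5", 2: "AI", 3: "B3", 4: "Ram", 5: "A7", 6: "K",
    7: "Kh", 8: "BSCPX", 9: "B4", 10: "BA", 11: "L/S", 12: "A5–6",
    13: "A1–2", 14: "A4", 15: "A3", 16: "Z", 17: "AL", 18: "ALL",
}


def outcrop_label(value):
    """Return the outcrop label for a code; any other value maps to SEA."""
    return OUTCROP_LABELS.get(value, "SEA")


print(outcrop_label(4), outcrop_label(11), outcrop_label(42))
# Ram L/S SEA
```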

5.2. AC algorithms performance


The classification accuracies (%) of the algorithms considered are shown in Table 1.
The GwPAC algorithm is clearly superior to the FACA, CMAR and CBA algorithms,
and the CBA algorithm is the worst in terms of predicting
groundwater locations. More specifically, the GwPAC algorithm outperformed
FACA, CMAR and CBA by 2.2%, 4.6% and 5.9%, respectively.
The F1 measures of the FACA, CBA, CMAR and GwPAC algorithms are also
shown in Table 1. Table 1 shows that the GwPAC algorithm outperforms the
FACA, CMAR and CBA algorithms: it has 2.1%,
10.8% and 13.4% higher F1 scores than FACA, CMAR and CBA, respectively.
There are two fundamental reasons for the higher classification accuracy rate
achieved by the GwPAC algorithm. First, it uses multiple rules to predict groundwater
locations, unlike the CBA algorithm, which uses only one rule for classification.
It also differs from the CMAR and FACA algorithms: the CMAR algorithm
uses multiple rules to predict cases based on the chi-square method (Li et al., 2001), and
one of the main drawbacks of this prediction method is its bias towards the minority class,
whilst the FACA prediction method is biased towards the majority class.
Another disadvantage of using a single rule is that the highest-confidence rule is
occasionally fruitless, particularly for datasets that have an imbalanced distribution
of classes such as groundwater datasets (Thabtah et al., 2010). Therefore, handling a
small subset of rules for predicting groundwater locations appears to be more
fruitful. The second reason for GwPAC's higher accuracy rate is that our GmR
prediction method uses both confidence and support constraints to evaluate the
rules, unlike other AC algorithms, which use only the confidence constraint to
evaluate the rules.

Table 1. The classification accuracies (%), F1 scores, numbers of rules and learning times of AC algorithms.

AC algorithm    Classification accuracy (%)    F1 score    Number of rules    Learning time (s)
GwPAC           83.8                           0.843       57                 0.24
CMAR            79.2                           0.735       413                0.72
CBA             77.9                           0.709       50                 1.16
FACA            81.1                           0.822       57                 0.24


Fig. 4. The top 10 rules of the CMAR algorithm.

Table 1 summarises the numbers of rules used by the GwPAC, FACA, CBA and
CMAR algorithms on the groundwater dataset. The CMAR algorithm
generates the highest number of rules whilst the CBA algorithm generates the
lowest. Figure 4 shows that the CMAR model contains many redundant
rules that reduce its classification accuracy, such as Rule 3, Rule 5, Rule 7 and Rule
10, whilst the CBA model contains a subset of rules that are not representative of the
groundwater locations in the input data. However, the ranking and
pruning methods implemented in the GwPAC algorithm extract the minimal
number of rules that are representative of the groundwater locations in the dataset; these
methods guarantee that the remaining rules are high-quality rules that enhance the
classification accuracy rate of the GwPAC algorithm. Besides, only two
instances are classified in the default class because no rules cover them; this reflects
the quality of the remaining rules in the GwPAC model.
Furthermore, the GwPAC algorithm outperforms CMAR and CBA in terms of
the learning time measure. In fact, the fast rule discovery method
implemented in the GwPAC algorithm (first step) needs only one scan of the input data
and then performs a simple intersection among the Diffsets of frequent rules of size
N − 1 to discover candidate rules of size N.
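The idea of scanning the data once and then growing rules by set operations can be illustrated with a small vertical-mining sketch in Python. Note the assumptions: the sketch stores plain row-id sets (tidsets) and intersects them, whereas the Diffset representation named in the text stores differences of such sets; class labels and confidence are omitted; and all function names are hypothetical. It is meant only to show why no further data scans are needed, not to reproduce the authors' procedure.

```python
from itertools import combinations


def vertical_layout(rows):
    """One scan of the data: map each single item to the set of row ids containing it."""
    tidsets = {}
    for row_id, row in enumerate(rows):
        for item in row.items():  # item is an (attribute, value) pair
            tidsets.setdefault(frozenset([item]), set()).add(row_id)
    return tidsets


def frequent_only(tidsets, min_count):
    """Keep the itemsets that occur in at least min_count rows."""
    return {items: tids for items, tids in tidsets.items() if len(tids) >= min_count}


def extend(frequent, min_count):
    """Join frequent itemsets of size N - 1 into frequent itemsets of size N."""
    larger = {}
    for (a, tids_a), (b, tids_b) in combinations(frequent.items(), 2):
        candidate = a | b
        if len(candidate) != len(a) + 1:
            continue  # the two itemsets must differ in exactly one item
        if len({attribute for attribute, _ in candidate}) != len(candidate):
            continue  # an attribute cannot take two values in one rule body
        tids = tids_a & tids_b  # support comes from the intersection, not a rescan
        if len(tids) >= min_count:
            larger[candidate] = tids
    return larger


rows = [
    {"rainfall": "Dry", "faults": "Yes"},
    {"rainfall": "Dry", "faults": "No"},
    {"rainfall": "Wet", "faults": "Yes"},
    {"rainfall": "Dry", "faults": "Yes"},
]
size_1 = frequent_only(vertical_layout(rows), min_count=2)
size_2 = extend(size_1, min_count=2)
```

In this toy data, the only frequent itemset of size 2 is {rainfall = Dry, faults = Yes}, supported by rows 0 and 3.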
Finally, another notable result is that all algorithms produce
acceptable classification accuracy rates and F1 scores; this reflects the features'
relevance for the groundwater dataset.

5.2.1. Sensitivity to support and confidence


Here we study the sensitivity of the algorithms considered in our
experiments to variations in the minimum confidence and minimum support. First,
we fix the minimum confidence (0.60) and evaluate different minimum support
values: {0.05, 0.06, 0.07, 0.08, 0.09, 0.10}. We then fix the minimum support
(0.05) and evaluate different minimum confidence values: {0.60, 0.70, 0.80, 0.90}.


Fig. 5. The accuracy rates of GwPAC, CBA, CMAR and FACA as the minimum support changes with
fixed minimum confidence = 0.60.

Fig. 6. The accuracy rates of GwPAC, CBA, CMAR and FACA as the minimum confidence changes
with fixed minimum support = 0.05.

The classification accuracy rates against the minimum confidence or minimum
support for the groundwater dataset are shown in Figs. 5 and 6.
Figure 5 displays the classification accuracy rates produced by all algorithms
on the groundwater dataset as the minimum support changes with a fixed minimum
confidence of 0.60. When the support increases, the classification accuracy rates of the
GwPAC, FACA and CBA algorithms tend to increase,
because most of the rules that are pruned by setting a larger support are not
useful for predicting groundwater locations. However, the decrease in the classification
accuracy rate of the CMAR algorithm is relatively large when the minimum support
is changed from 0.08 to 0.09, because the CMAR algorithm loses a significant number
of useful rules as a result. This reveals that the rules with support greater than 0.08
contribute the main hidden information to building the CMAR model, and
pruning these rules has a harmful impact on the classification accuracy rate.
Finally, Fig. 6 shows the classification accuracy rates produced by all algorithms
as the minimum confidence changes with the minimum support fixed at 0.05. As shown
in the figure, when the minimum confidence threshold is increased from 0.80 to 0.90,
the classification accuracy rate tends to fall for the GwPAC, FACA and CBA
algorithms. The classification accuracy rate decreases for the CMAR algorithm when
the minimum confidence changes from 0.70 to 0.80. These results indicate that a rule
with a confidence value greater than 0.80 is considered a high-quality rule for the
CBA, FACA and GwPAC algorithms. Furthermore, a rule with a confidence value
greater than 0.70 is considered a high-quality rule for CMAR, and pruning such
rules has a negative impact on classification accuracy.

5.2.2. Receiver operating characteristics


An ROC plot is a graphical evaluation method for organising and visualising algorithms
according to their performance (Fawcett, 2006). It is widely used in medical
decision-making, and recently has been increasingly adopted by the data mining and
machine learning communities (Fawcett, 2006; Peterson and Coleman, 2008; Zhang
et al., 2015). The ROC plot represents trade-offs between benefits (true positive
rate) and costs (false positive rate), in which the true positive rate is plotted on the
y-axis and the false positive rate is plotted on the x-axis. The true positive rate (also
called recall) of an algorithm is calculated as follows:
$$\text{True positive rate} = \frac{\text{positives correctly classified}}{\text{total positives}}. \tag{3}$$
The false positive rate of an algorithm is calculated as follows:
$$\text{False positive rate} = \frac{\text{negatives incorrectly classified}}{\text{total negatives}}. \tag{4}$$
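Equations (3) and (4) can be checked against the 18-location example used earlier in this section, taking "Yes" as the positive class. The Python sketch below is illustrative only; the function name `roc_rates` and its argument names are choices made here.

```python
def roc_rates(tp, fn, fp, tn):
    """Return (true positive rate, false positive rate) from confusion counts."""
    true_positive_rate = tp / (tp + fn)    # Eq. (3): positives correctly classified / total positives
    false_positive_rate = fp / (fp + tn)   # Eq. (4): negatives incorrectly classified / total negatives
    return true_positive_rate, false_positive_rate


# 10 "Yes" wells (7 predicted "Yes", 3 predicted "No");
# 8 "No" wells (2 predicted "Yes", 6 predicted "No")
tpr, fpr = roc_rates(tp=7, fn=3, fp=2, tn=6)
print(tpr, fpr)  # 0.7 0.25
```

That classifier would therefore be plotted as the single point (0.25, 0.7) in ROC space.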
Several aspects of the ROC curve are worthy of mention. The lower left point (0, 0)
represents the classification of all instances as negative; point (1, 1) represents the
classification of all instances as positive. Point (0, 1) represents the best classification.
Figure 7 depicts the ROC curves for the GwPAC, FACA, CBA and CMAR
algorithms. The figure shows that the GwPAC algorithm performs better
than CBA, FACA and CMAR, and that CBA is the poorest algorithm for detecting
new groundwater locations. More specifically, GwPAC is generally better than the
CMAR algorithm, except at 0.20 < false positive rate < 0.65, where the CMAR
algorithm has a minor advantage.

Fig. 7. The ROC curves of GwPAC, CBA, FACA and CMAR.


Figure 7 also shows the areas under the four ROC curves. The GwPAC has
the largest area and, therefore, the best average performance. The AUC values for
GwPAC, FACA, CMAR and CBA are 0.892, 0.832, 0.868 and 0.551, respectively.
In comparison with the AUC classification in the study by Yesilnacar (2005), it can
be seen that the GwPAC, FACA and CMAR algorithms (all AUCs > 80%) applied
in this study produce reasonably good classification accuracy rates in the prediction
of new groundwater locations. Based on the classification accuracies achieved,
it can be observed that the AC algorithms, especially GwPAC, FACA and
CMAR, can be applied as efficient data mining algorithms in predicting new
groundwater locations in Jordan. However, the CBA algorithm (AUC = 0.551) is
shown to have a poor classification accuracy rate in predicting new groundwater
locations.
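For readers unfamiliar with how an area under an ROC curve is obtained from a set of (false positive rate, true positive rate) points, a common approach is the trapezoidal rule, sketched below in Python. The function name `auc_trapezoid` is hypothetical, the points in the example are illustrative rather than taken from Fig. 7, and the paper does not state which numerical method produced the reported AUC values.

```python
def auc_trapezoid(points):
    """Area under a piecewise-linear ROC curve given (fpr, tpr) points."""
    pts = sorted(points)  # order by false positive rate, then true positive rate
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0  # trapezoid between neighbouring points
    return area


# A random-guess diagonal has area 0.5; a curve through the ideal point (0, 1) has area 1.0
print(auc_trapezoid([(0, 0), (1, 1)]))          # 0.5
print(auc_trapezoid([(0, 0), (0, 1), (1, 1)]))  # 1.0
```

This makes the interpretation of the reported values concrete: an AUC of 0.551 is close to the random-guess diagonal, whereas values near 0.9 approach the ideal point.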

5.3. Rule-based algorithms performance


In the second experiment, we compare three easy-to-understand
classifiers: C4.5, Random Forest and PBC4cip. The comparison is based on
classification accuracy and the AUC measure.
According to Table 2, Random Forest achieves the best
results for both measures (classification accuracy and AUC). The second-best algorithm
is C4.5, which obtained a classification accuracy similar to that of Random Forest,
although its AUC was lower. The PBC4cip algorithm
obtained the worst results for both measures; in other words, the Random Forest
and C4.5 algorithms outperform PBC4cip in terms of classification accuracy and
AUC measure. To be more specific, Random Forest and C4.5 achieved 8.1% and 8%
higher classification accuracies than the PBC4cip algorithm, respectively. In addition,
Random Forest produced 6.6% and 4% higher AUCs than the PBC4cip and C4.5
algorithms, respectively.
Furthermore, the results indicated that Random Forest and C4.5 produced
3.5% and 3.4% higher accuracies than our proposed algorithm, respectively. On the other
hand, our proposed algorithm outperformed PBC4cip by 4.6%. In terms of the AUC measure,
the GwPAC algorithm outperformed C4.5 and PBC4cip by 0.8% and 3.4%,
respectively.
Finally, from this experiment, we can conclude that all considered rule-based
algorithms performed well and are applicable when dealing with the identification of
groundwater locations.

Table 2. The results of rule-based algorithms.

Rule-based algorithm    Classification accuracy (%)    AUC (%)
C4.5                    87.2                           88.4
Random Forest           87.3                           92.4
PBC4cip                 79.2                           85.8
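The differences quoted in this subsection are gaps in percentage points between the entries of Table 2, with the GwPAC values taken from Table 1 (accuracy 83.8%) and Sec. 5.2.2 (AUC 0.892, i.e. 89.2%). The short Python sketch below recomputes them; the dictionary names and the helper `gap` are illustrative choices made here.

```python
# Values copied from Table 2, plus GwPAC from Table 1 and Sec. 5.2.2
accuracy = {"C4.5": 87.2, "Random Forest": 87.3, "PBC4cip": 79.2, "GwPAC": 83.8}
auc = {"C4.5": 88.4, "Random Forest": 92.4, "PBC4cip": 85.8, "GwPAC": 89.2}


def gap(metric, first, second):
    """Difference in percentage points between two algorithms, to one decimal place."""
    return round(metric[first] - metric[second], 1)


print(gap(accuracy, "Random Forest", "PBC4cip"), gap(accuracy, "C4.5", "PBC4cip"))  # 8.1 8.0
print(gap(auc, "Random Forest", "PBC4cip"), gap(auc, "Random Forest", "C4.5"))      # 6.6 4.0
print(gap(accuracy, "Random Forest", "GwPAC"), gap(accuracy, "GwPAC", "PBC4cip"))   # 3.5 4.6
print(gap(auc, "GwPAC", "C4.5"), gap(auc, "GwPAC", "PBC4cip"))                      # 0.8 3.4
```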


Table 3. The results of popular algorithms.

Benchmark algorithm    Classification accuracy (%)    AUC (%)
SVM                    88.4                           86.3
NB                     87.3                           92.4
KNN                    87.2                           91.5
ANN                    86.3                           91.0

5.4. Benchmark data mining algorithms performance


To generalise the applicability of data mining to discovering new groundwater
locations in Jordan, we compare four popular data mining classifiers: SVMs, KNN,
NB and ANN. To assess the classifiers, we used the classification accuracy and
AUC measure.
After analysing Table 3, regarding classification accuracy, we found that the
SVMs classifier outperformed all other popular data mining algorithms. The ANN
algorithm achieved the worst results. The KNN algorithm produced similar results
to NB. In particular, the SVMs achieved 1.1%, 1.2% and 2.1% higher classification
accuracies than the NB, KNN and ANN algorithms, respectively. In addition, regarding
the AUC measure, we observed that NB obtained the best results and the SVMs
achieved the worst results. More specifically, NB obtained 6.1%, 0.9% and 1.4%
higher AUCs than the SVMs, KNN and ANN classifiers, respectively.
Besides, our proposed algorithm achieved classification accuracy 4.6%, 3.5%,
3.4% and 2.5% lower than SVMs, NB, KNN and ANN, respectively. On the other hand, our
proposed algorithm outperformed SVMs in terms of the AUC measure by 2.9%.
The results also indicated that popular classification algorithms, such as SVMs
and ANN, obtain high classification accuracy. Nevertheless, they produce black box
and complex models that are difficult for the decision maker to understand and
interpret. The rule-based model is vital for decision makers for the following reasons:
(1) the decision maker can easily manipulate, read and understand the rules
produced by the rule-based algorithms; and (2) the classification process is efficient and
comparable with black box classifiers, such as SVMs and ANN.
Finally, from all experiments, we can conclude that all considered algorithms
produce acceptable classification accuracy and AUC rates; this reflects the features'
relevance for the groundwater dataset. In addition, all algorithms can be useful and
suitable methods for addressing the problem of exploring new groundwater locations
in Jordan.

6. Conclusions
The problem of the exploration of groundwater locations is an important topic of
research, especially in arid and semi-arid regions. Scholars have used several algorithms
to address this problem, such as SVMs, ANN and Random Forest. In this
paper, we propose a new AC algorithm, GwPAC, and investigate its performance
against 10 well-known algorithms on predicting new groundwater locations in Jordan
as a case study. We identify seven features to differentiate the groundwater
locations, namely elevation, slope, outcrop, valleys, average temperature, average
rainfall and faults.
Our experimental analysis, notably using numerical and graphical evaluation
methods, indicates that the GwPAC algorithm outperforms CBA, FACA and
CMAR in terms of classification accuracy, F1 score, learning time and AUC. In
particular, in relation to classification accuracy, the GwPAC algorithm outperformed
CMAR, FACA and CBA by 4.6%, 2.2% and 5.9%, respectively. The F1
score results show that the GwPAC algorithm performs better than the CMAR,
FACA and CBA algorithms, with the GwPAC algorithm having 10.8%, 2.1% and
13.4% higher F1 scores than CMAR, FACA and CBA, respectively. Furthermore,
the GwPAC algorithm builds the model faster than CMAR and CBA, but its learning
time is similar to that of FACA. In fact, GwPAC employs the Diffsets method, which requires
only one scan of the input data to discover all the candidate rules. The AUCs
produced by the GwPAC, CMAR, FACA and CBA algorithms were 0.892, 0.868, 0.832
and 0.551, respectively. Thus, the GwPAC algorithm performs better than the CMAR,
FACA and CBA algorithms.
Further, the GwPAC algorithm produces classification accuracy lower than the
SVM, NB, Random Forest, KNN, C4.5 and ANN algorithms by 4.6%, 3.5%, 3.5%,
3.4%, 3.4% and 2.5%, respectively. Nevertheless, all data mining algorithms produce
acceptable classification accuracy, i.e. 77.9% or higher (Table 1).
Based on these results, we conclude that the data mining algorithms, especially
the GwPAC algorithm, can be a useful and appropriate method for addressing the
problem of exploring new groundwater locations in Jordan. The rule-pruning procedure
in the GwPAC algorithm reduces the number of rules discovered, which positively
impacts the classification accuracy on test cases and lets the end-user easily
understand and manipulate the results.
In the near future, we would like to perform the following work:
(1) Investigate GwPAC and all considered algorithms in terms of time and space
complexity.
(2) Evaluate our proposed algorithm using the filter-strategy pruning method
(García-Borroto et al., 2010), and compare the results with those obtained by our
pruning method.
(3) Implement the weighted prediction method (Loyola-Gonzalez et al., 2017)
within the GwPAC algorithm, and compare the results with those achieved by
our GmR prediction method.
(4) Investigate all comparisons against different datasets from the UCI machine learning
repository (Lichman, 2013) to generalise the performance of the GwPAC
algorithm.
(5) Extend our groundwater dataset to include more governorates and more
features such as topographic and geological features (Rahmati et al., 2016).


References
Abdelhamid, N, A Ayesh, F Thabtah, S Ahmadi and W Hadi (2012). MAC: A multiclass
associative classi¯cation algorithm. Journal of Information & Knowledge Management,
11(2), 1250011, doi:10.1142/S0219649212500116.
Abdelhamid, N, A Ayesh and W Hadi (2014). Multi-label rules algorithm based associative
by UNIVERSITY OF LIVERPOOL on 11/13/18. Re-use and distribution is strictly not permitted, except for Open Access articles.

classi¯cation. Parallel Processing Letters, 24(1), 1450001, doi:10.1142/S0129626414500017.


Abdulla, FA, MA Al-Khatib and ZD Al-Ghazzawi (2000). Development of groundwater
modeling for the Azraq Basin, Jordan. Environmental Geology, 40(1–2), 11–18, doi:
10.1007/s002549900105.
Aburub, F and W Hadi (2016). Predicting groundwater areas using data mining techniques:
Groundwater in Jordan as case study. International Journal of Computer, Electrical,
Automation, Control and Information Engineering, 10(9), 1475–1478.
Alazaidah, R, F Thabtah and Q Al-Radaideh (2015). A multi-label classification approach
based on correlations among labels. International Journal of Advanced Computer Science
and Applications, 6(2), 52–59, doi:10.14569/IJACSA.2015.060208.
Al-Zyoud, S, W Rühaak, E Forootan and I Sass (2015). Over exploitation of groundwater in
the centre of Amman Zarqa Basin — Jordan: Evaluation of well data and GRACE satellite
observations. Resources, 4(4), 819–830, doi:10.3390/resources4040819.
Breiman, L (2001). Random Forests. Machine Learning, 45(1), 5–32, doi:10.1023/A:1010933404324.
Brodie, K, D Fettes, B Harte and R Schmid (2007). Structural terms including fault rock
terms. In Metamorphic Rocks: A Classification and Glossary of Terms, D Fettes and J
Desmonds (eds.), pp. 24–31. Cambridge: Cambridge University Press.
Deng, H, G Runger, E Tuv and W Bannister (2014). CBC: An associative classifier with a small
number of rules. Decision Support Systems, 59(1), 163–170, doi:10.1016/j.dss.2013.11.004.
El-Naqa, A and A Al-Shayeb (2009). Groundwater protection and management strategy in
Jordan. Water Resources Management, 23(12), 2379–2394, doi:10.1007/s11269-008-9386-x.
El-Naqa, A (2010). Study of salt water intrusion in the Upper Aquifer in Azraq Basin. Final
Report No. IUCN-Rep-2010-042, International Union for Conservation of Nature. Available
at https://www.iucn.org/sites/dev/files/import/downloads/final_report_azraq_2011.pdf.
Accessed on 19 June 2016.
Encyclopedia.com (2009). Elevation. Available at http://www.encyclopedia.com/doc/1O999-
elevation.html. Accessed on 15 March 2016.
Fawcett, T (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861–
874, doi:10.1016/j.patrec.2005.10.010.
García-Borroto, M, JF Martínez-Trinidad, JA Carrasco-Ochoa, MA Medina-Perez and J
Ruiz-Shulcloper (2010). LCMine: An efficient algorithm for mining discriminative
regularities and its application in supervised classification. Pattern Recognition, 43(9), 3025–
3034, doi:10.1016/j.patcog.2010.04.008.
Hadadin, N, M Qaqish, E Akawwi and A Bdour (2010). Water shortage in Jordan: Sustain-
able solutions. Desalination, 250(1), 197–202, doi:10.1016/j.desal.2009.01.026.
Hadi, W (2015). ECAR: A new enhanced class association rule. Advances in Computational
Sciences and Technology, 8(1), 43–52.
Hadi, W, F Aburub and S Alhawari (2016). A new fast associative classification algorithm
for detecting phishing websites. Applied Soft Computing, 48(1), 729–734, doi:10.1016/j.
asoc.2016.08.005.
Hadi, W, G Issa and A Ishtaiwi (2017). ACPRISM: Associative classification based on PRISM
algorithm. Information Sciences, 417, 287–300, doi:10.1016/j.ins.2017.07.025.
Hadi, W, QA Al-Radaideh and S Alhawari (2018). Integrating associative rule-based
classification with Naive Bayes for text classification. Applied Soft Computing, 69, 344–356,
doi:10.1016/j.asoc.2018.04.056.

Hall, M, E Frank, G Holmes, B Pfahringer, P Reutemann and IH Witten (2009). The WEKA
data mining software. ACM SIGKDD Explorations Newsletter, 11(1), 10, doi:10.1145/
1656274.1656278.
Howell, JV (1957). Glossary of Geology and Related Sciences. Alexandria: American Geological
Institute. Available at http://www.abebooks.co.uk/servlet/BookDetailsPL?bi=18523214914&searchurl=an%3DHowell%252C%2520J.%2520V.%2520%2528American%2520Geological%2520Institute%2529.
Accessed on 9 March 2016.
Israil, M, M Al-hadithi and DC Singhal (2006). Application of a resistivity survey and geo-
graphical information system (GIS) analysis for hydrogeological zoning of a piedmont area,
Himalayan foothill region, India. Hydrogeology Journal, 14(5), 753–759, doi:10.1007/
s10040-005-0483-0.
Joachims, T (1998). Text categorization with support vector machines: Learning with many
relevant features. In Proceedings of the 10th European Conference on Machine Learning,
pp. 137–142. London, UK: Springer-Verlag.
Johnson, ZC, CD Snyder and NP Hitt (2017). Landform features and seasonal precipitation
predict shallow groundwater influence on temperature in headwater streams. Water
Resources Research, 53(7), 5788–5812, doi:10.1002/2017WR020455.
Karthik, D and K Vijayarekha (2014). Multivariate data mining techniques for assessing
water potability. Rasayan Journal of Chemistry, 7(3), 256–259.
Kolli, K and R Seshadri (2013). Ground water quality assessment using data mining techniques.
International Journal of Computer Applications, 76(15), 39–45, doi:10.5120/13324-0885.
Li, W, J Han and J Pei (2001). CMAR: Accurate and efficient classification based on multiple class-
association rules. In Proceedings of the 2001 IEEE International Conference on Data Mining,
pp. 369–376. Washington, DC: IEEE Computer Society Press, doi:10.1109/ICDM.2001.989541.
Lichman, M (2013). UCI Machine Learning Repository. Available at
https://archive.ics.uci.edu/ml/citation_policy.html. Accessed on 9 March 2016.
Liu, B, W Hsu, Y Ma and B Ma (1998). Integrating classification and association rule mining.
In Proceedings of the Fourth International Conference on Knowledge Discovery and Data
Mining, pp. 80–86. New York: AAAI Press, doi:10.1.1.48.8380.
Loyola-Gonzalez, O, MA Medina-Perez, JF Martínez-Trinidad, JA Carrasco-Ochoa, R
Monroy and M García-Borroto (2017). PBC4cip: A new contrast pattern-based classifier
for class imbalance problems. Knowledge-Based Systems, 115, 100–109, doi:10.1016/j.
knosys.2016.10.018.
Meganathan, S and TR Sivaramakrishnan (2013). Association rule mining and classifier
approach for 48-hour rainfall prediction over Cuddalore station of East Coast of India.
Research Journal of Applied Sciences, Engineering and Technology, 5(14), 3692–3696.
Mishra, SP and S Dwibedy (2015). Geo-hydrology of South Mahanadi Delta and Chilika Lake,
Odisha. International Journal of Advanced Research, 3(11), 430–444.
Mohsen, MS (2007). Water strategies and potential of desalination in Jordan. Desalination,
203(1–3), 27–46, doi:10.1016/j.desal.2006.03.524.
MWI (2015). Jordan Water Sector Facts and Figures 2013. Amman: Ministry of Water and
Irrigation. Available at http://www.mwi.gov.jo/sites/en-us/Documents/W.%20in%Fig.E%
20FINAL%20E.pdf. Accessed on 22 April 2016.
Nation Master (2015). Geography > Average rainfall in depth > Mm per year: Countries
compared. Available at http://www.nationmaster.com/country-info/stats/Geography/
Average-rainfall-in-depth/Mm-per-year. Accessed on 15 March 2016.
Peterson, LE and MA Coleman (2008). Machine learning-based receiver operating
characteristic (ROC) curves for crisp and fuzzy classification of DNA microarrays in cancer
research. International Journal of Approximate Reasoning, 47(1), 17–36, doi:10.1016/j.
ijar.2007.03.006.

Quinlan, JR (1993). C4.5: Programs for Machine Learning. San Francisco, CA: Morgan
Kaufmann.
Rahmati, O, HR Pourghasemi and AM Melesse (2016). Application of GIS-based data driven
random forest and maximum entropy models for groundwater potential mapping: A case
study at Mehran Region, Iran. Catena, 137, 360–372, doi:10.1016/j.catena.2015.10.010.
Sahoo, S and MK Jha (2013). Groundwater-level prediction using multiple linear regression
and artificial neural network techniques: A comparative assessment. Hydrogeology Journal,
21(8), 1865–1887, doi:10.1007/s10040-013-1029-5.
Salahat, M, M Al-Qinna, K Mashal and N Hammouri (2014). Identifying major factors con-
trolling groundwater quality in semiarid area using advanced statistical techniques. Water
Resources Management, 28(11), 3829–3841, doi:10.1007/s11269-014-0712-1.
Sujay Raghavendra, N and PC Deka (2015). Forecasting monthly groundwater level
fluctuations in coastal aquifers using hybrid Wavelet packet: Support vector regression. Cogent
Engineering, 2(1), 999414, doi:10.1080/23311916.2014.999414.
Thabtah, F, Q Mahmood, L McCluskey and H Abdel-Jaber (2010). A new classification based
on association algorithm. Journal of Information & Knowledge Management, 9(1), 55–64,
doi:10.1142/S0219649210002486.
Thabtah, F, W Hadi, N Abdelhamid and A Issa (2011). Prediction phase in associative
classification mining. International Journal of Software Engineering and Knowledge
Engineering, 21(6), 855–876, doi:10.1142/S0218194011005463.
Wikipedia (2001). Valleys. Available at https://en.wikipedia.org/wiki/Valley. Accessed on 20
August 2016.
Wikipedia (2002). Slope. Available at https://en.wikipedia.org/wiki/Slope. Accessed on 20
August 2016.
World Health Organization (2010). Jordan: Water is Life. WHO. Available at http://www.
who.int/heli/pilots/jordan/en/#. Accessed on 15 March 2016.
Xu, T and AJ Valocchi (2015). Data-driven methods to improve baseflow prediction of a
regional groundwater model. Computers & Geosciences, 85, 1–13, doi:10.1016/j.
cageo.2015.05.016.
Yesilnacar, EK (2005). The Application of Computational Intelligence to Landslide Suscep-
tibility Mapping in Turkey. Parkville: University of Melbourne.
Yoo, K, S Kumar, J Joon, K Oh and J Park (2016). Decision tree-based data mining and rule
induction for identifying hydrogeological parameters that influence groundwater pollution
sensitivity. Journal of Cleaner Production, 122, 277–286, doi:10.1016/j.jclepro.2016.01.075.
Zaki, MJ and K Gouda (2003). Fast vertical mining using diffsets. In Proceedings of the Ninth
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp.
326–335. New York: ACM Press, doi:10.1145/956750.956788.
Zhang, H, X-H Wang and X-F Chen (2015). Support vector with ROC optimization method
based fuel consumption modeling for civil aircraft. Procedia Engineering, 99, 296–303,
doi:10.1016/j.proeng.2014.12.538.
Zhu, C, Q Luan, Z Hao and Q Ju (2011). Integration of grey with neural network model and its
application in data mining. Journal of Software, 6(4), 716–723, doi:10.4304/jsw.6.4.716-723.
