Faisal Aburub
Department of Management Information Systems
Faculty of Administrative and Financial Sciences
University of Petra, P. O. Box 961343
Amman, Jordan
faburub@uop.edu.jo
J. Info. Know. Mgmt. Downloaded from www.worldscientific.com
Wa'el Hadi*
Department of Computer Information Systems
Faculty of Information Technology
University of Petra, P. O. Box 961343
Amman, Jordan
whadi@uop.edu.jo
Abstract. In this paper, we study the problem of predicting new locations of groundwater in Jordan
through the application of a proposed new method, Groundwater Prediction using Associative
Classification (GwPAC). We identify features that differentiate locations of groundwater wells according to
whether or not they contain water. In addition, we survey intelligent-based methods related to
groundwater exploration and management. Three experimental analyses were conducted to
evaluate the capability of data mining algorithms using real groundwater data from the Ministry of Water
and Irrigation. In the first experiment, we investigated the performance of GwPAC against three
well-known associative classification algorithms, namely CBA, CMAR and FACA. In the second experiment,
three rule-based algorithms (C4.5, Random Forest and PBC4cip) were investigated. In the third
experiment, so as to generalise the capability of using data mining for solving the groundwater detection
problem, four benchmark algorithms (SVMs, NB, KNN and ANNs) were evaluated.
Across all the experiments, the results indicated that all considered data mining algorithms
predict locations of groundwater with an acceptable classification rate (all classification accuracies > 79%),
and can be useful methods when seeking to address the problem of exploring new groundwater locations.
1. Introduction
Jordan is recognised as one of the poorest countries in the world in terms of water
resources. According to the World Health Organization (WHO) (2010), "Jordan has
one of the lowest levels of water resource availability, per capita, in the world". This
poses a serious challenge, which threatens all sectors that depend on the availability
* Corresponding author.
November 1, 2018 9:10:27am WSPC/188-JIKM 1850043 ISSN: 0219-6492
for Jordanians from 3,600 m³ in 1946 to less than 123 m³ in 2014. This means that,
per capita, water resources for Jordanians are less than 12% of the international
water poverty level (MWI, 2015).
The main water resources in Jordan are surface water and groundwater. Surface
water resources are found in 15 main basins, whilst groundwater resources are spread
across 12 primary basins. The groundwater basins are Azraq, Hammad, North Jordan
Valley, Sirhan, Dead Sea, Jafer, North Wadi Araba, Yarmouk, Disi, Jordan Valley,
South Wadi Araba and Zarqa. According to the Jordanian Ministry of Water and
Irrigation (JMWI), the number of working wells is approximately 3,000; there are
also many illegal wells that exploit groundwater to the maximum level.
Rainfall is the only water supply resource for groundwater aquifers in Jordan.
According to the Jordanian Ministry of Water and Irrigation, the quantity of
renewable water resources for different purposes is around 750 MCM (million cubic metres).
One approach able to decrease the cost of extracting groundwater is to use data
mining techniques to predict groundwater areas. This paper aims to develop a new
approach based on data mining methods in order to predict new groundwater sites,
notably using the areas of Azraq, Zarqa and Mafraq in Jordan as a case study.
Classification using association, also referred to as Associative Classification
(AC), is a study area in data mining that integrates both association rule
discovery (unsupervised learning) and classification (supervised learning) tasks. In fact,
AC uses association rule discovery tasks to find the knowledge and then chooses a
subset on which to build the classifier (Thabtah et al., 2011). The main goal for AC is
to construct a classifier, also known as a model, that consists of a number of
rules (knowledge) learned from labelled input data, referred to as the training dataset, in
order to predict the class value for a test data instance as accurately as possible
(Hadi, 2015). In this paper, the labelled input data consists of features related to
groundwater wells already dug in different regions of Jordan. In Sec. 5 of this paper,
we provide details of the data collected, which relates to 900 groundwater wells
already dug in three different governorates of Jordan, namely Zarqa, Al-Mafraq and
Jerash.
In the last few years, a number of AC algorithms have been developed, such as
FACA (Hadi et al., 2016), ECAR (Hadi, 2015), MCAC (Abdelhamid et al., 2014),
CBC (Deng et al., 2014) and LCA (Thabtah et al., 2010). These studies have shown
that the AC approach usually produces more accurate classifiers than the classic
classification data mining approaches, such as probabilistic, statistical and decision
tree methods (Quinlan, 1993; Joachims, 1998). However, AC algorithms normally
suffer from the exponential growth of rules; in other words, they derive large
numbers of rules, which make the resulting models outsized, and, consequently, means
filtering strategy, which aims at reducing the number of rules whilst improving
accuracy.
Most of the previous works relating to groundwater have concentrated on the
problems of water quality and resource management, rather than predicting new
groundwater locations. Furthermore, the approaches used in such research are
mainly related to Geographic Information System (GIS) statistical and spatial
analysis tools (Israil et al., 2006; Rahmati et al., 2016) or complex mathematical
models from machine learning, data mining or statistics, such as regression
(Mishra and Dwibedy, 2015), Support Vector Machines (SVMs) (Xu and Valocchi,
2015) and Artificial Neural Networks (ANNs) (Zhu et al., 2011). According to Hadi
et al. (2017, 2018) and Abdelhamid et al. (2012), such approaches gain high
classification accuracy. However, they require extensive knowledge of mathematical
modelling and spatial analysis related to the different types of maps in GIS, and
often produce black-box and complex models that are difficult for end-users to
understand and interpret. Therefore, the AC mining approach, which many scholars
have shown to be accurate in prediction while producing easy-to-interpret
models, has the potential to be successful in applications such as
groundwater prediction (Thabtah et al., 2011; Abdelhamid et al., 2014; Alazaidah
et al., 2015; Hadi, 2015). To the best of our knowledge, the AC mining approach has not
previously been explored in the field of groundwater, and we believe that it can be
useful in predicting new groundwater locations.
In this paper, we propose a new AC algorithm, namely Groundwater Prediction
using Associative Classification (GwPAC). First, GwPAC discovers hidden correlations
(rules) between groundwater location features and then cuts down the number of
rules generated during the model-building step by pruning useless and redundant
rules in order to derive moderate-size classifiers. This results in a controllable
number of rules that the end-user can better understand and manipulate. Second,
unlike other AC mining algorithms, such as LCA (Thabtah et al., 2010), which is
known to use a single rule for prediction, the GwPAC algorithm employs a new
prediction method that guarantees that only high-quality rules are used to predict
test instances. This prediction procedure is based on multiple rules rather than a
single rule, so several rules contribute to predicting the
class value for unseen groundwater locations. This may improve the classification
accuracy rate of the resulting models in predicting new locations for groundwater.
Overall, the benefits of this new AC method are twofold: the first benefit concerns
developing a new spatial AC approach to the difficult problem of predicting new
locations of groundwater by examining features obtained from old groundwater
wells. In order to achieve this first goal, we must extend and enhance the performance
reduce the number of rules by eliminating redundant and unnecessary rules from
playing any part in the class assignment process of test cases; and (3) the class
assignment method in the prediction step by utilising a group of rules for prediction
rather than a single rule, as in LCA.
Through our analysis, we answer the following critical questions:
(1) Is the new spatial AC method appropriate to the problem of exploration of
groundwater locations in Jordan?
(2) What are the most significant features for the prediction of groundwater
location?
(3) Will the model-building step of the new spatial AC method reduce the
number of rules discovered without negatively impacting the classification
accuracy of test cases, so that the end-user can easily understand and manipulate
the results?
This paper makes the following contributions:
• Identifies a set of features that help to discover new groundwater locations.
• Investigates commonly used AC algorithms on real groundwater data collected
from the Jordanian Ministry of Water and Irrigation. No previous research
has tackled the problem of discovering new groundwater locations using AC
algorithms, and the current study is therefore a pioneering work in this subject.
• Assesses popular data mining algorithms on the same dataset with the aim of
exploring the applicability of their use in predicting groundwater locations.
• Proposes the GwPAC algorithm to enhance the performance of well-known AC
algorithms, especially for predicting new groundwater locations.
• Conducts an extensive experimental study on groundwater data.
Section 2 of this paper sets out the problem statements. Section 3 surveys the recent
AC algorithms, data mining techniques that are commonly used to address the
groundwater problem, and related works in the study area. Section 4 presents the
proposed algorithm steps. Section 5 provides an overview of the experiments, a
description of the groundwater data, the evaluation measures, the compared
algorithms, the results and the discussion of the results. Finally, Sec. 6 presents the
conclusion.
2. Problem Statements
An AC algorithm generally operates in three main phases. During the first phase, it
extracts the hidden relationships between the feature values and the class feature
values in the input data and represents them in "IF–THEN" rules (Hadi, 2015).
Once all rules have been extracted, the ranking and pruning procedures (Phase 2)
begin. The ranking procedure ranks rules in line with their confidence values or,
where two or more rules have the same confidence value, by their support values. In
addition, in the pruning procedure, useless and conflicting rules are removed, and the
by UNIVERSITY OF LIVERPOOL on 11/13/18. Re-use and distribution is strictly not permitted, except for Open Access articles.
rules that remain represent the AC model. Finally, the AC model is evaluated on
new data, referred to as testing data, to investigate its performance in predicting the
class of new test instances. The result of the final phase is the classification accuracy
rate of the AC algorithm.
Our proposed algorithm assumes that the input groundwater data is a normal
relational table containing 900 instances described by seven distinct features,
excluding the class feature. Each of these 900 instances belongs to one of two known
classes ("Yes" and "No").
Formally, our proposed algorithm is defined as follows: Let $D$ be the input data;
let $I$ be the set of all feature values in $D$ and $Y$ be the set of class values (labels). An
instance $d$ is said to contain $X$, a set of feature values in $I$, if $X \subseteq d$. The following
are the main terms that we study in this paper.
Term 1 (Support). The percentage of instances in $D$ that contain the feature
value.
Term 2 (Frequent features). A feature value $f \in I$ is said to be frequent if
its support is greater than or equal to a specified minimum support
constraint.
Term 3 (K-itemset). An itemset that contains $K$ items (features).
Term 4 (A class association rule (CAR)). An association of the form $X \rightarrow y$,
where $X \subseteq I$ and $y \in Y$.
Term 5 (Rule support). The support of rule $X \rightarrow y$ is the percentage of instances
in $D$ in which $X$ and $y$ hold together, $X \cup y$.
Term 6 (Rule confidence). The confidence of the rule $X \rightarrow y$ is the ratio of the
number of instances in $D$ that contain both $X$ and $y$ to the number of instances
that contain $X$, and is defined as follows:
$$\mathrm{Confidence}(X \rightarrow y) = \frac{\mathrm{Support}(X \cup y)}{\mathrm{Support}(X)}. \qquad (1)$$
Our aims are as follows: (1) to extract all CARs that have supports and confidences
greater than or equal to specified minimum support and minimum confidence
constraints, respectively; and (2) to build a model for predicting new groundwater
locations from CARs.
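As an illustration of Terms 1 to 6, the following minimal Python sketch computes the support and confidence of a single CAR on a toy table. The four instances, the class column name `Groundwater` and the helper names (`matches`, `support`, `confidence`) are hypothetical and are not taken from the groundwater dataset; the sketch only mirrors Eq. (1).

```python
from typing import Dict, List

Instance = Dict[str, str]

# Hypothetical toy table; the class label is stored under the assumed key "Groundwater".
DATA: List[Instance] = [
    {"Rainfall": "Semi-Dry", "Faults": "Yes", "Groundwater": "Yes"},
    {"Rainfall": "Semi-Dry", "Faults": "No", "Groundwater": "Yes"},
    {"Rainfall": "Semi-Dry", "Faults": "Yes", "Groundwater": "No"},
    {"Rainfall": "Dry", "Faults": "No", "Groundwater": "No"},
]


def matches(instance: Instance, items: Dict[str, str]) -> bool:
    """Return True if the instance contains every feature value in items."""
    return all(instance.get(key) == value for key, value in items.items())


def support(data: List[Instance], items: Dict[str, str]) -> float:
    """Fraction of instances in data that contain all the given items."""
    return sum(matches(d, items) for d in data) / len(data)


def confidence(data: List[Instance], body: Dict[str, str],
               class_key: str, class_value: str) -> float:
    """Support(X ∪ y) / Support(X), following Eq. (1)."""
    body_support = support(data, body)
    if body_support == 0:
        return 0.0  # the rule body never occurs, so the rule has no evidence
    rule_support = support(data, {**body, class_key: class_value})
    return rule_support / body_support


body = {"Rainfall": "Semi-Dry"}
body_support = support(DATA, body)
rule_confidence = confidence(DATA, body, "Groundwater", "Yes")
```

For the rule Rainfall = Semi-Dry → Yes, the toy table gives a support of 0.75 for the rule body and a rule support of 0.5, and hence a confidence of 0.5/0.75 ≈ 0.67.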
3. Related Works
In the following subsections, we present a more detailed overview of the related
works from the two domains with which our research is concerned: groundwater
prediction and AC.
governorate is dry for most of the year. Mafraq borders with Saudi Arabia in the
south, Syria in the north, Iraq in the east and Jerash governorate in the west.
Jerash governorate is located to the north of Amman, and is the smallest
governorate in Jordan in terms of area. Its capital is Jerash. The governorate's lands
are hilly and fertile. Its borders are with Irbid in the north, Ajloun governorate in the
west, Mafraq governorate in the east and Zarqa governorate in the south.
The main basins that cross these governorates are Azraq Basin and Zarqa Basin.
Azraq Basin is considered the main supplier of drinking water for the city of Amman.
This basin is located in the northeast of Jordan; more specifically, it is the area that
lies between 205 and 400 East and between 55 and 230 North, according to the
Palestine Grid. The town of Azraq is located in the middle of the basin,
approximately 100 km east of Amman. Azraq Basin is the largest desert basin in Jordan,
with an area of about 12,710 km² (El-Naqa, 2010). The area of Zarqa Basin is around
4,120 km², with 95% of its area within Jordan and 5% within Syria.
Because of the water shortage in Jordan, many studies have discussed
this problem and sought feasible solutions. Hadadin et al. (2010) considered
the Jordanian government's claim that the average annual water share of the
Jordanian people is very small in comparison with those of their neighbours, such as the
people of Egypt and Turkey. Therefore, they suggested a set of sustainable solutions,
namely the desalination of seawater and/or brackish water, importation of water from
Turkey, processing of wastewater and its reuse, particularly in agriculture, and the
reduction of water demand by using technology and awareness programmes/initiatives.
Mohsen (2007) stated that groundwater is a major water resource in Jordan, and
therefore suggested several strategies to protect this important resource. These
strategies focussed on water harvesting, wastewater recycling, water imports,
supplying the Dead Sea with ocean water, the desalination of seawater and brackish
water, and the reallocation of water resources based on sector and use.
Abdulla et al. (2000) applied the MODFLOW model to simulate the
level change in the complex aquifer systems, using Azraq Basin as a case study.
MODFLOW is a three-dimensional groundwater flow model that aims at predicting
the level of aquifer water. The results show that, if water pumping continues at its
current rate, the level of water in the well-field areas will fall by approximately 25 m by
2025. In the worst-case scenario, in which water pumping increases to 1.5 times its current
rate, the level of water in the well-field areas will fall by approximately 39 m by 2025.
Salahat et al. (2014) investigated the factors that control the quality of the
groundwater in a semi-arid area. They used advanced statistical methods and
hierarchical clustering combined with GIS to predict quality based on three factors: land
use or land cover, aquifer type and soil texture. The results showed that the most
effective factor for predicting pollution of groundwater is land use or land cover,
followed by aquifer type and soil texture.
El-Naqa and Al-Shayeb (2009) indicated that groundwater resources are very
important in Jordan, and that protection plans and management are required in
order to save the groundwater from over-exploitation, which would, in turn, lead to a
decline in the water level. Al-Zyoud et al. (2015) used satellite data to estimate
groundwater over-exploitation in the Amman Zarqa Basin.
A new approach to detecting groundwater sites in Jordan will help to locate
groundwater for a reasonable cost, not only for Jordan itself but for all countries
that suffer from a lack of water.
hydrogeological and pollution sensitivity data from the Woosan Industrial Complex
in Korea and identified seven hydrogeological features: net recharge, depth to water,
aquifer media, topography, soil media, hydraulic conductivity and vadose zone
media. The experimental results with four commonly used data mining
algorithms (decision tree, Artificial Neural Network, multinomial logistic regression
and case-based reasoning) showed that the decision tree algorithm produced
higher classification accuracy than the other algorithms. The authors also utilised
the ordinal pairwise partitioning algorithm with the decision tree to increase the
classification accuracy. The proposed model results showed that the soil media, net
recharge and aquifer media were the main hydrogeological features affecting
groundwater sensitivity. Furthermore, the results indicated that the proposed
new algorithm gave more accurate and more consistent estimates of groundwater
pollution than other algorithms.
Rahmati et al. (2016) investigated two popular data mining techniques, namely
Random Forest and Maximum Entropy, on 163 groundwater wells in Mehran
Region, Iran, using data collected from the Iranian Department of Water Resources
Management. The authors identified 10 features that affect the storage of
groundwater: slope aspect, slope percent, altitude, plan curvature, distance from rivers,
drainage density, topographic wetness index, lithology, soil texture and land use.
The area under the receiver operating characteristic (ROC) curve (AUC) was used
to evaluate the performance of Random Forest and Maximum Entropy. The
experimental results indicated that the AUCs for the success rates of Random Forest
and Maximum Entropy were 86.5% and 91%, respectively, while the AUCs for the
prediction rates of Random Forest and Maximum Entropy were 83.1% and 87.7%,
respectively. Thus, the authors concluded that the data mining algorithms were
effective for detecting new groundwater locations.
Karthik and Vijayarekha (2014) combined Principal Component Analysis
(PCA) with a supervised data mining algorithm called JRIP to classify
the groundwater in various locations of Thanjavur, Ariyalur and
Nagapattinam, the Cauvery Delta Regions of Tamil Nadu in India. The study
aimed to check whether or not the groundwater from these locations is potable. The
experimental results showed that machine learning techniques can be used for faster
classification of water potability on datasets containing 11 chemical and physical
features: electrical conductivity (EC), pH, alkalinity, total alkalinity, TDS and the
levels of calcium, magnesium, sodium, potassium, chloride and sulphates.
Meganathan and Sivaramakrishnan (2013) presented an association rule-mining
algorithm called predictive apriori for generating rules; these rules were tested using
the K classifier for predicting rain in Cuddalore station on the East Coast of India.
The experimental results indicated that the data mining techniques produce
satisfactory classification accuracy for rain prediction 48 h before the actual occurrence
of the rain. Furthermore, data mining techniques can discover hidden relationships
between various atmospheric features, such as temperature, dew point, wind speed,
visibility and rainfall. The dataset investigated contained 3,039 instances belonging
to two classes: "yes" for the occurrence of rainfall and "no" for the non-occurrence of
rainfall.
A total of 45 groundwater samples were collected by Kolli and Seshadri (2013) at
Tadepalli mandal in India between September and November 2012. The authors
identified a number of parameters to assess the water quality: pH, chlorides, elec-
processing procedure that is required. The procedures for our proposed associative
classifier are as follows: extract frequent itemsets, produce classification rules and
predict a new groundwater location (Fig. 2).
data are (1, 2, 4, 5, 8, 10) and (1, 3, 5, 8, 10), respectively. The Diffset for
⟨"Rainfall", "Semi-Dry"⟩ is {3, 6, 7, 9} and the one for ⟨"Faults", "Yes"⟩ is {2, 4,
6, 7, 9}. These single-itemsets can be utilized to produce the two-itemset
⟨⟨"Rainfall", "Semi-Dry"⟩, ⟨"Faults", "Yes"⟩⟩ by taking the difference of their Diffsets, i.e.
removing the elements of {3, 6, 7, 9} from {2, 4, 6, 7, 9}. The result of this difference is the set {2, 4}.
The support for a candidate K-itemset is calculated by subtracting the
cardinality of the K-itemset's Diffset from the support of the (K − 1)-itemset. The support
for ⟨⟨"Rainfall", "Semi-Dry"⟩, ⟨"Faults", "Yes"⟩⟩ is 4, because the support for
⟨"Rainfall", "Semi-Dry"⟩ is 6 and the cardinality of the Diffset of ⟨⟨"Rainfall", "Semi-Dry"⟩,
⟨"Faults", "Yes"⟩⟩ is 2, thus 6 − 2 = 4. In other words, if we have two
single-itemsets A and B, we would like to find Diffset(AB): Diffset(AB) = Diffset(B) −
Diffset(A), and support(AB) = support(A) − |Diffset(AB)|. Now, if the two-itemset
support produced (4) passes the minimum support, the itemset becomes frequent.
• Produce classification rules from mined itemsets. From the set of frequent itemsets
(F), find all rules such that the head of the rule is a class value ("Yes" or "No").
Let R be the set of all classification rules, $R = \{\, r \mid r$ is of the form
$A \rightarrow B$, where $B$ is a class value and $\mathrm{confidence}(r) \geq$ minimum confidence$\,\}$. Let
us consider that CARs are the rule sets generated from frequent itemsets. Rank
the classification rules in CARs by confidence, then by support, and then by
generality (a rule with a smaller number of feature values in its body ranks
higher). GwPAC then prunes the redundant and useless rules according to the
following process:
The GwPAC algorithm begins with the first ranked rule and checks it against the
input data. If the rule matches at least one instance from the input data, it is
inserted into the GwPAC model and all input data instances that match the rule body
and its class value are removed; otherwise, the rule is pruned.
This procedure is repeated on the remaining
rules until no more instances remain in the input data, or all rules have been checked.
The rule-pruning procedure in the GwPAC algorithm is designed to select a
minimal representative subset of rules that cover the input data, and only
high-quality rules are inserted into the model of the algorithm, which may increase
the classification accuracy rate.
• Classify a new groundwater location using the set of classification rules (CARs).
The classification judgment is made according to a new scoring method. The
the class with the maximum score. We named this method GmR (i.e. the
geometric mean of the confidence values of the rules that match). The
GwPAC algorithm makes the prediction judgment using multiple rules, which
previous studies on AC approaches consider to be a benefit, since
multiple rules enhance and improve the prediction judgment (Thabtah et al.,
2011). Finally, in situations when no rules in the GwPAC model match
the test instance (new groundwater location), the default class (the majority
class in the input data) is given to that instance.
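The Diffset arithmetic of the worked example in the itemset-extraction step can be checked with a few lines of Python. This is a minimal sketch, assuming that the example table holds ten instances with IDs 1 to 10 (which is what the listed instance IDs and Diffsets imply); the function name `extend` is our own, and the set difference is taken as Diffset(B) minus Diffset(A), which is the direction that reproduces the set {2, 4}.

```python
from typing import Set, Tuple

# Assumption: the worked example has ten instances with IDs 1..10.
ALL_IDS: Set[int] = set(range(1, 11))

ids_a = {1, 2, 4, 5, 8, 10}   # instances containing <"Rainfall", "Semi-Dry">
ids_b = {1, 3, 5, 8, 10}      # instances containing <"Faults", "Yes">

# A Diffset holds the instance IDs that do NOT contain the itemset.
diffset_a = ALL_IDS - ids_a   # {3, 6, 7, 9}
diffset_b = ALL_IDS - ids_b   # {2, 4, 6, 7, 9}


def extend(support_a: int, diffset_a: Set[int],
           diffset_b: Set[int]) -> Tuple[Set[int], int]:
    """Combine single-itemsets A and B into AB using Diffsets only.

    Diffset(AB) = Diffset(B) - Diffset(A)
    support(AB) = support(A) - |Diffset(AB)|
    """
    diffset_ab = diffset_b - diffset_a
    return diffset_ab, support_a - len(diffset_ab)


diffset_ab, support_ab = extend(len(ids_a), diffset_a, diffset_b)
```

The result can be cross-checked against the direct count: the instances that contain both items are {1, 5, 8, 10}, which also gives a support of 4.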
In summary, our proposed algorithm has many advantages over normal AC
algorithms. These advantages are as follows:
(1) The GwPAC algorithm uses multiple rules to predict test cases. This may
enhance the classification accuracy of the resulting models in predicting new test
cases. By contrast, most current AC algorithms use the single rule
with the highest confidence to predict test cases. The FACA algorithm (Hadi et al.,
2016) assigns a test case the class that has the highest number of matching rules,
which makes its prediction method sensitive to the majority class.
(2) The GwPAC algorithm proposes a GmR prediction procedure that uses both
confidence and support constraints to evaluate the rules, unlike other AC
algorithms, which use only the confidence constraint to evaluate the rules.
(3) The GwPAC algorithm extracts new hidden rules that current associative classification
algorithms are unable to extract. These rules might play a substantial role in
decision-making processes, especially in real-life applications such as medical
diagnosis, weather forecasting and groundwater detection.
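The ranking, pruning and multiple-rule prediction steps described in this section can be sketched as follows. This is an illustration under stated assumptions rather than the GwPAC implementation: the `Rule` structure, the toy rules and their confidence and support values are hypothetical; a rule is taken to match a training instance when both its body and its class agree with the instance; and each class is scored by the geometric mean of the confidences of its matching rules, which is only one possible reading of the GmR description.

```python
from math import prod
from typing import Dict, List, NamedTuple

Instance = Dict[str, str]


class Rule(NamedTuple):
    body: Dict[str, str]   # feature values in the rule body
    label: str             # class value predicted by the rule
    confidence: float
    support: float


def rank(rules: List[Rule]) -> List[Rule]:
    """Order by confidence, then support, then fewer items in the body."""
    return sorted(rules, key=lambda r: (-r.confidence, -r.support, len(r.body)))


def covers(rule: Rule, instance: Instance) -> bool:
    """True if every feature value in the rule body appears in the instance."""
    return all(instance.get(k) == v for k, v in rule.body.items())


def prune(rules: List[Rule], data: List[Instance], class_key: str) -> List[Rule]:
    """Keep a rule only if it correctly covers at least one remaining instance."""
    remaining = list(data)
    model: List[Rule] = []
    for rule in rank(rules):
        if not remaining:
            break  # every training instance is already covered
        left = [d for d in remaining
                if not (covers(rule, d) and d[class_key] == rule.label)]
        if len(left) < len(remaining):  # the rule covered something
            model.append(rule)
            remaining = left
    return model


def predict(model: List[Rule], instance: Instance, default: str) -> str:
    """Score each class by the geometric mean of its matching rules' confidences."""
    matched: Dict[str, List[float]] = {}
    for rule in model:
        if covers(rule, instance):
            matched.setdefault(rule.label, []).append(rule.confidence)
    if not matched:
        return default  # no rule fires: fall back to the default class
    scores = {c: prod(v) ** (1 / len(v)) for c, v in matched.items()}
    return max(scores, key=scores.get)


# Hypothetical training table and candidate rules.
TRAIN = [
    {"Rainfall": "Semi-Dry", "Faults": "Yes", "Groundwater": "Yes"},
    {"Rainfall": "Semi-Dry", "Faults": "No", "Groundwater": "Yes"},
    {"Rainfall": "Dry", "Faults": "Yes", "Groundwater": "No"},
    {"Rainfall": "Dry", "Faults": "No", "Groundwater": "No"},
]
RULES = [
    Rule({"Faults": "Yes"}, "Yes", 0.5, 0.25),
    Rule({"Rainfall": "Semi-Dry", "Faults": "Yes"}, "Yes", 1.0, 0.25),
    Rule({"Rainfall": "Semi-Dry"}, "Yes", 1.0, 0.5),
    Rule({"Rainfall": "Dry"}, "No", 1.0, 0.5),
]
MODEL = prune(RULES, TRAIN, "Groundwater")
```

On the toy data, the two single-feature rainfall rules cover all four training instances, so the other two rules are pruned and `MODEL` holds two rules.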
(Li et al., 2001) and FACA (Hadi et al., 2016), are used to investigate the
performance of the GwPAC algorithm on predicting groundwater locations. These
algorithms were selected because they use similar learning methodologies,
which allows a fair comparison, and because their implementations are publicly
available.
follows:
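$$F_1 = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{recall} + \mathrm{precision}}, \qquad (2)$$
where recall is the number of correct predictions for a class divided by the total number
of instances that actually belong to that class, and precision is the number of correct
predictions for a class divided by the total number of instances that the system
assigned to that class.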
Let us illustrate the performance measures using an example in which a
classification system is used to predict groundwater locations. The sample has 18
groundwater locations, where 10 locations are labelled "Yes" and eight are labelled
"No".
For the 10 locations labelled "Yes", the classification system predicted seven as
"Yes" and three as "No", and for the eight labelled "No", the system predicted six as
"No" and two as "Yes". Precision for class "Yes" = 7/9, recall for class "Yes" = 7/10,
F1 for class "Yes" = 0.737; precision for class "No" = 6/9, recall for class "No" = 6/8,
F1 for class "No" = 0.706.
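The figures in this example can be reproduced directly from the counts of correct and incorrect predictions. The short sketch below is illustrative only; the helper name `precision_recall_f1` and its arguments (true positives, false positives and false negatives for one class) are our own.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Per-class precision, recall and F1 from prediction counts.

    tp: instances of the class predicted as the class
    fp: instances of other classes predicted as the class
    fn: instances of the class predicted as another class
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (recall + precision)
    return precision, recall, f1


# Class "Yes": 7 correct, 2 "No" locations predicted "Yes", 3 "Yes" locations missed.
yes_precision, yes_recall, yes_f1 = precision_recall_f1(tp=7, fp=2, fn=3)

# Class "No": 6 correct, 3 "Yes" locations predicted "No", 2 "No" locations missed.
no_precision, no_recall, no_f1 = precision_recall_f1(tp=6, fp=3, fn=2)
```

Rounded to three decimal places, `yes_f1` is 0.737 and `no_f1` is 0.706, matching the values quoted in the example.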
In the second experiment, we tested C4.5 (Quinlan, 1993), Random Forest
(Breiman, 2001) and PBC4cip (Loyola-Gonzalez et al., 2017) on our groundwater
dataset. C4.5 and Random Forest were selected because they are
two well-known, easy-to-understand classifiers (rule-based algorithms) that
exhibit excellent performance in many application contexts. Moreover, we
evaluate the PBC4cip classifier, which is suitable for this investigation for three
reasons:
(1) It is a rule-based algorithm.
(2) It was evaluated on more than 90 datasets, where it significantly
outperformed 11 other algorithms for class imbalance problems.
(3) The source code is publicly available.¹
1 https://sites.google.com/site/octavioloyola/papers/PBC4cip.
2009) was used to implement the algorithms considered in our experiments. WEKA
is known as a landmark system in data mining and machine learning. It has achieved
widespread acceptance within academia and business circles, and has become a
widely used tool for data mining research (Hall et al., 2009).
Here, we use 10-fold cross-validation to evaluate the algorithms considered in our
experiments. Experiments are performed on an Intel i7 machine with a 3-GHz
processor and 16 GB of main memory in a Windows 8 environment.
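For readers unfamiliar with the procedure, the sketch below shows how 10-fold cross-validation partitions a dataset: the instances are shuffled once and split into ten folds, and each fold serves once as the test set while the other nine form the training set. It is a generic illustration, not the WEKA routine; the function name `k_fold_indices`, the fixed seed and the absence of class stratification are our assumptions.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n: int, k: int = 10,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)        # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]   # k folds of near-equal size
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# With 900 instances, each of the 10 rounds trains on 810 and tests on 90.
splits = list(k_fold_indices(900, k=10))
```

The accuracy reported for an algorithm would then be the average of the accuracies obtained on the ten test folds.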
2 https://drive.google.com/drive/folders/0B0g0LP5sLwQhdnJxLUt2S24xOWM.
Fig. 3. Groundwater features: (a) average rainfall, (b) average temperature, (c) elevation, (d) slope,
(e) faults, (f) valleys and (g) outcrop.
(3) Elevation. The height of a geographic location above or below a fixed reference
point, most commonly a reference geoid, a mathematical model of the
Earth's sea level (Encyclopedia.com, 2009). The elevation values for the study
area vary between 343 m and 1,014 m, as displayed in Fig. 3(c). These values are
replaced according to the following rules.
(5) Slope of the Earth's surface. Calculated by finding the ratio of the "vertical
change" to the "horizontal change" between (any) two distinct points on a line.
Sometimes, the ratio is expressed as a quotient ("rise over run"), giving the same
number for every two distinct points on the same line (Wikipedia, 2002). The
slope feature values are numerical, as shown in Fig. 3(d). These values are
replaced according to the following rules.
Rule: If the value of the slope is between 0 and 2.5 → Very easy to dig
Else if the value of the slope is between 2.6 and 5 → Easy to dig
Else → Hard to dig
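The slope rule above can be written as a small function. This is a minimal sketch, assuming that slope values are non-negative and that values falling between 2.5 and 2.6 (which the rule does not mention) are assigned to "Easy to dig"; the function name `discretise_slope` is hypothetical.

```python
def discretise_slope(slope: float) -> str:
    """Map a numerical slope value to the categorical label used by the rule.

    Assumptions: slope >= 0, and values in the unspecified gap (2.5, 2.6)
    are treated as "Easy to dig".
    """
    if slope <= 2.5:
        return "Very easy to dig"
    if slope <= 5:
        return "Easy to dig"
    return "Hard to dig"


labels = [discretise_slope(value) for value in (0.0, 2.5, 2.6, 5.0, 7.3)]
```

Applying the same pattern to the other numerical features would turn each well record into the categorical feature values that the itemset-mining step expects.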
(6) Valleys (Wadis). A depression that is longer than it is wide. The terms U-
shaped and V-shaped are descriptive geographical terms to characterise the form
of valleys. Most valleys belong to one of these two main types, or a mixture of
them, (at least) with respect to the cross-section of the slopes or hillsides
(Wikipedia, 2001). The valleys feature values are numerical, as shown in Fig. 3(f).
(7) Geological outcrop. The part of a rock-formation that appears above the
surface of the surrounding land (Howell, 1957). Extracted feature numerical
values [Fig. 3(g)] are replaced according to the following rules.
rithms, where the CBA algorithm is the worst algorithm in terms of predicting
groundwater locations. More specifically, the GwPAC algorithm outperformed
FACA, CMAR and CBA by 2.2%, 4.6% and 5.9%, respectively.
The F1 measures of the FACA, CBA, CMAR and GwPAC algorithms are also
shown in Table 1. Table 1 shows that the GwPAC algorithm outperforms the
FACA, CMAR and CBA algorithms: its F1 score is 2.1%,
10.8% and 13.4% higher than those of FACA, CMAR and CBA, respectively.
There are two fundamental reasons for the higher classification accuracy rate
achieved by the GwPAC algorithm. First, it uses multiple rules to predict groundwater
locations, unlike the CBA algorithm, which uses only one rule for classification.
It also differs from the CMAR and FACA algorithms. The CMAR algorithm
uses multiple rules to predict cases based on the chi-square method (Li et al., 2001); one
of the main drawbacks of this prediction method is its bias towards the minority class,
whilst the FACA prediction method is biased towards the majority class.
Another disadvantage of using a single rule is that the highest-confidence rule is
occasionally fruitless, particularly for datasets that have an imbalanced distribution
of classes, such as groundwater datasets (Thabtah et al., 2010). Therefore, using a
small subset of rules for predicting groundwater locations appears to be more
fruitful. The second reason for GwPAC's higher accuracy rate is that our GmR
prediction method uses both confidence and support constraints to evaluate the
rules, unlike other AC algorithms, which use only the confidence constraint to
evaluate the rules.
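The contrast between single-rule and multiple-rule prediction can be illustrated with a toy sketch. This is not the GmR method, whose details are not given in this section: the scoring used here (the mean of confidence × support over the matching rules of each class) is an assumption chosen only to show how support can change the outcome, and the rules, attribute values and numbers are hypothetical.

```python
from collections import defaultdict


def matches(rule, case):
    """A rule fires when every attribute-value pair in its body appears in the case."""
    return all(case.get(attr) == val for attr, val in rule["if"].items())


def predict_single(rules, case, default):
    """Single-rule prediction: use only the matching rule with the highest confidence."""
    hits = [r for r in rules if matches(r, case)]
    if not hits:
        return default
    return max(hits, key=lambda r: r["conf"])["then"]


def predict_group(rules, case, default):
    """Multiple-rule prediction (assumed scoring): average conf * supp per class."""
    scores = defaultdict(list)
    for r in rules:
        if matches(r, case):
            scores[r["then"]].append(r["conf"] * r["supp"])
    if not scores:
        return default
    return max(scores, key=lambda c: sum(scores[c]) / len(scores[c]))


# Hypothetical rules: one confident but rarely supported rule against two broader ones.
rules = [
    {"if": {"slope": "easy"}, "then": "dry", "conf": 0.90, "supp": 0.05},
    {"if": {"valley": "near"}, "then": "water", "conf": 0.85, "supp": 0.30},
    {"if": {"outcrop": "limestone"}, "then": "water", "conf": 0.80, "supp": 0.25},
]
case = {"slope": "easy", "valley": "near", "outcrop": "limestone"}

single = predict_single(rules, case, default="dry")
group = predict_group(rules, case, default="dry")
```

In this toy case the single-rule prediction follows the rare high-confidence rule and returns "dry", while the grouped prediction returns "water" because two well-supported rules agree.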
Table 1. The classification accuracies (%), F1 scores, numbers of rules and learning times of AC algorithms.
AC algorithm | Classification accuracy (%) | F1 score | Number of rules | Learning time (s)
Table 1 summarises the numbers of rules used by the GwPAC, FACA, CBA and
CMAR algorithms on the groundwater dataset. The CMAR algorithm
generates the highest number of rules, whilst the CBA algorithm generates the
lowest. Fig. 4 shows that the CMAR model contains many redundant
rules that reduce its classification accuracy, such as Rule 3, Rule 5, Rule 7 and Rule
10, whilst the CBA model contains a subset of rules that is not representative of the
groundwater locations in the input data. By contrast, the proposed ranking and
pruning methods, which the GwPAC algorithm implements, extract the minimal
number of rules that are representative of the groundwater location dataset; these
methods ensure that the remaining rules are high-quality rules that enhance the
classification accuracy rate of the GwPAC algorithm. Moreover, only two
instances are assigned to the default class because no rules cover them; this reflects
the quality of the remaining rules in the GwPAC model.
Furthermore, the GwPAC algorithm outperforms CMAR and CBA in terms of
learning time. The proposed fast rule discovery method that the GwPAC algorithm
implements (first step) needs only one scan of the input data
and then performs a simple intersection among the Diffsets of frequent rules of size
N − 1 to discover candidate rules of size N.
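The idea of scanning the data once and then growing rules by set operations can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it uses plain transaction-id sets (tidsets, as in Eclat) rather than Diffsets, which store differences between such sets, and all function names and the toy rows are hypothetical.

```python
from itertools import combinations


def build_tidsets(rows):
    """Scan the data once; record which row ids contain each item and each class."""
    item_tids, class_tids = {}, {}
    for tid, (features, label) in enumerate(rows):
        for item in features.items():  # item = (attribute, value)
            item_tids.setdefault(frozenset([item]), set()).add(tid)
        class_tids.setdefault(label, set()).add(tid)
    return item_tids, class_tids


def extend(prev, min_count):
    """Join frequent itemsets of size N-1 into frequent itemsets of size N."""
    out = {}
    for (a, ta), (b, tb) in combinations(prev.items(), 2):
        union = a | b
        if len(union) != len(a) + 1:
            continue  # the pair does not differ in exactly one item
        if len({attr for attr, _ in union}) != len(union):
            continue  # two values of one attribute never co-occur
        tids = ta & tb  # set intersection replaces a rescan of the data
        if len(tids) >= min_count:
            out[union] = tids
    return out


def rule_stats(tids, class_tids, label, n_rows):
    """Support and confidence of the rule 'itemset -> label'."""
    covered = tids & class_tids[label]
    return len(covered) / n_rows, len(covered) / len(tids)


# Hypothetical toy data: (features, class label) per well location.
rows = [
    ({"slope": "easy", "valley": "near"}, "water"),
    ({"slope": "easy", "valley": "near"}, "water"),
    ({"slope": "hard", "valley": "far"}, "dry"),
    ({"slope": "easy", "valley": "far"}, "dry"),
]
item_tids, class_tids = build_tidsets(rows)
singles = {k: v for k, v in item_tids.items() if len(v) >= 2}
pairs = extend(singles, min_count=2)
body = frozenset({("slope", "easy"), ("valley", "near")})
support, confidence = rule_stats(pairs[body], class_tids, "water", len(rows))
```

After the single pass in `build_tidsets`, every larger candidate is obtained from sets already held in memory, which is the property the paragraph above credits for the short learning time.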
Finally, all algorithms produce
acceptable classification accuracy rates and F1 scores; this reflects the
relevance of the features in the groundwater dataset.
Fig. 5. The accuracy rates of GwPAC, CBA, CMAR and FACA as the minimum support changes with
fixed minimum confidence = 0.60.
Fig. 6. The accuracy rates of GwPAC, CBA, CMAR and FACA as the minimum confidence changes
with fixed minimum support = 0.05.
algorithms. The classification accuracy rate decreases for the CMAR algorithm when
the minimum confidence changes from 0.70 to 0.80. These results indicate that a rule
with a confidence value greater than 0.80 is considered a high-quality rule for the
CBA, FACA and GwPAC algorithms. Furthermore, a rule with a confidence value
greater than 0.70 is considered a high-quality rule for CMAR, where pruning such
rate) and costs (false positive rate) in which the true positive rate is plotted on the
y-axis and the false positive rate is plotted on the x-axis. The true positive rate (also
called recall) of an algorithm is calculated as follows:
\[ \text{True positive rate} = \frac{\text{positives correctly classified}}{\text{total positives}} \tag{3} \]
The false positive rate of an algorithm is calculated as follows:
\[ \text{False positive rate} = \frac{\text{negatives incorrectly classified}}{\text{total negatives}} \tag{4} \]
Several aspects of the ROC curve are worthy of mention. The lower left point (0, 0)
represents the classification of all instances as negative; point (1, 1) represents the
classification of all instances as positive. Point (0, 1) represents the best classification.
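The two rates in Eqs. (3) and (4), and the three reference points just described, can be checked with a short sketch. The function name `roc_point`, the 0/1 label encoding and the example labels are assumptions made for illustration; they are not taken from the study's data.

```python
def roc_point(y_true, y_pred, positive=1):
    """Return (false positive rate, true positive rate) for one set of predictions."""
    tp = fp = pos = neg = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            pos += 1
            tp += pred == positive  # positive correctly classified
        else:
            neg += 1
            fp += pred == positive  # negative incorrectly classified
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative instance")
    return fp / neg, tp / pos


# Hypothetical labels: 1 = location contains water, 0 = it does not.
y_true = [1, 1, 1, 0, 0]
all_negative = roc_point(y_true, [0, 0, 0, 0, 0])
all_positive = roc_point(y_true, [1, 1, 1, 1, 1])
perfect = roc_point(y_true, y_true)
mixed = roc_point(y_true, [1, 1, 0, 1, 0])
```

Here `all_negative`, `all_positive` and `perfect` land on the points (0, 0), (1, 1) and (0, 1) respectively, while `mixed` gives a false positive rate of 1/2 and a true positive rate of 2/3.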
Figure 7 depicts the ROC curves for the GwPAC, FACA, CBA and CMAR
algorithms. It is clear from the figure that the GwPAC algorithm performs better
than CBA, FACA and CMAR, and that CBA is the poorest algorithm for detecting
new groundwater locations. More specifically, GwPAC is generally better than the
CMAR algorithm, except at 0.20 < false positive rate < 0.65, where the CMAR
algorithm has a minor benefit.
Figure 7 also shows the areas under the ROC curves (AUC). GwPAC has
the largest area and, therefore, the best average performance. The AUC values for
GwPAC, FACA, CMAR and CBA are 0.892, 0.832, 0.868 and 0.551, respectively.
In comparison with the AUC classification in the study by Yesilnacar (2005), it can
be seen that the GwPAC, FACA and CMAR algorithms (all AUCs > 0.80) applied
in this study produce reasonably good classification accuracy rates in the prediction
of new groundwater locations. Based on the classification accuracies achieved,
it can be observed that the AC algorithms, especially GwPAC, FACA and
CMAR, can be applied as efficient data mining algorithms in predicting new
groundwater locations in Jordan. However, the CBA algorithm (AUC = 0.551) is
shown to have a poor classification accuracy rate in predicting new groundwater
locations.
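For readers who want to see how an area under a ROC curve is obtained from its points, the following is a minimal sketch using the trapezoidal rule. The function name `auc_trapezoid` and the example points are hypothetical; the curves in Fig. 7 are not reproduced here, and the study may have computed its AUC values differently.

```python
def auc_trapezoid(points):
    """Area under a ROC curve given (false positive rate, true positive rate) points."""
    pts = sorted(points)  # order by false positive rate, then true positive rate
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2  # trapezoid between neighbouring points
    return area


# Reference cases: the diagonal (random guessing) and a perfect classifier.
chance = auc_trapezoid([(0, 0), (1, 1)])
perfect = auc_trapezoid([(0, 0), (0, 1), (1, 1)])
# A hypothetical three-point curve.
example = auc_trapezoid([(0, 0), (0.5, 1.0), (1, 1)])
```

The diagonal yields an area of 0.5, which is why an AUC of 0.551 indicates performance only slightly better than random guessing, whereas values near 0.9 indicate a curve that stays close to the point (0, 1).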
6. Conclusions
The problem of the exploration of groundwater locations is an important topic of
research, especially in arid and semi-arid regions. Scholars have used several algo-
rithms to address this problem, such as SVMs, ANN and Random Forest. In this
paper, we propose a new AC algorithm, GwPAC, and investigate its performance
methods, indicates that the GwPAC algorithm outperforms CBA, FACA and
CMAR in terms of classification accuracy, F1 score, learning time and AUC. In
particular, in relation to classification accuracy, the GwPAC algorithm outperformed
CMAR, FACA and CBA by 4.6%, 2.2% and 5.9%, respectively. The F1
score results show that the GwPAC algorithm performs better than the CMAR,
FACA and CBA algorithms, with F1 scores 10.8%, 2.1% and
13.4% higher than those of CMAR, FACA and CBA, respectively. Furthermore,
the GwPAC algorithm builds the model faster than CMAR and CBA, and its learning time is
similar to that of FACA. In fact, GwPAC employs the Diffsets method, which requires
only one scan of the input data to discover all the candidate rules. The AUCs
produced by the GwPAC, CMAR, FACA and CBA algorithms were 0.892, 0.868, 0.832
and 0.551, respectively. Thus, the GwPAC algorithm performs better than the CMAR,
FACA and CBA algorithms.
Further, the GwPAC algorithm produces classification accuracy lower than the
SVM, NB, Random Forest, KNN, C4.5 and ANN algorithms by 4.6%, 3.5%, 3.5%,
3.4%, 3.4% and 2.5%, respectively. Nevertheless, all data mining algorithms produce
acceptable classification accuracy, i.e. higher than 79%.
Based on these results, we conclude that data mining algorithms, especially
the GwPAC algorithm, can be useful and appropriate methods for addressing the
problem of exploring new groundwater locations in Jordan. The rule-pruning procedure
in the GwPAC algorithm reduces the number of rules discovered, which positively
affects the classification accuracy on test cases and makes the results easier for the
end-user to understand and manipulate.
In the near future, we would like to carry out the following work:
(1) Investigate GwPAC and all considered algorithms in terms of time and space
complexity.
(2) Evaluate our proposed algorithm using the filter strategy (García-Borroto et al.,
2010) pruning method, and compare the results with those obtained by our
pruning method.
(3) Implement the weighted prediction method (Loyola-Gonzalez et al., 2017)
within the GwPAC algorithm, and compare the results with those achieved by
our GmR prediction method.
(4) Repeat all comparisons on different datasets from the UCI machine learning
repository (Lichman, 2013) to generalise the performance of the GwPAC
algorithm.
(5) Extend our groundwater dataset to include more governorates and more
features such as topographic and geological features (Rahmati et al., 2016).
References
Abdelhamid, N, A Ayesh, F Thabtah, S Ahmadi and W Hadi (2012). MAC: A multiclass
associative classification algorithm. Journal of Information & Knowledge Management,
11(2), 1250011, doi:10.1142/S0219649212500116.
Abdelhamid, N, A Ayesh and W Hadi (2014). Multi-label rules algorithm based associative
Hall, M, E Frank, G Holmes, B Pfahringer, P Reutemann and IH Witten (2009). The WEKA
data mining software. ACM SIGKDD Explorations Newsletter, 11(1), 10, doi:10.1145/
1656274.1656278.
Howell, JV (1957). Glossary of Geology and Related Sciences. Alexandria: American Geo-
logical Institute. Available at http://www.abebooks.co.uk/servlet/BookDetailsPL?
bi=18523214914&searchurl=an%3DHowell%252C%2520J.%2520V.%2520%2528Ameri-
Quinlan, JR (1993). C4.5: Programs for Machine Learning. San Francisco, CA: Morgan
Kaufmann.
Rahmati, O, HR Pourghasemi and AM Melesse (2016). Application of GIS-based data driven
random forest and maximum entropy models for groundwater potential mapping: A case
study at Mehran Region, Iran. Catena, 137, 360–372, doi:10.1016/j.catena.2015.10.010.
Sahoo, S and MK Jha (2013). Groundwater-level prediction using multiple linear regression