Journal of Computational Science 63 (2022) 101779

journal homepage: www.elsevier.com/locate/jocs

Assessment of groundwater arsenic contamination level in Jharkhand, India using machine learning
Siddharth Kumar ∗, Jayadeep Pati
Department of Computer Science and Engineering, Indian Institute of Information Technology Ranchi, Namkum, Ranchi, 834010, Jharkhand, India

ARTICLE INFO

Keywords: Naive Bayes; Multilayer Perceptron; Random Forest; Decision tree

ABSTRACT

This paper presents a machine learning approach for assessing groundwater arsenic contamination levels in Jharkhand, India. Water is essential for sustaining life, and the presence of heavy metals such as arsenic poses carcinogenic and non-carcinogenic risks. In this study, several machine learning models, viz. Decision tree, Random Forest, Multilayer Perceptron, and Naive Bayes, were applied to classify the samples as safe or unsafe, taking the provisional guideline value of 0.01 mg/l as the benchmark. For classification, the parameters DEM, subsoil clay content, subsoil silt content, subsoil sand content, subsoil organic content, type of soil, and LULC were considered. Pearson correlation showed both positive and negative relationships between the considered parameters and arsenic occurrence. The selected parameters were used for the classification of arsenic, and evaluation criteria such as accuracy, sensitivity, and specificity were used to analyze the models' performance. Among the models, the Random Forest classifier outperformed the other classifiers. Thus, the Random Forest model can be used to approximate the number of people prone to arsenic contamination.

1. Introduction

Groundwater plays an important role in sustaining the terrestrial and aquatic ecosphere. Ninety-eight percent of the freshwater resource available on the planet is in the form of groundwater; the remaining 2 percent is part of the cryosphere. Contamination of groundwater with arsenic (As) is a growing concern because of its carcinogenic and non-carcinogenic risk, specifically in the plains of the Indo-Gangetic region of India [1]. Degradation of groundwater quality is due to runoff and leaching of minerals into groundwater aquifers caused by the weathering of rocks [2]. Other factors, i.e., the geochemical and mineralogical composition of aquifers and anthropogenic sources, are also responsible for water quality degradation. An arsenic concentration in an aquifer above the guideline set by the World Health Organization, 10 μg/L or 0.01 mg/l [3], poses a severe health risk [4]. Continuous consumption of water with elevated arsenic levels affects the liver, bladder, and cardiovascular system. It also affects the nervous system of children, is responsible for acute skin-related conditions, and causes lung cancer [5]. According to the Agency for Toxic Substances and Disease Registry (ATSDR) [6], arsenic contamination in groundwater has been graded as a top substance affecting human health.

In the literature, most of the work assesses the risk associated with continuous consumption of, or dermal contact with, arsenic-contaminated groundwater based on the spatial distribution map of arsenic. Groundwater arsenic contamination is due to geologic, topographic, sediment-related, biogeochemical, hydro-geologic, and anthropogenic factors [7,8]. Various studies suggest that groundwater aquifers are contaminated with arsenic through the dissolution of minerals found in Quaternary deposits [9,10]. As the Himalayan glaciers melt, the arsenic-rich sediments carried by rivers are flooded onto the plains of Uttar Pradesh [11], Bihar [12], Jharkhand [13], West Bengal [14–16], Assam [17], Punjab and Haryana [18]. There are various potential mechanisms responsible for aquifers being contaminated with arsenic in the plains of the river Ganga and other parts of India. The first is the oxidation of arsenic-rich pyrite present in aquifers [19]. The second is the reduction of arsenic-bearing Fe(III) oxyhydroxides present in aquifers [20]. Moreover, other intrinsic factors such as geology, hydro-geochemistry, and geomorphology, and exogenous parameters such as mining-related activities, agricultural pumping, and LULC, are accountable for the contamination of groundwater aquifers with arsenic [20].

In recent years several machine learning models have been created to assess the quality of arsenic-contaminated groundwater. The logistic regression model was used by [21] to create a spatial distribution map of arsenic worldwide. Further, the logistic regression model has been used to assess the risk of Shanxi province of China [22] considering

∗ Corresponding author.
E-mail addresses: siddharth.rs@iiitranchi.ac.in (S. Kumar), jayadeeppati@iiitranchi.ac.in (J. Pati).

https://doi.org/10.1016/j.jocs.2022.101779
Received 6 December 2021; Received in revised form 5 July 2022; Accepted 11 July 2022
Available online 20 July 2022
1877-7503/© 2022 Elsevier B.V. All rights reserved.
S. Kumar and J. Pati Journal of Computational Science 63 (2022) 101779

soil, topology, and geological parameters. Some studies considered Holocene sediments, soil salinity, texture, and the topographic wetness index to predict arsenic contamination of groundwater [23]. A study of arsenic distribution in Cambodia's groundwater [24] used principal components and regression-kriging. In 2013, a linear regression model was used to assess the arsenic evolution in the groundwater of China's Datong and Hetao basins [25]. Moreover, Artificial Neural Networks (ANN) were used to detect arsenic in the groundwater of Cambodia, Laos, and Thailand [26]. Furthermore, a Bayesian modeling approach [27] was used to characterize the arsenic groundwater contamination of the Mekong river basin. A hybrid Random Forest model was proposed by [28] to create an arsenic distribution model for the state of Uttar Pradesh, India. Logistic regression was used by [29] to model the arsenic distribution in the state of Gujarat, India.

Machine learning models can establish the relationship between different parameters more accurately. The country-level model created by [30] using random forest considered field observations of arsenic collected from March to July 2019 from the Ministry of Jal Shakti [31], Government of India, database. However, very few water samples from wells and tube wells were tested for Jharkhand during that period. Furthermore, the country-level model created by [32] considered secondary data collected from various published sources. Some arsenic data for different states were collected from the Central Ground Water Board, Groundwater Year Book [32], for 2015–2016 and 2019–2020. However, no detailed model had been created to assess the risk of arsenic contamination in Jharkhand, India. The primary water sources in Jharkhand are wells, borewells, and surface water such as rivers, lakes, and ponds. Various projects have been implemented with the World Bank's assistance for proper drinking water supply and sanitation in Jharkhand. As per the latest IMIS reports, out of 24 districts, only 4 districts have tap water connections in the range of 25%–50%, 14 districts have 10%–25%, and 8 districts have less than 10% tap water connections. Therefore, most of the population depends on groundwater for daily use. An arsenic investigation approach is needed to monitor and manage public health related to arsenic poisoning in the state of Jharkhand.

Thus, the study is significant because the machine learning model created uses relevant input parameters, selected based on Pearson's correlation, to classify the arsenic contamination level in Jharkhand. The model can be used for the classification of new groundwater samples. Moreover, based on the model's accuracy, the health risk of the people living in affected zones was approximated. This work applies different classification algorithms, namely Decision tree, Random Forest, Multilayer Perceptron, and Naive Bayes, to classify the samples as safe or unsafe. The models were analyzed based on assessment criteria such as accuracy, sensitivity, and specificity.

2. Materials and methods

2.1. Study area and arsenic data acquisition

Jharkhand is located in the eastern part of India and has an area of 79,714 km² (30,778 sq mi) (Fig. 1). It lies between 83° 329646′ to 87° 962287′ E longitude and 21° 970346′ to 25° 349452′ N latitude. The state of Jharkhand is made of Proterozoic, Late Paleozoic–Mesozoic, Archean, and Tertiary rock successions. The state is rich in mineral deposits, i.e., coal, iron ore, bauxite, limestone, copper, etc. Out of the total area of Jharkhand, 23,611.41 km², i.e., 29.62% of the state's geographical region, is covered with forest. As per the report of the Indian census, the state's population is 32.96 million, and the population density is 414 per km². There are a total of 32,620 villages in Jharkhand. Out of the total population, 75.95% of people live in rural areas, whereas 24.05% live in urban areas. As per the Jal Jeevan Mission report, there are 59,23,320 households in Jharkhand, and to date, 9,51,317 households have been provided with a tap water connection, which is 16.06%. The remaining households, mostly in rural areas, depend on bore wells, wells, lakes, and ponds for day-to-day use.

A total of 32,620 arsenic concentration records were collected from January 2012 to December 2020 at the village level for a broader and more robust analysis of arsenic contamination levels in Jharkhand. The groundwater sample data were collected from the Ministry of Jal Shakti, National Rural Drinking Water Programme (NRDWP), Government of India, database [32]. NRDWP monitors groundwater quality through a network of wells at the village level throughout the year. The groundwater samples were laboratory tested to detect the presence of heavy metals such as arsenic using Atomic Absorption Spectrophotometry (AAS) and Ultraviolet-Visible (UV–VIS) spectrophotometry.

2.2. Spatial analysis and data extraction

Based on the literature and its findings, seven independent predictor variables were considered for classification. The data collected from multiple sources and used for modeling are shown in Table 1. The data for the independent variables, i.e., subsoil clay content, subsoil silt content, subsoil sand content, subsoil organic content, Land Use and Land Cover (LULC), and the Digital Elevation Model (DEM) of Jharkhand, were acquired from various databases in raster format. The datasets were projected in the Universal Transverse Mercator (UTM) coordinate system with the World Geodetic System (WGS) 1984 to maintain uniformity among the parameters. These independent variables were indirectly or directly associated with the occurrence of arsenic in groundwater.

Table 1
Database sources used for classification.

Database          Organization                                   Website
Clay content      Soil grids [33]                                https://soilgrids.org/
Silt content      Soil grids                                     https://soilgrids.org/
Sand content      Soil grids                                     https://soilgrids.org/
Organic content   Soil grids                                     https://soilgrids.org/
Type of soil      Soil grids                                     https://soilgrids.org/
LULC              Esri [34]                                      https://www.arcgis.com/
DEM               United States Geological Survey (USGS) [35]    https://earthexplorer.usgs.gov/

ArcGIS 10.8 software was used to create the spatial maps of subsoil clay content, subsoil silt content, subsoil sand content, subsoil organic content, Land Use and Land Cover (LULC), and the Digital Elevation Model (DEM). A spatial map was created for all parameters at 30–60 cm depth as a raster (.tif) file. After downloading all the raster files covering the latitude and longitude of Jharkhand, a mosaic operation was performed for uniformity in the raster data. A shapefile of Jharkhand was layered on the combined raster files, and a clip operation was performed to obtain the spatial maps of the parameters.

After creating a spatial variability map of all the independent parameters, the village-level boundary shapefile of Jharkhand was projected on top of each independent variable. As Jharkhand has a total of 32,620 villages, as many data points were created on top of each variable. Using the extraction tool in ArcGIS, the village-level data of Jharkhand for each independent variable were extracted.

2.3. Statistical analysis

Before using the independent variables for model creation, all independent variables were statistically tested using IBM SPSS Statistics 25 to find their relation to arsenic occurrence. The univariate feature selection method, Pearson's correlation, was used to find the direct or indirect association between the independent variables and arsenic occurrence. Before performing feature selection, the Null Hypothesis (H0) was assumed to be that there is no relation between the occurrence of arsenic and the independent variables, i.e., the correlation value will be 0. Moreover, the Alternate Hypothesis (H1) was assumed to be that there


Fig. 1. Location map of the study area.

is either a positive or negative relationship between the occurrence of arsenic and the independent variables, i.e., the value ranges from −1 to 1, excluding 0. The correlation obtained from Pearson's correlation was significant at the 0.01 level (2-tailed).

2.4. Implementation of machine learning classifiers

2.4.1. Decision tree

The decision tree is a widely used algorithm for classification problems. Internal nodes represent the parameters in the decision tree, and branches represent the decision rules. Leaf nodes represent the classification result. A decision tree makes a decision at each node and, to arrive at a class, further splits the tree into subtrees. The extracted data were pre-processed and divided into 70:30 training and testing sets. The entropy criterion for splitting the nodes and a maximum depth of 5 were chosen for fitting the decision tree.

2.4.2. Random forest

Random Forest is a machine learning algorithm used for classification and regression tasks. Random Forest draws several subsets from a dataset, and a decision tree is created on each of those subsets. For classification, Random Forest combines the predictions of all the decision trees by majority vote to give a final prediction. The extracted data were pre-processed and divided into 70:30 training and testing sets. For fitting the Random Forest, 300 estimators, entropy as the criterion, the square root of the number of features as the maximum number of features, and a maximum depth of 10 were chosen for better accuracy.

2.4.3. Multilayer perceptron

MLP is a feed-forward neural network. The output of each layer is fed as input to the next layer through weighted connections: the output of the nodes is multiplied by these weights before reaching the input of the next layer. For training and testing the model, two hidden layers with the ReLU activation function and a maximum of 1000 iterations were used for classification. The extracted data were pre-processed and divided into 70:30 training and testing sets.

2.4.4. Naive Bayes

Naive Bayes is a classification algorithm based on Bayes' theorem. It uses Bayes' theorem to predict the class based on the probability of occurrence of a class. The extracted data were pre-processed and divided into 70:30 training and testing sets. The variables contain continuous data; therefore, Gaussian Naive Bayes was used for classification.

3. Results and discussion

3.1. Spatial analysis of parameters

Soil is a heterogeneous material found on the upper surface of the earth. It consists of a mixture of air, water, and solid particles. Clay is the fine-grained inorganic fraction with a particle size of less than 0.002 mm. The subsoil clay fraction in the state of Jharkhand ranges from 0 to 523 g/kg. From the spatial analysis of the subsoil clay fraction, as shown in Fig. 2.a, it was observed that areas near water bodies had low clay content compared to sites away from water bodies.

The organic content in soil is derived from the decomposition of plant and animal debris in the soil. It makes up less than 10% of the soil composition. It consists of two types of fractions, living and non-living. The organic content in the state of Jharkhand ranges from 0 to 383 dg/kg. From the spatial analysis of subsoil organic content (Fig. 2.b), it was observed that areas near water bodies and forests had high organic content compared to sites away from water bodies.

Sands are coarse-grained inorganic fragments with a particle size of 0.06–2 mm. The subsoil sand content of Jharkhand (Fig. 3.a) was derived from the soil grid database in digital format. The sand content ranges from 0 to 611 g/kg. From the spatial analysis of subsoil sand content, it was observed that areas at lower elevation above sea level had high sand content compared to sites at high elevation.

DEM represents the earth's surface elevation, excluding buildings, other surface objects, and forests. The DEM of Jharkhand was derived from the USGS database in the form of raster files. The elevation of Jharkhand ranges from −150 m to 1308 m with reference to sea level. The spatial analysis of the DEM of Jharkhand is shown in Fig. 3.b.
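To make the Gaussian Naive Bayes classifier and the 70:30 split of Section 2.4 more concrete, the following is a minimal, self-contained sketch in plain Python. It is not the authors' implementation: the two-feature data are synthetic and made up for illustration, and the helper names (`split_70_30`, `fit_gaussian_nb`, `predict`) are hypothetical.

```python
import math
import random
from collections import defaultdict


def split_70_30(rows, seed=0):
    """Shuffle a copy of the rows and return (train, test) in a 70:30 ratio."""
    rng = random.Random(seed)
    rows = rows[:]
    rng.shuffle(rows)
    cut = len(rows) * 7 // 10
    return rows[:cut], rows[cut:]


def fit_gaussian_nb(train):
    """Estimate the prior and the per-feature mean and variance of each class."""
    by_class = defaultdict(list)
    for features, label in train:
        by_class[label].append(features)
    model = {}
    for label, samples in by_class.items():
        n = len(samples)
        columns = list(zip(*samples))
        means = [sum(col) / n for col in columns]
        # A small constant keeps every variance strictly positive.
        variances = [sum((x - m) ** 2 for x in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(train), means, variances)
    return model


def predict(model, features):
    """Return the class with the highest Gaussian log posterior."""
    def log_posterior(prior, means, variances):
        score = math.log(prior)
        for x, m, v in zip(features, means, variances):
            score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        return score
    return max(model, key=lambda label: log_posterior(*model[label]))


# Synthetic data (an assumption, not the study's data):
# two continuous features per sample, label 0 = safe, 1 = unsafe.
rng = random.Random(42)
data = [([rng.gauss(300, 30), rng.gauss(150, 20)], 0) for _ in range(200)]
data += [([rng.gauss(400, 30), rng.gauss(250, 20)], 1) for _ in range(200)]

train, test = split_70_30(data)
model = fit_gaussian_nb(train)
accuracy = sum(predict(model, x) == y for x, y in test) / len(test)
print(f"train={len(train)} test={len(test)} accuracy={accuracy:.2f}")
```

The sketch treats every feature as an independent Gaussian within each class, which is the assumption that makes the classifier "naive"; categorical inputs such as type of soil or LULC would need a separate encoding that the source does not describe.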


Fig. 2. Spatial distribution map of (a) Subsoil clay fraction (g/kg), (b) Subsoil organic content (dg/kg) in Jharkhand.

Fig. 3. Spatial distribution map of (a) Subsoil sand fraction (g/kg), (b) DEM (m) of Jharkhand.

The different types of soil found in Jharkhand were downloaded from the soilgrid database as raster files. In Jharkhand, 30 different types of soil were found; as per the literature, Fluvisols are responsible for the occurrence of arsenic. Fluvisols form from deposits brought down by the rivers and are younger alluvial deposits rich in iron. The spatial map of the various types of soil of Jharkhand is shown in Fig. 4.a.

Subsoil silt consists of fine-grained inorganic fragments with a particle size of 0.002–0.06 mm. The spatial analysis of subsoil silt content is shown in Fig. 4.b. The areas near water bodies had very low silt content compared to areas away from water bodies.

The LULC database of Jharkhand was extracted from the Esri website and was published in July 2021. The database was prepared using Sentinel 2 satellite imagery. The LULC of Jharkhand consists of 9 classes, i.e., water, trees, grass, flooded vegetation, crops, shrubs, built area, bare ground, and cloud. The spatial analysis of the LULC of Jharkhand is shown in Fig. 5.a.

A spatial variability map of arsenic in the groundwater of Jharkhand (Fig. 5.b) was created using the Inverse Distance Weighting (IDW) spatial interpolation method in ArcGIS 10.8. A total of 32,620 records were collected from January 2012 to December 2020 for each village in Jharkhand from the Ministry of Jal Shakti, National Rural Drinking Water Programme (NRDWP), Government of India, database [32]. As per the World Health Organisation (WHO) guideline, an arsenic concentration ≥0.01 mg/l poses serious health concerns. Based on this criterion, the data points where the arsenic concentration is <0.01 mg/l


Fig. 4. Spatial distribution map of (a) Types of soil, (b) Subsoil silt content (g/kg) in Jharkhand.

Fig. 5. Spatial distribution map of (a) LULC, (b) Arsenic concentration (mg/l) of Jharkhand.

were termed safe (0); all others were marked unsafe (1). The Inverse Distance Weighting method presumes that the similarity and degree of association between neighbors depend on the distance between them. The weight of a sample in IDW decreases as the distance between the sample and the point being estimated increases. The weights in IDW are thus distributed according to the distance from the estimated point. A higher power reduces the influence of samples farther from the estimated point and concentrates the weight on the nearest samples. Inverse Distance Weighting assumes that an estimated point is more closely related to its nearby neighbors than to distant samples.

3.2. Data extraction from each independent variable

A total of 32,620 samples were extracted for eight variables, i.e., arsenic and the seven independent variables. The descriptive statistics and data description are shown in Table 2. The extracted value of subsoil clay content for the state of Jharkhand is of continuous type and was measured in g/kg. Similarly, DEM is of continuous type and was measured in meters. Moreover, the type of soil was categorical data, and a total of 30 types of soil were found in Jharkhand.
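The inverse distance weighting idea and the safe/unsafe labeling against the 0.01 mg/l guideline can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the ArcGIS implementation used in the study: the sample coordinates and concentrations are made up, and the function names `idw_estimate` and `label` and the default power of 2 are assumptions.

```python
import math


def idw_estimate(samples, target, power=2.0):
    """Inverse distance weighted estimate at `target`.

    samples: list of ((x, y), value) pairs; target: (x, y).
    Each sample is weighted by 1 / distance**power, so nearer samples
    contribute more to the estimate than distant ones.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for (x, y), value in samples:
        distance = math.hypot(x - target[0], y - target[1])
        if distance == 0.0:
            return value  # the target coincides with a measured sample
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


def label(concentration_mg_l, guideline=0.01):
    """Return 0 (safe) below the guideline value and 1 (unsafe) otherwise."""
    return 0 if concentration_mg_l < guideline else 1


# Hypothetical arsenic samples: ((x, y), concentration in mg/l).
samples = [((0.0, 0.0), 0.002), ((1.0, 0.0), 0.050), ((0.0, 1.0), 0.004)]

near_clean = idw_estimate(samples, (0.1, 0.1))
near_contaminated = idw_estimate(samples, (0.9, 0.0))
print(label(near_clean), label(near_contaminated))
```

Raising `power` pulls the estimate further toward the nearest sample, which is the behavior the text attributes to a high power value.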


Table 2
Descriptive statistic and data description.

Variable                  Type          Max    Min   Mean     Mode   Median
Arsenic (mg/l)            Continuous    0.98   0     0.19     0      0.12
Clay (g/kg)               Continuous    526    0     356.01   325    351
DEM (m)                   Continuous    1308   −43   222.65   144    184
Organic content (dg/kg)   Continuous    383    0     61.13    52     56
Sand (g/kg)               Continuous    611    0     305.68   324    318
Silt (g/kg)               Continuous    482    0     336.35   335    339
Type of soil              Categorical   NA     NA    NA       NA     NA
LULC                      Categorical   NA     NA    NA       NA     NA

Table 4
Confusion matrix and results obtained for machine learning algorithms.

Algorithm       Class    Safe   Unsafe   Accuracy (%)   Sensitivity (%)   Specificity (%)
Random forest   Safe     3327   819      90.11          80.24             97.33
                Unsafe   151    5518
Decision tree   Safe     3284   862      84.65          79.20             88.63
                Unsafe   644    5025
MLP             Safe     3012   1149     82.77          72.64             90.17
                Unsafe   542    5112
Naive Bayes     Safe     3793   953      86.65          77.01             93.70
                Unsafe   357    5312
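As a worked check of how the scores in Table 4 relate to the confusion matrix, the sketch below recomputes the Random Forest row in plain Python. It assumes, since the source does not state it, that the rows of each matrix are the actual classes and the columns the predicted classes; the function name `metrics_from_confusion` is hypothetical.

```python
def metrics_from_confusion(matrix):
    """Return (accuracy, row-0 recall, row-1 recall) in percent.

    matrix[i][j] is the number of samples of row class i assigned to
    column class j (assumed here: rows = actual, columns = predicted).
    """
    (a, b), (c, d) = matrix
    total = a + b + c + d
    accuracy = 100.0 * (a + d) / total
    recall_row0 = 100.0 * a / (a + b)
    recall_row1 = 100.0 * d / (c + d)
    return accuracy, recall_row0, recall_row1


# Random Forest counts from Table 4; rows are Safe and Unsafe.
random_forest = [[3327, 819], [151, 5518]]
accuracy, safe_recall, unsafe_recall = metrics_from_confusion(random_forest)
print(f"{accuracy:.3f} {safe_recall:.3f} {unsafe_recall:.3f}")
```

Under this reading, the recall of the Safe row matches the value reported as sensitivity (80.24%), the recall of the Unsafe row matches the value reported as specificity (97.33%), and the overall accuracy matches 90.11%, each to within 0.01 percentage points.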

LULC is a categorical data type, and there are 9 classes of land use and land cover patterns found in Jharkhand. The value of subsoil organic content extracted for Jharkhand is of continuous data type and was measured in dg/kg. The value of subsoil sand content extracted for Jharkhand is of continuous data type and was measured in g/kg. Moreover, the value of subsoil silt content extracted for Jharkhand is of continuous data type and was measured in g/kg. The value of arsenic distribution extracted for Jharkhand is of continuous type and was measured in mg/l. The data extracted from the arsenic distribution across Jharkhand were converted into two classes (0 and 1) based on the WHO guideline value (0.01 mg/l). In total, 32,620 arsenic values were extracted, one for each village of Jharkhand; in terms of the binary class, 13,825 data points were less than 0.01 mg/l and 18,795 data points were ≥0.01 mg/l. Before applying attribute selection, feature engineering was performed on the dataset to remove outliers and missing values and thereby obtain better accuracy.

3.3. Variable selection using Pearson's correlation

Pearson's correlation found that subsoil clay, subsoil organic, and subsoil sand content were negatively associated with arsenic occurrence. Moreover, DEM, LULC, type of soil, and silt were positively associated with arsenic occurrence.

Thus, the Null Hypothesis (H0) was rejected based on the result obtained, and the Alternate Hypothesis (H1) was accepted. Therefore, the independent variables shown in Table 3 were considered for model creation to classify arsenic as safe or unsafe in Jharkhand.

Table 3
Pearson correlation with occurrence of arsenic.

Variable          Pearson's correlation with arsenic
Clay content      −.020
Silt content      .035
Sand content      −.081
Organic content   −.063
Type of soil      .021
LULC              .036
DEM               .027

3.4. Classification using machine learning algorithms

The result obtained from Pearson's correlation shows that all independent variables were associated with arsenic occurrence. After correlation analysis, the machine learning algorithms, viz. Decision tree, Random Forest, Multilayer Perceptron, and Naive Bayes, were trained and tested on the extracted dataset. The confusion matrix and the resulting accuracy, sensitivity, and specificity were used to analyze the performance of the machine learning classifiers. The model with the highest overall performance in terms of accuracy (%), sensitivity (%) (True Positive Rate), and specificity (%) (True Negative Rate) was termed the best-suited model for the classification of arsenic in Jharkhand.

The Random Forest algorithm had the highest accuracy, 90.11% (Table 4), compared to the other machine learning models (Fig. 6). The ability of the model to correctly classify the content of arsenic in the groundwater of Jharkhand as ≥0.01 mg/l was represented by sensitivity, or True Positive Rate (TPR). After applying the machine learning algorithms, it was observed that the Random Forest algorithm had the highest sensitivity, 80.24%, compared with the other algorithms (Fig. 7.a). Furthermore, specificity measures the ability of a model to correctly classify the content of arsenic in the groundwater of Jharkhand as <0.01 mg/l. From the simulation, it was observed that Random Forest had the highest specificity, 97.33%, compared to the other models (Fig. 7.b).

Finally, from the results obtained, it was concluded that the Random Forest model had the highest overall accuracy of 90.11%, sensitivity of 80.24%, and specificity of 97.33% with respect to the other algorithms. Thus, the Random Forest model's accuracy can be used to approximate the number of people prone to arsenic contamination.

3.5. Comparison of model performance with previous studies

Very little research has been performed to assess the groundwater quality and potential risk of arsenic contamination in Jharkhand. The previous studies [13,36] assessed groundwater quality and health risk in the Sahibganj and Ranchi districts of Jharkhand, India. This study applied machine learning techniques to groundwater arsenic contamination in Jharkhand, India. Four machine learning algorithms, viz. Decision tree, Random Forest, Multilayer Perceptron, and Naive Bayes, were trained and tested to classify data as safe or unsafe. Based on the results obtained, the Random Forest model performed best compared to the other algorithms. A comparison of the results obtained in this study and in previous research is shown in Table 5. The accuracy obtained by this study was better than that of the other state-level studies carried out by [26,29]. Moreover, the accuracy of this study was better than that of the country-level research carried out by [32].

The country-level model created by [30] considered field observations of arsenic collected from March to July 2019 from the Ministry of Jal Shakti, Government of India, database. However, careful analysis of the data for that period showed that very few water samples from wells and tube wells were tested for Jharkhand. For a broader and more robust analysis of arsenic levels in Jharkhand, a total of 32,620 arsenic concentration records were collected from January 2012 to December 2020 at the village level. The accuracy obtained by this study is comparable to the accuracy obtained by [30].

3.6. Application of machine learning classifier model outcomes

3.6.1. Approximation of population affected by arsenic contamination

As per the India census, the population of Jharkhand is 32,988,134 people, with a population density of 414 per sq km. Moreover, this study found that, of Jharkhand's 79,714 sq km, 5593 sq km had a high arsenic concentration (≥0.01 mg/l).

\[ PE = \text{Rural Population} \times \text{Accuracy of Random Forest Model} \tag{1} \]
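The arithmetic behind Eq. (1) can be sketched as follows. The source does not spell out how the population term is obtained; one reading that reproduces the reported figure of approximately 2,086,499 people is to multiply the high-arsenic area (5593 sq km) by the state population density (414 per sq km) and then by the Random Forest accuracy (90.11%). That reading, and the names used below, are assumptions made for illustration.

```python
# Values quoted in Section 3.6.1 and Table 4 of the source.
AREA_HIGH_ARSENIC_KM2 = 5593        # area with arsenic >= 0.01 mg/l
POPULATION_DENSITY_PER_KM2 = 414    # census population density
RF_ACCURACY = 0.9011                # Random Forest accuracy (90.11%)


def people_exposed(area_km2, density_per_km2, accuracy):
    """Eq. (1) under the assumption that the population term is area x density."""
    population_in_area = area_km2 * density_per_km2
    return population_in_area * accuracy


pe = people_exposed(AREA_HIGH_ARSENIC_KM2, POPULATION_DENSITY_PER_KM2, RF_ACCURACY)
print(round(pe))
```

Using a uniform state-wide density for the affected area is a simplification; village-level population counts, where available, would give a finer estimate.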


Fig. 6. Accuracy score of Random forest, Decision tree, MLP and Naive Bayes algorithms.

Fig. 7. (a) Sensitivity score of random forest, Decision tree, MLP and Naive Bayes algorithms (b) Specificity score of random forest, Decision tree, MLP and Naive Bayes algorithms.

Table 5
Result comparison of previous studies and this study.

Reference                        Study region                               Models used            Accuracy (%)
Bindal et al. (2019) [28]        Uttar Pradesh, India                       Hybrid random forest   84.6
Chakraborty et al. (2020) [37]   Ganges River Delta, India and Bangladesh   Random forest          93.0
Podgorski et al. (2020) [32]     India                                      Random forest          79.0
Wu et al. (2021) [29]            Gujarat, India                             Logistic regression    70.0
Mukherjee et al. (2021) [30]     India                                      Random forest          94.0
Kumar et al. (this study)        Jharkhand, India                           Random forest          90.11

The approximate number of people exposed (PE) to arsenic was calculated using Eq. (1) [28], which is the product of the Random Forest model's accuracy and the study area's population. Using Eq. (1), approximately 2,086,499 people live in the high arsenic concentration region (≥0.01 mg/l) and are at high risk of developing carcinogenic and non-carcinogenic diseases.

3.6.2. Mitigation based on spatial map of arsenic

Based on the spatial distribution map of arsenic, various infrastructures must be developed in the affected regions. Community oxidation plants can be created in the affected regions; they use O3, H2O2, Cl2, or NH2Cl to oxidize As(III) to As(V) [38]. As(V) is adsorbed onto a solid surface and removed more easily than As(III), giving arsenic-free water


to the community. Moreover, a coagulation–flocculation plant can be built for the community to treat water. It uses Fe- and Al-based coagulants [39] that combine with the floc created, and arsenic is precipitated to obtain arsenic-free water. Iron oxide and hydroxide adsorbents such as Zero Valent Iron (ZVI) Fe(0) [40], granular ferric hydroxide [41], and hydrous ferric oxide [42] are the most widely used to remove As(III) and As(V) from water. People should be encouraged to switch to other, arsenic-free wells. Moreover, policymakers and the government can devise plans to supply piped water and arsenic removal and filtering techniques to the affected areas of Jharkhand.

4. Conclusion

This research article assesses the groundwater arsenic concentration level of Jharkhand using machine learning. Arsenic poses a severe threat to the population living in the affected areas. Based on the findings of this study, parameters responsible for the occurrence of arsenic, such as subsoil clay content, subsoil silt content, subsoil sand content, subsoil organic content, type of soil, LULC, and DEM, were considered. Moreover, the overall performance of the Random Forest model was the best, with an accuracy of 90.11%. Based on model performance, approximately 2,086,499 people live in the high arsenic concentration region (≥0.01 mg/l) and are at increased risk of developing carcinogenic and non-carcinogenic diseases. A spatial map of arsenic can be used to identify the regions affected, and various infrastructures can be developed in the affected areas to safeguard the population.

CRediT authorship contribution statement

Siddharth Kumar: Feature engineering, Statistical analysis, Spatial analysis, Modeling, Risk assessment, Authored the manuscript. Jayadeep Pati: Conceptualized the idea of the study and the manuscript.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

[8] M. Shamsudduha, A. Uddin, J.A. Saunders, M.-K. Lee, Quaternary stratigraphy, sediment characteristics and geochemistry of arsenic-contaminated alluvial aquifers in the Ganges–Brahmaputra floodplain in central Bangladesh, J. Contam. Hydrol. 99 (1–4) (2008) 112–136.
[9] A. Mukherjee, A.E. Fryar, W.A. Thomas, Geologic, geomorphic and hydrologic framework and evolution of the Bengal basin, India and Bangladesh, J. Asian Earth Sci. 34 (3) (2009) 227–244, http://dx.doi.org/10.1016/j.jseaes.2008.05.011.
[10] D. Postma, T.K.T. Pham, H.U. Sø, V.H. Hoang, M.L. Vi, T.T. Nguyen, F. Larsen, H.V. Pham, R. Jakobsen, A model for the evolution in water chemistry of an arsenic contaminated aquifer over the last 6000 years, red river floodplain, Vietnam, Geochim. Cosmochim. Acta 195 (2016) 277–292, http://dx.doi.org/10.1016/j.gca.2016.09.014.
[11] V.S. Chauhan, R.T. Nickson, D. Chauhan, L. Iyengar, N. Sankararamakrishnan, Ground water geochemistry of Ballia district, Uttar Pradesh, India and mechanism of arsenic release, Chemosphere 75 (1) (2009) 83–91, http://dx.doi.org/10.1016/j.chemosphere.2008.11.065.
[12] D. Saha, Arsenic groundwater contamination in parts of middle Ganga plain, Bihar, Curr. Sci. 2597 (6) (2009) 753–755.
[13] P. Tirkey, T. Bhattacharya, S. Chakraborty, S. Baraik, Assessment of groundwater quality and associated health risks: a case study of Ranchi city, Jharkhand, India, Groundw. Sustain. Dev. (ISSN: 2352-801X) 5 (2017) 85–100, http://dx.doi.org/10.1016/j.gsd.2017.05.002.
[14] A. Mukherjee, A.E. Fryar, B.R. Scanlon, P. Bhattacharya, A. Bhattacharya, Elevated arsenic in deeper groundwater of the western Bengal basin, India: extent and controls from regional to local scale, Appl. Geochem. 26 (4) (2011) 600–613.
[15] D. Chakraborti, B. Das, M.M. Rahman, U.K. Chowdhury, B. Biswas, A.B. Goswami, et al., Status of groundwater arsenic contamination in the state of West Bengal, India: a 20-year study report, Mol. Nutr. Food Res. 53 (5) (2009) 542–551.
[16] R. Nickson, C. Sengupta, P. Mitra, S.N. Dave, A.K. Banerjee, A. Bhattacharya, et al., Current knowledge on the distribution of arsenic in groundwater in five states of India, J. Environ. Sci. Health A 42 (12) (2007) 1707–1718.
[17] S. Verma, A. Mukherjee, C. Mahanta, R. Choudhury, R.P. Badoni, G. Joshi, Arsenic fate in the Brahmaputra river basin aquifers: controls of geogenic processes, provenance and water-rock interactions, Appl. Geochem. 107 (2019) 171–186.
[18] A. Kumar, C.K. Singh, Arsenic enrichment in groundwater and associated health risk in Bari doab region of Indus basin, Punjab, India, Environ. Pollut. 256 (2020) 113324, http://dx.doi.org/10.1016/j.envpol.2019.113324.
[19] S.K. Acharyya, P. Chakraborty, S. Lahiri, B.C. Raymahashay, S. Guha, A. Bhowmik, Arsenic poisoning in the Ganges delta, Nature 401 (6753) (1999) 545.
[20] F.S. Islam, A.G. Gault, C. Boothman, D.A. Polya, J.M. Charnock, D. Chatterjee, J.R. Lloyd, Role of metal-reducing bacteria in arsenic release from Bengal delta sediments, Nature 430 (6995) (2004) 68–71.
[21] T.J.B. Dummer, Z.M. Yu, L. Nauta, J.D. Murimboh, L. Parker, Geostatisti-
cal modelling of arsenic in drinking water wells and related toenail arsenic
Acknowledgments concentrations across nova scotia. Canada, Sci. Total Environ. 505 (2014)
1248e1258.
[22] Q. Zhang, L. Rodríguez-Lado, C.A. Johnson, H. Xue, J. Shi, Q. Zheng, G. Sun,
The author would like to acknowledge the fellowship provided by
Predicting the risk of arsenic contaminated groundwater in shanxi province,
University Grant Commission (UGC), New Delhi, in the form of Junior northern China, Environ. Pollut. 165 (2012) 118e123.
Research Fellowship (JRF) and Senior Research Fellowship (SRF), to [23] J.J. Lee, C.S. Jang, C.W. Liu, C.P. Liang, S.W. Wang, Determining the probability
conduct this research. of arsenic in groundwater using a parsimonious model, Environ. Sci. Technol.
43 (17) (2009) 6662e6668.
References [24] L.R. Lado, D. Polya, L. Winkel, M. Berg, A. Hegan, Modelling arsenic hazard
in Cambodia: a geostatistical approach using ancillary data, Appl. Geochem. 23
(11) (2008) 3010e3018, http://dx.doi.org/10.1016/j.apgeochem.2008.06.028.
[1] A. Chattopadhyay, A.P. Singh, S.K. Singh, A. Barman, A. Patra, B.P. Mondal, K.
[25] T. Luo, S. Hu, J. Cui, H. Tian, C. Jing, Comparison of arsenic geochemical
Banerjee, Spatial variability of arsenic in indo-gangetic basin of varanasi and its
evolution in the datong basin (shanxi) and hetao basin (inner Mongolia), China,
cancer risk assessment, Chemosphere 238 (2020) 124623.
Appl. Geochem. 27 (12) (2012) 2315e2323.
[2] S.A. Lone, A.A. Lone, G. Jeelani, Characterization of groundwater potential of
[26] K.H. Cho, S. Sthiannopkao, Y.A. Pachepsky, K.W. Kim, J.H. Kim, Prediction of
Sindh Watershed Western Himalayas, J. Res. Dev. 16 (2016) 29–41.
[3] WHO, Guidelines for Drinking-Water Quality, Vol. 216, World Health contamination potential of groundwater arsenic in Cambodia, Laos, and Thailand
Organization, 2011, pp. 303–304. using artificial neural network, Water Res. 45 (2011) 5535–5544.
[4] S. Bhowmick, S. Pramanik, P. Singh, P. Mondal, D. Chatterjee, J. Nriagu, Arsenic [27] Y.K. Cha, Y.M. Kim, J.W. Choi, S. Sthiannopkao, K.H. Cho, Bayesian modeling
in groundwater of West Bengal, India: a review of human health risks and approach for characterizing groundwater arsenic contamination in the mekong
assessment of possible intervention options, Sci. Total Environ. 612 (2018) river basin, Chemosphere 143 (2016) http://dx.doi.org/10.1016/j.chemosphere.
148e169, http://dx.doi.org/10.1016/j.scitotenv.2017.08.216. 2015.02.045.
[5] A. Das, S.S. Das, N.R. Chowdhury, M. Joardar, B. Ghosh, T. Roychowdhury, [28] S. Bindal, C.K. Singh, Predicting groundwater arsenic contamination: regions at
Quality and health risk evaluation for groundwater in Nadia district, West risk in highest populated state of India, Water Res. 159 (2019) 65–76, http://dx.
Bengal: An approach on its suitability for drinking and domestic purpose, doi.org/10.1016/j.watres.2019.04.054, Basin: insight from surface complexation
Groundw. Sustain. Dev. 10 (2020) 100351. modeling. Water research 55, 30–39.
[6] ATSDR, Substance priority list, 2019, https://www.atsdr.cdc.gov/spl/index.html. [29] R. Wu, J. Podgorski, M. Berg, D.A. Polya, Geostatistical model of the spatial
(Accessed 26 July 2021). distribution of arsenic in groundwaters in Gujarat State, India, Environ. Geochem.
[7] M. Chakraborty, S. Sarkar, A. Mukherjee, M. Shamsudduha, K.M. Ahmed, A. Health 43 (7) (2021) 2649–2664.
Bhattacharya, A. Mitra, Modeling regional-scale groundwater arsenic hazard [30] A. Mukherjee, S. Sarkar, M. Chakraborty, S. Duttagupta, A. Bhattacharya, D.
in the transboundary Ganges River Delta, India and Bangladesh: infusing Saha . . ., S. Gupta, Occurrence, predictors and hazards of elevated groundwater
physically-based model with machine learning, Sci. Total Environ. 748 (2020) arsenic across India through field observations and regional-scale AI-based
141107. modeling, Sci. Total Environ. 759 (2021) 143511.


[31] National Rural Drinking Water Programme, 2021, https://ejalshakti.gov.in/imisreports/ (Accessed 7 September 2021).
[32] J. Podgorski, R. Wu, B. Chakravorty, D.A. Polya, Groundwater arsenic distribution in India by machine learning geospatial modeling, Int. J. Environ. Res. Public Health 17 (19) (2020) 7119.
[33] N.H. Batjes, E. Ribeiro, A. van Oostrum, Standardised soil profile data to support global mapping and modelling (WoSIS snapshot 2019), Earth Syst. Sci. Data 12 (2020) 299–320, http://dx.doi.org/10.5194/essd-12-299-2020.
[34] Esri sentinel-2 10-meter land use land cover, 2021, https://livingatlas.arcgis.com/landcover/ (Accessed 18 September 2021).
[35] United States Geological Survey, 2021, https://earthexplorer.usgs.gov/ (Accessed 10 October 2021).
[36] M. Alam, W.A. Shaikh, S. Chakraborty, K. Avishek, T. Bhattacharya, Groundwater arsenic contamination and potential health risk assessment of Gangetic Plains of Jharkhand, India, Expo. Health 8 (1) (2016) 125–142.
[37] M. Chakraborty, S. Sarkar, A. Mukherjee, M. Shamsudduha, K.M. Ahmed, A. Bhattacharya, A. Mitra, Modeling regional-scale groundwater arsenic hazard in the transboundary Ganges River Delta, India and Bangladesh: Infusing physically-based model with machine learning, Sci. Total Environ. 748 (2020) 141107.
[38] Y. Lee, I.H. Um, J. Yoon, Arsenic (III) oxidation by iron (VI)(ferrate) and subsequent removal of arsenic (V) by iron (III) coagulation, Environ. Sci. Technol. (2003) 5750–5756.
[39] V. Pallier, G. Feuillade-Cathalifaud, B. Serpaud, J.C. Bollinger, Effect of organic matter on arsenic removal during coagulation/flocculation treatment, J. Colloid Interface Sci. (2010) 26–32.
[40] D.E. Giles, M. Mohapatra, T.B. Issa, S. Anand, P. Singh, Iron and aluminium based adsorption strategies for removing arsenic from water, J. Environ. Manag. (2011) 3011–3022.
[41] W. Driehaus, M. Jekel, U. Hildebrandt, Granular ferric hydroxide - a new adsorbent for the removal of arsenic from natural water, J. Water Supply: Res. Technol.-Aqua (1998) 30–35.
[42] J.A. Wilkie, J.G. Hering, Adsorption of arsenic onto hydrous ferric oxide: effects of adsorbate/adsorbent ratios and co-occurring solutes, Colloids Surf. A (1996) 97–110.

Siddharth Kumar received the M.Tech. degree in Computer Science and Engineering from the Central University of Punjab, Bhatinda, India, in 2015. He is currently pursuing the Ph.D. degree in Computer Science and Engineering at the Indian Institute of Information Technology, Ranchi. His research interests include data science, machine learning, and data mining.

Jayadeep Pati received the B.Tech. degree from the Institute of Technical Education and Research, Bhubaneswar, India, in 2010, the M.Tech. degree from the National Institute of Technology, Rourkela, India, in 2012, and the Ph.D. degree from IIT (BHU), Varanasi. He is currently an Assistant Professor with the Indian Institute of Information Technology, Ranchi. His research interests include machine learning and software engineering.
