
ASSESSMENT OF GROUNDWATER QUALITY USING SOFT

COMPUTING TECHNIQUES

Submitted in partial fulfillment of the requirements for the degree of

Bachelor of Technology

in

Civil Engineering
by
Ashish Malegaon - 19BCL0181
Pratham Kumar - 19BCL0085
Rishu Kumar Thakur - 19BCL0169

Under the guidance of


Prof. Amit Mahindrakar Baburao

School of Civil Engineering

VIT, Vellore.

April 2023
DECLARATION

We hereby declare that the thesis entitled "Assessment of Groundwater Quality Using Soft Computing Techniques" submitted by us, for the award of the degree of Bachelor of Technology in Civil Engineering to VIT, is a record of bonafide work carried out by us under the supervision of Prof. Amit Mahindrakar Baburao.
We further declare that the work reported in this thesis has not been submitted and will not be submitted, either in part or in full, for the award of any other degree or diploma in this institute or any other institute or university.

Place: Vellore
Date:

MALEGAON ASHISH

RISHU KUMAR THAKUR

PRATHAM KUMAR

Signature of the candidate

CERTIFICATE
This is to certify that the thesis entitled "Assessment of Groundwater Quality Using Soft Computing Techniques" submitted by Ashish Malegaon, Pratham Kumar, and Rishu Kumar Thakur, School of Civil Engineering, VIT, for the award of the degree of Bachelor of Technology in Civil Engineering, is a record of bonafide work carried out by them under my supervision during the period 01.12.2022 to 30.04.2023, as per the VIT academic and research ethics code.

The contents of this report have not been submitted, and will not be submitted, either in part or in full, for the award of any other degree or diploma in this institute or any other institute or university. The thesis fulfills the requirements and regulations of the University and, in my opinion, meets the necessary standards for submission.

Place: Vellore
Date: Signature of the Guide

Internal Examiner External Examiner

Head of the Department


Civil Engineering
ACKNOWLEDGEMENTS

I would like to express my deepest gratitude to my capstone project guide, Mr. Amit
Mahindrakar, for his unwavering support and guidance throughout the entire duration
of this project. His expertise, patience, and dedication were instrumental in shaping
and bringing the project to fruition. His insightful feedback and valuable suggestions
helped me overcome challenges and advance the project.

I would also like to extend my heartfelt thanks to Professor Uma Shankar for his
constant encouragement, motivation, and valuable input that enriched my
understanding of the project's scope and significance. His mentorship and wisdom
have been invaluable in shaping my project and my overall academic journey.

I would also like to acknowledge the lab assistants who provided me with timely and
relevant information whenever required. Their assistance in gathering and analyzing
data, conducting experiments, and managing the laboratory resources was invaluable
and greatly contributed to the success of this project.

Once again, I express my deepest gratitude to everyone who has contributed to the
successful completion of this capstone project. Your support and guidance have been
invaluable, and I am truly honored and privileged to have had the opportunity to work
with such amazing mentors and colleagues.

Student Name:
(i). MALEGAON ASHISH
(ii).RISHU KUMAR THAKUR
(iii).PRATHAM KUMAR

I
Executive Summary

The assessment of groundwater quality is crucial for ensuring the safety and
sustainability of our water resources. Soft computing techniques provide a valuable
tool for analyzing complex data sets and making predictions about water quality. In
this report, we explore the use of soft computing techniques, including Artificial Neural
Networks (ANN), Fuzzy Logic (FL), and Genetic Algorithms (GA), to assess
groundwater quality. The results of our analysis demonstrate the effectiveness of these
techniques in predicting water quality indicators, such as pH, turbidity, and Total
Dissolved Solids (TDS). These findings provide valuable insights for researchers and
policymakers seeking to improve water resource management and ensure the safety of
our drinking water supply.

Through an extensive review of existing research papers, it has been identified that
Geographic Information System (GIS) technology is making rapid advancements but is
unable to accurately process data that contains missing information, leading to
inconsistencies. This poses a challenge when determining water quality as it requires
the measurement of several parameters to calculate a Water Quality Index (WQI). To
overcome this challenge, soft computing techniques have been employed to predict
WQI, simplifying the process. Four different soft computing models were utilized to
predict WQI, and the effectiveness of each model was assessed using various plots.

The results revealed that the Artificial Neural Network (ANN) model exhibited a high
level of agreement between predicted and regression values, with over 80% accuracy.
This finding highlights the potential of soft computing techniques as a useful tool for
predicting water quality, which could greatly enhance our ability to manage and
protect our water resources. Furthermore, by measuring multiple hydro-chemical
parameters and utilizing soft computing techniques, WQI can be predicted with high
precision, making it a viable alternative to costly water quality measurement stations.

II
TABLE OF CONTENTS

SR NO. DESCRIPTION PAGE NO.

i. Acknowledgment I
ii. Executive Summary II
iii. Table of Contents III-IV
iv. List of Figures V-VI
v. List of Tables VII
vi. Abbreviations VIII
1. Introduction 1
   1.1 Literature Review 2
   1.2 Objectives 4
   1.3 Motivation 5
   1.4 Background 5
2. Project Description and Goals 6
   2.1 Methodology 7
   2.2 Primary Goals 7
3. Technical Specification 8
   3.1 Collection of Data 8
      3.1.1 Pre-processing of collected data 9
      3.1.2 Detection of Outliers 9
   3.2 The conventional method of finding water quality index 11
4. Advancements in water quality prediction using ANN 14
   4.1 Advantages of Machine Learning (ANN) 16
   4.2 Working of Different Models 18
   4.3 Result and Discussion 20
5. Variation of Ions in Water Sample 30
   5.1 Hydrogeochemical characterization of a water sample using Hill-Piper 30
   5.2 Differentiation based on a variation of ions (QGIS) 32
   5.3 Differentiation using a Machine Model (SOM) 33
   5.4 Variable Correlation Diagram 37
6. Schedule, Tasks, and Milestones 37
7. Conclusion 38
8. Project Demonstration 38
9. References 39

IV
List of Figures

Fig. No Caption Page No.

Fig 1.1 Study area 6
Fig 2.1 Methodology 7
Fig 3.1 Detection of outliers 10
Fig 4.1 Architecture of the NARX neural network 18
Fig 4.2 Architecture of the Elman Backpropagation neural network 19
Fig 4.3 Architecture of the Cascade Forward Backpropagation neural network 19
Fig 4.4 Architecture of the Feed Forward neural network 20
Fig 4.5 Performance plots of NARX network for Pre-monsoon 21
Fig 4.6 Performance plots of NARX network for Post-monsoon 22
Fig 4.7 Training state plots of NARX network for Pre-monsoon 22
Fig 4.8 Training state plots of NARX network for Post-monsoon 23
Fig 4.9 Performance plots of Elman network for Pre-monsoon 23
Fig 4.10 Performance plots of Elman network for Post-monsoon 24
Fig 4.11 Training state plots of Elman network for Pre-monsoon 24
Fig 4.12 Training state plots of Elman network for Post-monsoon 24
Fig 4.13 Performance plots of Cascade network for Pre-monsoon 25
Fig 4.14 Performance plots of Cascade network for Post-monsoon 25
Fig 4.15 Training state plots of Cascade network for Pre-monsoon 26
Fig 4.16 Training state plots of Cascade network for Post-monsoon 26
Fig 4.17 Regression plots of Cascade network for Pre-monsoon 27
Fig 4.18 Regression plots of Cascade network for Post-monsoon 27
Fig 4.19 Performance plots of Feed Forward network for Pre-monsoon 28
Fig 4.20 Performance plots of Feed Forward network for Post-monsoon 28
Fig 4.21 Training state plots of Feed Forward network for Pre-monsoon 28
Fig 4.22 Training state plots of Feed Forward network for Post-monsoon 29
Fig 4.23 Regression plots of Feed Forward network for Pre-monsoon 29
Fig 4.24 Regression plots of Feed Forward network for Post-monsoon 29
Fig 5.1 Illustration of Hill-Piper diagram 30
Fig 5.2 Hill-Piper diagram for Pre-monsoon 2019-2020 31
Fig 5.3 Hill-Piper diagram for Post-monsoon 2019-2020 31
Fig 5.4 Spatial variation of Calcium ions 32
Fig 5.5 Temporal variation of WQI 33
Fig 5.6 Architecture of SOM 33
Fig 5.7 Output of SOM 33
Fig 5.8 Allocation of neurons for input samples 34
Fig 5.9 Neuron-to-neuron variation for each parameter 35
Fig 5.10 Distances between neurons 36
Fig 5.11 Correlation plot 37
Fig 6.1 Time plan 37

VI
List of Tables

Tab. No Caption Page No.

Tab: 3.1 Raw data collected for the pre-monsoon period of 2019-2020 8
Tab: 3.2 Raw data collected for the post-monsoon period of 2019-2020 9
Tab: 3.3 Statistical measures of input parameters 10
Tab: 3.4 Data Normalization 11
Tab: 3.5 Assigned weight for input parameters 12
Tab: 3.6 Calculated WQI for the pre-monsoon period of 2019-2020 13
Tab: 3.7 Calculated WQI for the post-monsoon period of 2019-2020 13
Tab: 3.8 WQI Range 14
Tab: 3.9 WQI Indian Standards 14

VII
List of Abbreviations

ANN Artificial Neural Network

WQI Water Quality Index

TDS Total Dissolved Solids

Wi Relative weight

QGIS Quantum Geographic Information System

HHR Human Health Risk

ML Machine Learning

NARX Nonlinear AutoRegressive network with eXogenous inputs

SOM Self-Organizing Map

MSE Mean Squared Error

VIII
1. INTRODUCTION

The assessment of groundwater quality is a crucial aspect of water resource


management, as groundwater is a major source of fresh water for human consumption,
agricultural use, and industrial applications. Traditional methods of water quality
assessment often involve collecting water samples and analyzing them in laboratories
for various parameters such as pH, turbidity, Total Dissolved Solids (TDS), and
chemical composition. However, these methods can be time-consuming, costly, and
sometimes challenging to implement in large-scale studies. Furthermore, they may not
always provide accurate results due to various environmental factors, such as
fluctuations in water quality over time and the impact of external factors such as
human activities.

With the advent of technological advancements, innovative approaches to water


quality assessment have emerged, such as Soft Computing techniques. Soft Computing
techniques are computational methods that utilize artificial intelligence algorithms to
make predictions based on incomplete, uncertain, or imprecise information. These
methods have shown great potential for water quality assessment, as they can process
large amounts of data quickly and accurately, without requiring a large amount of
prior knowledge about the system being analyzed. Moreover, these techniques have
been found to be more efficient and cost-effective than traditional laboratory-based
methods.

This capstone project focuses on evaluating the effectiveness of Soft Computing


techniques, including different models in Artificial Neural Networks (ANN) for the
assessment of groundwater quality. The primary aim of this study is to compare the
performance of these Soft Computing techniques with traditional laboratory-based
methods of water quality assessment. Specifically, the project will investigate the
accuracy of these techniques in predicting the water quality Index.

The findings of this study are expected to provide valuable insights for researchers and
policymakers in developing effective strategies for water resource management and
ensuring the safety of our drinking water supply. The use of Soft Computing techniques
in water quality assessment could revolutionize the way we monitor and manage our
water resources, making it more efficient, cost-effective, and accurate.
1
1.1.Literature Review

SR NO. AUTHOR, TITLE/JOURNAL, AND KEY FINDINGS

1. Saeedi M. et al., "Development of Groundwater Quality Index", Environmental Monitoring and Assessment, 18, 204-215. Key findings: formulated a water quality index for assessing the suitability of drinking water.

2. Arunprakash M. et al., "Impact of Urbanization in Groundwater of South Chennai City", Environmental Earth Sciences. Key findings: studied seasonal variation in groundwater quality.

3. Adimalla N. & Qian H., "Groundwater quality evaluation using (WQI) for drinking purposes and human health risk (HHR) assessment in an agricultural region of Nanganur, south India", Ecotoxicology and Environmental Safety, 176 (2019), 153-161. Key findings: evaluated groundwater quality for drinking purposes using WQI and assessed human health risk from exposure through the contaminated drinking-water pathway.

4. Lakshmipriya A. R. et al., "Groundwater Quality Analysis", International Journal of Engineering Research & Technology (IJERT), Vol. 4 (2016). Key findings: tested samples for chemical parameters and compared the values with desirable limits as per IS standards.

5. Varadarajan N. & Purandara B. K., "Groundwater Quality Investigations - A Case Study". Key findings: performed chemical analysis of samples collected from the Belgaum and Bijapur districts of Karnataka by standard methods, classified the water types with a Hill-Piper diagram, and calculated salinity using the US Salinity diagram.

6. Prasanna M. V. et al., "Study of evaluation of groundwater in Gadilam basin using hydrogeochemical and isotope data", Environmental Monitoring and Assessment (2010), 168:63-90. Key findings: collected samples across different seasons, compared test results on correlation plots, and determined water composition based on the different sources and surroundings.

7. Bharani R. et al., "Hydrogeochemistry and groundwater quality appraisal of part of south Chennai coastal aquifers, Tamil Nadu, India using WQI and fuzzy logic method", Applied Water Science. Key findings: estimated WQI using fuzzy logic, together with ArcGIS and the Hill-Piper trilinear diagram.

8. Ramakrishnaiah C. R. et al., "Assessment of Water Quality Index for the Groundwater in Tumkur Taluk, Karnataka State, India", E-Journal of Chemistry, 2009, 6(2), 523-530. Key findings: estimated WQI values from 17 parameters and formed a regression analysis equation.

9. Tiwari S. K. et al., "Groundwater quality assessment using water quality index (WQI) under GIS framework", Applied Water Science (2021). Key findings: computed WQI from 20 parameters using GIS software.

10. Balraj Singh et al., "Soft Computing Technique-Based Prediction of Water Quality Index", Water Supply, Vol. 21, No. 8, 4015. Key findings: predicted WQI using three soft computing techniques with 10 parameters, compared on six fitness criteria.

These research papers reflect that, because GIS was still being developed, inaccuracies and omissions created problems with data quality. It was therefore challenging to handle the inconsistent data using traditional computing techniques, so a mechanism that can handle such conflicting data is needed.
Soft computing is made up of approaches that complement one another and offers a flexible information-processing capability for dealing with the ambiguous scenarios that arise in everyday life. These models allow for inconsistent, error-filled, noisy, and missing-value data. Thus, soft computing may offer a potent tool for GIS to solve the inconsistent-data issue.

1.2. Objectives

● To examine the suitability of groundwater for drinking and irrigation


purposes using conventional methods of obtaining the Water Quality
Index (WQI) using 44 samples from open wells in the Nagapattinam
district.

● Mapping the spatial and temporal variations of groundwater ions


including pH, Total Dissolved Solids (TDS), Bicarbonate, Chloride,
Sulphate, Calcium, Magnesium, Sodium, Potassium, and Nitrate using
QGIS.

● Predict groundwater quality using the most suitable machine model


from Artificial Neural Networks techniques.

4
1.3. Motivation
A key component in managing water resources and ensuring people have access to clean drinking water is groundwater quality evaluation. However, the traditional procedures used for assessing groundwater purity are difficult to apply in the presence of noisy information. Additionally, those approaches take a long time to get right and may not handle all the variables linked to water systems. Machine learning models therefore present compelling options for dealing with the complexity and variability of water bodies. The choice to use soft computing strategies to evaluate the quality of groundwater for this capstone project was motivated by two major factors.

First, soft computing approaches prove helpful for determining water quality because they are able to work with ambiguous and fuzzy data. They can rapidly examine such data, simulate nonlinear relationships, and generate accurate projections. We are curious to learn how they can be applied to real data.

Furthermore, this capstone project allows us to practice using cutting-edge computational tools and techniques. We can improve our technical expertise in the discipline of water resources engineering, particularly in the implementation of soft computing methods, through the completion of this project.

1.4. Background
Tamil Nadu, a state in southern India, includes the coastal district of Nagapattinam, which lies on the Bay of Bengal at 10.7668° N latitude and 79.8447° E longitude. The district is rich in farmland and grows crops such as rice, sugar cane, and coconut over an area of around 2715 square kilometers. Nagapattinam receives moderate to substantial rainfall from October to December during the monsoon season. The district is primarily flat and low-lying, rising on average 5 meters above sea level, and it has a tropical climate with hot, muggy weather all year.

Roughly 1248 mm of precipitation falls on average every year in the district, with periodic variations due to factors such as El Niño and La Niña. Over the year, relative humidity fluctuates between 70 and 90 percent, making the district generally humid; owing to the monsoon rains, humidity stays around 90% from July to September. Nagapattinam district's mean temperature oscillates between 27 °C and 32 °C. April and May are typically the two warmest months, with midday peaks often reaching 35 °C.

Fig: 1.1 Study Area

2. Project Description and Goals

The capstone project applies soft computing algorithms to judge groundwater quality. To derive predictive models of the water's features, the project will use input obtained from numerous underground water sources in the selected region together with soft computing techniques such as fuzzy logic and neural networks. Quantitative criteria such as the mean absolute error and the regression value of the prediction will be employed to validate the simulations. The project also uses sensitivity testing to investigate the components that trigger groundwater contamination and their relative importance. The research is expected to lead to the formulation of productive groundwater administration strategies and shed light on the prospective use of computer-based tools for groundwater quality assessment.

6
2.1. Methodology

Fig: 2.1 Methodology


A well-structured plan is essential for the successful execution of the project. The
initial step involves obtaining sample data specific to the chosen region. The collected
data is then subjected to different analytical techniques to eliminate any errors. Once
the data is cleansed, a conventional method of the weighted average is employed to
calculate the water quality index. In addition, QGIS is utilized to generate maps
depicting the variation of ions. Finally, various Artificial Neural Network models are
employed to predict the water quality index based on the collected data.

2.2. Primary goals


The Tamil Nadu Water Supply and Drainage Board provided the information for this study, and the weighted-average approach was applied to calculate the water quality index. The information was then thoroughly examined using a variety of methodologies to discover more about the different physiochemical traits. Listed below are the project's fundamental targets:
● The Hill-Piper diagram has been chosen to illustrate the hydrochemical attributes present in water samples.
● Based on the input criteria, an artificial neural network (ANN) model can be designed to predict the water quality index. Multiple algorithms can be trained with varying layer sizes to compare performance outcomes.
● It is feasible to identify which parameters are most sensitive to fluctuations in the water quality score using visuals such as correlation plots and machine models.
● Water samples can be classified based on the threshold values of the permissible and desirable limits.

7
3. Technical Specifications

Examining an immense amount of data is a challenging but essential step in determining the quality of groundwater. Standard quantitative metrics are commonly employed, and using machine learning models to predict results follows similar steps:

3.1. Collection of Data


We are conducting our studies in Nagapattinam district, Tamil Nadu, which comprises 44 open wells, and we are analyzing data for the years 2019-2020. The raw data has been sourced from the Tamil Nadu Water Supply Board and includes 10 parameters: pH, TDS, bicarbonate, chloride, sulfate, calcium, magnesium, sodium, potassium, and nitrate. We have collected the data for both the pre-monsoon and post-monsoon periods of the specified years.
Table: 3.1 Raw Data Collected for the pre-monsoon period of 2019-2020

8
Table: 3.2. Raw Data Collected for the post-monsoon for period of 2019-2020

3.1.1. Pre-processing of collected data


The data collected from a third-party source may have some discrepancies, and since it
involves a large number of samples, there may be a possibility of missing data. It would
be cumbersome to manually inspect for errors and calculate the values. Therefore, we
employed Python programming to process and clean the data and to remove any
anomalies present in the dataset. Additionally, the statistical measures of mean and
standard deviation were utilized to normalize the information. This approach enabled
us to efficiently process the data and derive meaningful insights.
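As a sketch of this cleaning step, the mean imputation of missing values described above might look like the following (the column names and sample values here are hypothetical, not drawn from the actual dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical slice of the raw data; the real dataset has 44 wells
# and 10 parameters per season.
df = pd.DataFrame({
    "pH":  [7.2, np.nan, 8.1, 6.9],
    "TDS": [540.0, 1200.0, np.nan, 880.0],
})

# Replace each missing entry with the mean of its column.
df_clean = df.fillna(df.mean())
```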

3.1.2. Detection of Outliers


One of the simplest approaches to tackling the problem of errors and missing data in
the dataset is to substitute the missing values with the mean of the input values. To
achieve this, we first computed the basic statistical measures of mean, standard
deviation, and minimum and maximum values for the dataset. Following this, we
identified any outliers present in the dataset, which are values that deviate
significantly from the central tendency of the data. This process of outlier detection
and mean substitution is a commonly used technique in data cleaning and preparation.

By leveraging these statistical methods, we were able to address the inconsistencies in
the data and obtain more reliable and consistent results.

Tab: 3.3 Statistical measures of input parameters

Fig: 3.1 Detection of outliers

The identification of outliers in the dataset was accomplished by computing the z-score using the standard formula. Any values more than 3 standard deviations above or below the mean were flagged for review and possible correction. As depicted in Fig: 3.1, some outliers deviated significantly from the mean values, which is an expected occurrence, since completely pure water cannot be obtained in the natural environment. These outliers may represent the presence of contaminants in the water sample, and their identification is crucial in assessing the water's quality. Through this process of outlier detection and analysis, we were able to identify potential sources of contamination and make informed decisions regarding the remediation of the affected water sources.
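A minimal sketch of this z-score screening (the threshold of 3 standard deviations follows the text; the readings are illustrative only):

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Return a boolean mask marking values more than `threshold`
    standard deviations away from the mean."""
    v = np.asarray(values, dtype=float)
    z = (v - v.mean()) / v.std()
    return np.abs(z) > threshold

# Twenty typical TDS readings plus one extreme value: only the
# extreme value is flagged.
mask = zscore_outliers([500.0] * 20 + [5000.0])
```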

10
Following the removal of outliers and correction of erroneous values, the next step
involved data normalization, which is a crucial technique used to bring the data values
to a common scale. This is achieved by calculating the mean and standard deviation of
each sample and scaling the data to a comparable, equivalent scale. The process of
normalization is essential when dealing with data that has different ranges and units,
as it facilitates effective comparison and analysis of the data. By utilizing the statistical
measures of mean and standard deviation, we were able to normalize the data and
obtain reliable and consistent results that could be used for further analysis and
evaluation.
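The scaling described above is the usual z-score normalization (subtract the mean, divide by the standard deviation); a minimal sketch:

```python
import numpy as np

def normalize(column):
    """Scale a parameter column to zero mean and unit standard deviation."""
    v = np.asarray(column, dtype=float)
    return (v - v.mean()) / v.std()

# Illustrative values only; after scaling, mean is 0 and std is 1.
scaled = normalize([120.0, 450.0, 890.0, 300.0])
```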

Tab: 3.4 Data Normalization

3.2. The conventional method of finding the water quality index

Upon completion of the data cleaning process for all 44 samples comprising 10
parameters, the next step involved assigning weights to each parameter based on their
relative importance and determining the permissible and desirable limits for each. By
utilizing these assigned weights, the relative weight Wi was computed for each
parameter, which takes into consideration their individual significance in the overall
assessment of groundwater quality. This approach enables a more accurate and
comprehensive evaluation of the water quality parameters and can help in identifying
potential sources of contamination and taking appropriate measures to address them.

11
Tab: 3.5 Assigned weight for input parameters

Parameter | Desirable Limit | Highest Permissible Limit | Assigned Weight | Relative Weight Wi
pH        | 6.5  | 8.5  | 1  | 0.029411765
TDS       | 500  | 2000 | 5  | 0.147058824
HCO3      | 200  | 600  | 1  | 0.029411765
Cl        | 250  | 1000 | 5  | 0.147058824
SO4       | 200  | 400  | 5  | 0.147058824
Ca        | 75   | 200  | 3  | 0.088235294
Mg        | 30   | 100  | 3  | 0.088235294
Na        | 0    | 200  | 5  | 0.147058824
K         | 0    | 12   | 2  | 0.058823529
NO3       | 0    | 45   | 4  | 0.117647059
Total     |      |      | 34 | 1.000000000

Using the relative weight and the highest permissible value of each parameter, the observed concentrations (in mg/L) are converted into dimensionless sub-indices:

SIi = (Ov × 100 × Wi) / Sn

where Ov = observed value of the ith parameter of the sample, Sn = standard permissible value of the ith parameter, and Wi = relative weight of the ith parameter (refer Tab: 3.5).
Adding the resulting sub-indices of all 10 parameters gives the water quality index.
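The sub-index and summation steps can be sketched as follows; the assigned weights and highest permissible limits (Sn) are taken from Tab: 3.5, while any sample passed in is illustrative:

```python
# Assigned weights and highest permissible limits (Sn) from Tab: 3.5.
weights = {"pH": 1, "TDS": 5, "HCO3": 1, "Cl": 5, "SO4": 5,
           "Ca": 3, "Mg": 3, "Na": 5, "K": 2, "NO3": 4}
sn = {"pH": 8.5, "TDS": 2000, "HCO3": 600, "Cl": 1000, "SO4": 400,
      "Ca": 200, "Mg": 100, "Na": 200, "K": 12, "NO3": 45}
total_weight = sum(weights.values())  # 34, matching Tab: 3.5

def wqi(sample):
    """Weighted-average WQI: sum of sub-indices SI_i = Ov * 100 * Wi / Sn,
    where Wi is the relative weight (assigned weight / total weight)."""
    return sum(sample[p] * 100 * (weights[p] / total_weight) / sn[p]
               for p in sample)
```

A convenient sanity check: a sample in which every parameter sits exactly at its highest permissible limit yields a WQI of 100, the boundary of the "unsuitable for drinking" range in Tab: 3.9.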

12
Tab: 3.6 Calculated WQI for the pre-monsoon period of 2019-2020

Tab: 3.7 Calculated WQI for the post-monsoon period of 2019-2020

After verifying the results, we can determine the water quality based on the calculated index and assess its suitability for drinking. To gain deeper insights, we can create a correlation plot and analyze which parameters have a significant impact on the water quality index. Based on this information, we can implement appropriate measures to improve the quality of the water and make it cleaner. It is important to take the necessary actions to ensure that the water is safe for consumption.
Tab:3.8 WQI Range

WQI Range Wells

0-25 19 samples

26-50 18 samples

51-75 6 samples

76-100 NIL

>100 1 sample

Tab:3.9 WQI Indian Standards

>100 unsuitable for Drinking

76-100 Very Poor

51-75 Poor

26-50 Good

0-25 Excellent

4. Advancements in Water Quality Prediction Using ANN


In order to address complicated problems, soft computing techniques leverage principles of biology derived from the natural world around us. Conventional approaches cannot always tackle key problems, since they necessitate complete information as well as precise mathematical frameworks, while real-life circumstances involve inconsistencies and errors. To solve this issue, soft computing methods that can deal with such partial and faulty data are implemented. Some examples of soft computing techniques are fuzzy logic, neural networks, genetic algorithms, and swarm intelligence.
Fuzzy logic- Because it works with partial truth and degrees of membership, fuzzy logic can deal with unclear and unreliable data. This makes it a powerful tool for decision-making problems and for identifying trends where specific information is necessary but not readily available.
Neural networks- are composed of multiple layers of interconnected nodes that work together to produce the solution to a problem, analogous to the human brain. A network takes inputs in the input layer, processes them in one or more hidden layers, and finally produces results at the output layer. Typically, 70% of the dataset, referred to as the training dataset, is used during training, where input-output pairings are used to adjust the connections among the neurons; the primary goal is training the network to generalize the patterns. The remaining portion of the dataset is divided into testing and validation datasets, on which fresh batches of data are presented and the network's performance is evaluated by its accuracy. Neural networks are primarily employed in forecasting problems owing to their capacity to comprehend patterns and correlations between variables.
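The 70% training split described above can be sketched as follows; the equal 15%/15% division of the remainder between validation and testing is an assumption for illustration, since the text only says the remainder is divided between the two:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_samples = 100                       # e.g. pooled seasonal samples
indices = rng.permutation(n_samples)  # shuffle before splitting

# 70% training; the rest split between validation and testing
# (assumed 15%/15% here).
train, val, test = np.split(indices, [int(0.70 * n_samples),
                                      int(0.85 * n_samples)])
```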
Genetic algorithm- The theory of genetics and natural selection underlies the genetic algorithm, a method of optimization. In this approach, an assortment of potential solutions to a problem evolves in the form of chromosomes, which are strings of genes. Selection (in which a set of parent chromosomes with desirable characteristics is chosen), crossover (in which the genetic code of pairs of parents is exchanged to produce offspring with the most effective genes), and mutation (in which gene values are modified at random) are used to evolve these solutions based on how well they perform. This iterative process is carried out until a population of the best solutions is reached.
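The selection/crossover/mutation loop described above can be sketched on a toy problem (maximizing the count of 1-genes in a bit string; every parameter value here is illustrative, not from this project):

```python
import random

random.seed(42)

def fitness(chromosome):
    """Toy objective: number of 1-genes (the 'OneMax' problem)."""
    return sum(chromosome)

def evolve(pop_size=30, n_genes=20, generations=60, mutation_rate=0.02):
    # Initial random population of bit-string chromosomes.
    population = [[random.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the fitter half as parents.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(children) < pop_size:
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, n_genes)      # single-point crossover
            child = a[:cut] + b[cut:]
            # Mutation: flip each gene with a small probability.
            child = [g ^ 1 if random.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = children
    return max(population, key=fitness)

best = evolve()
```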
Swarm intelligence- This technique utilizes the notion of social interactions observed in ecological systems such as ant colonies, bee swarms, and fish schools, where simple individuals that follow basic rules cooperate. Extremely intricate problems can be solved through these sophisticated synergistic interactions. Swarm intelligence can also adjust to changing situations and respond via local interactions. It is mostly employed for problems such as routing, scheduling, and classification that call for optimization.

4.1. Advantages of Machine Learning (ANN)


Accurate prediction: One of machine learning's biggest benefits is its capacity to correctly forecast water quality measurements based on prior data. Regression models, decision trees, and neural networks are a few examples of machine learning algorithms that can analyze data patterns and predict outcomes with high levels of accuracy. This can assist in identifying potential issues with water quality before they become severe.
For instance, using information gathered from numerous sources, including satellite
imaging, water quality sensors, and weather data, machine learning models can
forecast the concentration of pollutants in water bodies. Machine learning algorithms
can use these datasets to analyze the possibility of water pollution and aid in stopping
the development of pollutants in the water.
Efficient Analysis: Monitoring water quality involves collecting an extensive volume
of information on multiple parameters, including pH, temperature, turbidity, dissolved
oxygen, and contaminants, from diverse sources, including rivers, lakes, and
groundwater. Data on water quality are often analyzed manually, which can be
laborious and prone to inaccuracy.
However, machine learning algorithms are an effective tool for monitoring water
quality because they can quickly and effectively analyze huge datasets. In large data
sets, machine learning algorithms can identify patterns and trends that could take
humans a long time or a lot of effort to notice. These algorithms can also determine
correlations between various aspects of water quality, which can be useful in locating
probable sources of contamination and forecasting problems with water quality.
Improved decision-making: Machine learning algorithms' insights can help decision-
makers find the best measures for maintaining or improving water quality. For
instance, information on the causes and sources of water contamination can be
provided by machine learning algorithms, which could help policymakers create
focused solutions to these problems. Machine learning may additionally provide
insight into the effects of various water quality management measures, such as the
efficiency of various treatment methods or the effects of changing land use on water
quality.
Cost-effective: Machine learning could help in improving resource utilization and
lowering the cost of managing and monitoring water quality. Machine learning, for
instance, can assist in lowering the need for expensive laboratory testing, which can be
time- and resource-intensive. Machine learning can also assist in lowering the need for
regular manual monitoring of water quality parameters, which can be expensive, by
offering accurate predictions and real-time monitoring.

The ability of ANNs to represent both linear and non-linear relationships is one of their
key advantages. This means that even when such relations are not evident or clear to
describe using conventional statistical approaches, ANNs may uncover complicated
patterns and relationships in data.
In order to find patterns and correlations between various groundwater quality
statistics, ANNs can be trained to analyze data from a variety of sources, such as water
quality monitoring stations, geological data, and land-use information. ANNs can offer
a simple technique to model groundwater quality and produce precise predictions
about the quality of groundwater at specific locations by recognizing these
relationships directly from data. An additional benefit of ANNs is their ability to provide simulated values for locations of interest where the measured data needed for water quality estimates are unavailable. This is especially helpful for assessing groundwater quality, because it can be difficult to obtain measurements from all relevant areas. By simulating data with ANNs, models of the water quality in these areas can be made more thorough and precise.
Additionally, ANNs learn from the data themselves rather than from explicitly programmed rules. This means that even when patterns are not obvious or widely known, ANNs can still find hidden patterns and relationships in the data. As a result, ANNs can recognize complex relationships between groundwater quality metrics and make more precise predictions of groundwater quality.
Finally, because ANNs store what they learn in their network weights rather than in a database, their operation is tolerant of data loss. As a result, ANNs are particularly helpful in situations where data collection is challenging or costly, since they can continue to produce accurate predictions even when data are missing or incomplete. The ability to represent both linear and non-linear relationships, to generate simulated values for locations of interest, to learn automatically, and to perform well without complete data are just a few of the benefits that ANNs offer for evaluating groundwater quality. By exploiting these benefits, ANNs can assist water quality professionals in making better decisions that safeguard environmental and human health.

4.2. Working of Different Models


The soft computing technique used for this study is the Artificial Neural Network

(ANN), which has numerous models within it. The processing was done in MATLAB, and the neural networks chosen were cascade forward backpropagation, feed-forward, Elman backpropagation, the NARX (Nonlinear AutoRegressive with eXogenous inputs) neural network, and self-organizing maps, for the purpose of training and predicting the water quality index of the Nagapattinam district in Tamil Nadu.
NARX neural network: This network is primarily used to simulate nonlinear systems. It has an input layer that houses the network's input. In order to fully comprehend
the functioning of the framework, the input for this network consists of the prior
inputs and outputs. The input data is transformed into the output through the hidden
layer, and the outcome, which is what is projected for the current time, is indicated in
the output layer. In order to train the network, input-output pairs are fed into it.
Backpropagation weight adjustments are then made in order to optimize the network
by minimizing the discrepancy between estimated and actual output. After the
completion of the training process, a new dataset is introduced by feeding the relevant
historical input-output pairs, and the output is then predicted and reported. Problems
involving time series and signal processing are the principal applications for this.
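The idea of predicting the current output from prior inputs and outputs can be illustrated with a toy sketch. Here a simple linear least-squares fit stands in for the NARX network's hidden layer, and the series itself is a hypothetical example, not data from the study.

```python
import numpy as np

def narx_dataset(u, y, lags=2):
    """Build input-output training pairs the way a NARX network does:
    predict y[t] from the previous outputs y[t-lags..t-1] and the
    previous inputs u[t-lags..t-1]."""
    X, target = [], []
    for t in range(lags, len(y)):
        X.append(np.r_[y[t - lags:t], u[t - lags:t]])
        target.append(y[t])
    return np.array(X), np.array(target)

# hypothetical series: y depends on its own past and on the past input
rng = np.random.default_rng(0)
u = rng.normal(size=200)
y = np.zeros(200)
for t in range(2, 200):
    y[t] = 0.6 * y[t - 1] - 0.2 * y[t - 2] + 0.5 * u[t - 1]

X, target = narx_dataset(u, y)
A = np.c_[X, np.ones(len(X))]                 # add a bias column
w, *_ = np.linalg.lstsq(A, target, rcond=None)
mse = float(np.mean((A @ w - target) ** 2))   # near zero: the fit is exact
```

In the actual NARX network, a nonlinear hidden layer trained by backpropagation replaces this linear map, which lets it capture relationships a linear fit cannot.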

Fig:4.1 Architecture of the NARX neural network

Elman backpropagation: This particular recurrent neural network has applications in


the area of soft computing and includes three interconnected types of layers: input,
hidden, and output. The quantity of input values in the input layer varies depending on
how intricate the problem being studied is; for tasks of high complexity, there may be
more than one input layer. The difficulty of the problem also determines the number
of hidden layers, which are fed information from the output of the input layer and the
previously hidden layer. The output layer produces the system's final output after
receiving the output of the hidden layers. Additionally, this approach makes use of the
backpropagation technique, which minimizes the discrepancy between the expected
and actual output values by modifying the network weights. The backpropagation
methodology is used to eliminate errors as input and output values are fed into the
system during training. After the training phase has concluded, the system can be
tested by feeding it new input data to anticipate outputs, known as the "testing phase."
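The recurrent structure described above, in which the hidden layer receives the current input together with its own previous activation, can be sketched as follows. The weights and dimensions are arbitrary illustrative values, not parameters from the study.

```python
import numpy as np

def elman_forward(inputs, W_in, W_rec, W_out):
    """One forward sweep of an Elman-style network: the hidden layer sees
    the current input plus its own previous activation (context units)."""
    h = np.zeros(W_rec.shape[0])
    outputs = []
    for x in inputs:
        h = np.tanh(W_in @ x + W_rec @ h)   # hidden state carries memory
        outputs.append(W_out @ h)           # linear readout
    return np.array(outputs)

rng = np.random.default_rng(1)
W_in = rng.normal(scale=0.5, size=(4, 3))    # 3 inputs -> 4 hidden units
W_rec = rng.normal(scale=0.5, size=(4, 4))   # hidden -> hidden (context)
W_out = rng.normal(scale=0.5, size=(1, 4))   # 4 hidden units -> 1 output
ys = elman_forward(rng.normal(size=(6, 3)), W_in, W_rec, W_out)
```

Training would then adjust W_in, W_rec, and W_out by backpropagation, exactly as the paragraph above describes.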

Fig:4.2 Architecture of the Elman backpropagation neural network

Cascade forward backpropagation: Since this operates as a feedforward network, data flow only from the input to the output. The input layer receives the data, which are then transmitted to the hidden layer. Each neuron in the system is connected by weights and, because the output of one layer serves as the input of the next, the weights are allocated appropriately, limiting overfitting. Each neuron calculates the
weighted total after receiving the inputs and then executes an activation function in
order to generate the output. Given the variety of activation functions available,
including sigmoid, tanh, and ReLU, among others, this activation function is selected
depending on its objective, such as the binary categorization that the sigmoid function
is capable of performing. Each projected output is compared to the actual output after
its generation in order to identify discrepancies. The weights of the connections
undergo modifications to minimize the variation between the actual and predicted
output as these errors move in a backward pass. As long as precise outcomes are
obtained, this iterative procedure is maintained.
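The forward pass, error computation, and backward weight adjustment described above can be illustrated for a single sigmoid neuron. The input, initial weights, target, and learning rate below are arbitrary illustrative values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One neuron trained by backpropagation: compute the weighted sum, apply
# the activation, compare with the target, and move the weights against
# the error gradient.
x = np.array([0.5, -1.0, 0.25])
w = np.array([0.1, 0.2, -0.3])
target, lr = 1.0, 0.5

for _ in range(200):
    out = sigmoid(w @ x)                # forward pass: weighted sum + activation
    err = out - target                  # discrepancy with the actual output
    grad = err * out * (1 - out) * x    # chain rule through the sigmoid
    w -= lr * grad                      # backward pass: adjust the weights

final = float(sigmoid(w @ x))           # close to the target after training
```

A full network repeats this update for every weight in every layer, propagating the error backwards from the output layer.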

Fig:4.3 Architecture of a Cascade Forward Backpropagation Neural Network

Feed-forward neural network- Also known as the multilayer perceptron, this network is similar to the cascade forward backpropagation neural network and functions in much the same way. It has an output layer, several hidden layers, and one input layer comprising
the input values. All of the above layers are linked by weights, and the neuron in the
following layer utilizes the output from the layer prior as its input before the activation
function is used to produce the output. First, the weights are initialized with random
values for the input data, which is done through the forward pass procedure. Here, the
inputs are provided, and the current weights are used to construct the output. The
generated and the actual outputs are compared, and the error is conveyed back by
backward pass, just as in the cascade forward approach. The weights are then changed
as necessary to reduce inaccuracies up until precise results are generated. The only
distinction between a feed-forward neural network and a cascade-forward neural
network lies in the learning algorithm and network design. A feed-forward network has a set number of layers chosen before the training phase, unlike the cascade forward approach, which has an adaptive architecture. This method is faster and computationally less expensive than the cascade forward method, since it uses fewer computing resources.

Fig:4.4 Architecture of a Feedforward Neural Network

4.3. Results
The soft computing technique used here was the ANN (Artificial Neural Network), due to the advantages mentioned in Section 4.1. A total of five neural networks, namely
Cascade forward backpropagation, Feed-forward, Elman backpropagation, the NARX
(Nonlinear AutoRegressive with eXogenous inputs) neural network, and Self-organizing maps, were compared, and the most suitable network was selected based
on accuracy and performance. The training function used was TRAINLM, and the
adaptation learning function used was LEARNGDA. The mean squared error (MSE) was used as the performance function, and the training,
validation, and testing were done for 5, 10, and 30 layers in order to obtain the best
results, and the network was trained only once since the sample size is small. As
mentioned, 70% of the data was used as a training dataset, 15% was used as a
validation dataset with which the network was not familiar, and 15% was used for the
testing phase. Out of the 44 output values (the WQI), only 5 were made known to the network so that it would predict the remaining values, which helped in determining the network's level of precision. The three figures in the results for each model represent 5, 10, and 30 layers respectively.
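The 70/15/15 partition described above can be sketched as follows. The shuffling seed is an assumption; the study's actual partitioning procedure may differ.

```python
import random

def split(records, train=0.70, val=0.15, seed=42):
    """Shuffle and partition records into training, validation, and
    testing sets with the 70/15/15 proportions used in the study."""
    rows = list(records)
    random.Random(seed).shuffle(rows)
    n_train = round(len(rows) * train)
    n_val = round(len(rows) * val)
    return (rows[:n_train],
            rows[n_train:n_train + n_val],
            rows[n_train + n_val:])

train_set, val_set, test_set = split(range(44))   # 44 WQI samples
```

With 44 samples this yields 31 training, 7 validation, and 6 testing records.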

NARX neural network
Upon executing the network, two plots were generated to assess the accuracy of the network for the given input and output: the performance plot and the training state plot. A comparison of 5, 10, and 30 layers for both pre-monsoon and post-monsoon is depicted below:

Fig:4.5 Performance plots of NARX network for the pre-monsoon


The y-axis in this graph is the mean squared error (MSE), which is one of the measures
to analyze the performance of the system, and the x-axis is epochs, or the number of
iterations the network runs. The blue line represents the training dataset, which is
70% of the main dataset, the green line represents the validation dataset, which
comprises 15% of the data, the red line depicts the testing dataset, which uses the
remaining proportion of the data; and the dotted line is the best value, showing the
point where the network attains the least error. As the network is trained, the errors on the training dataset continuously decrease, so performance is judged on the validation dataset, a new dataset the system is not familiar with. Excessive training of the network can also cause overfitting, where training accuracy is high but validation accuracy is low, because the network begins memorizing the values instead of generalizing the pattern. Here it can be noted that the least error while using 5, 10, and 30 layers was 212.3752, 131.7722, and 90.4549 respectively, attained after 2, 3, and 40 epochs respectively. When using 10 layers, the error was far less than with 5 layers but slightly higher than with 30 layers, which may be because the latter underwent more iterations.
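The MSE plotted on the y-axis is simply the average of the squared differences between the actual and the predicted values; a minimal sketch with hypothetical numbers:

```python
def mse(actual, predicted):
    """Mean squared error: the average squared difference between the
    measured and the predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

error = mse([100, 120], [110, 118])   # (100 + 4) / 2 = 52.0
```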

Fig:4.6 Performance plots of NARX network for post-monsoon
Here the performance plots for different numbers of layers in the post-monsoon period are compared. The least errors obtained while using 5, 10, and 30 layers were 102.8155, 8.6166, and 3660.5048 respectively. The 10-layer network clearly produced the least error after only 24 epochs, and it was hence the most effective and precise arrangement.

Fig:4.7 Training state plots of NARX network for pre-monsoon


The training state plot consists of three main graphs, namely the gradient plot, the Mu plot, and the validation check, which together show how well the network is performing. The gradient plot indicates how well the network is being trained and what changes could improve the accuracy: a high gradient implies that training is proceeding very fast and the generated solutions may not be precise, whereas a low gradient shows that the network is being trained very slowly, in which case increasing the learning rate might improve the process. The Mu plot reflects the learning rate of the network, i.e., the speed of the training process. A very high learning rate implies the network is learning quickly but may initially overshoot the most favorable solutions; with a lower learning rate, the network can produce proper output. The validation check is performed on the validation dataset. From the above graph, it can be seen that the 5-layer plot, with a gradient of 2086.7484 after the final iteration, trained more accurately than the 10- and 30-layer plots, whose gradients were 9124.4382 and 11847.3048 respectively, and its gradient curve was comparatively smooth. The 10-layer plot had the highest Mu overall, indicating that the network was learning rapidly and generating results more quickly than the other two plots.

Fig:4.8 Training state plots of NARX network for post-monsoon


In the case of the post-monsoon groundwater data, the gradient curve is very irregular across all three networks, which implies that the training was not smooth. However, a decreasing trend is roughly visible in the 10-layer network, and the 10-layer plot reaches a minimum gradient of 7.5361 at the end of the training phase. The network with 30 layers had the fastest learning rate and updated its weights and biases most rapidly of the three networks tested.
Elman backpropagation neural network
In this network, the same two governing plots were generated: the training state plot and the performance plot. The plots for different numbers of layers for both pre-monsoon and post-monsoon are compared below:

Fig:4.9 Performance plots of Elman network for pre-monsoon


Similar to the NARX neural network, the y axis in this plot represents the mean
squared error which is the measure of performance and the x-axis represents the
number of times the data was passed through the whole network called the epochs.
For the pre-monsoon data, the least MSE for 5, 10, and 30 layers was 33.6709, 80.4438, and 1923.9966 respectively. The errors with 10 layers were significantly lower than with 30 layers, which might be due to the small sample size and the excess number of layers in the 30-layer network. The 5-layer plot performed better than the 10-layer plot, with fewer iterations in this case.

Fig:4.10 Performance plots of Elman network for post-monsoon
In the post-monsoon dataset, it is clear that the 5-layer network, with an MSE of only 10.6312, performed better than the 10- and 30-layer networks, whose MSE values were 178.0775 and 215.572 at 11 and 7 epochs respectively. From the performance plots of both pre-monsoon and post-monsoon data, it can be noted that the network with 5 layers was the best, with the least error.

Fig: 4.11 Training state plots of Elman network for pre-monsoon


For the pre-monsoon dataset, the smoothest training was that of the network with 30 layers, which also had the least gradient, implying a precise training process. The 5- and 10-layer plots showed a similar trend in the Mu graph, indicating that their learning rates did not differ much from each other, whereas the Mu graph of the 30-layer plot decreased linearly.

Fig:4.12 Training state plots of Elman network for post-monsoon

Cascade forward backpropagation neural network


This type of network produces three plots in total: the performance plot, the training state plot, and the regression plot. The regression plot is an additional plot here and is a useful tool for properly gauging the accuracy of the network since, unlike the Elman and NARX neural networks, this network also facilitates predicting the output. This was the network ultimately selected, owing to its accuracy. The comparison for 5, 10, and 30 layers is depicted below for all types of plots for the pre-monsoon and post-monsoon datasets:

Fig:4.13 Performance plots of Cascade network for pre-monsoon


Just like the previous performance plots, the network is analyzed based on the
validation dataset (green line) as the errors for the training dataset, which is depicted
by the blue line, will continue plummeting upon more training. This might be ideal for
the training data but might cause overfitting by memorizing the data and lowering the
validation accuracy rather than generalizing the pattern, which is the main purpose of
the network. Here it can be observed that the best validation performance for the 5-, 10-, and 30-layer networks was 17.634, 49.0322, and 4.8084 at 4, 3, and 14 epochs respectively. In each case, if the network is trained beyond the given number of epochs, the MSE on the validation dataset starts increasing as a result of overfitting. From the given plots it can be seen that the 30-layer network had the least error after 14 iterations, making it more precise than the others. However, it should be noted that the performance plot is a coarser measure of accuracy than the regression plot.

Fig:4.14 Performance plots of Cascade network for post-monsoon


When analyzing the post-monsoon data, the 30-layer network had an MSE of 143.7315, higher than the 5- and 10-layer networks, whose best validation performances were 4.2066 and 88.7382 respectively, making the 5-layer network the one with the least MSE.

Fig:4.15 Training state plots of Cascade network for pre-monsoon
The declining trend of the plots indicates that the speed of the training of the network
gradually decreased with time. This means that the weights of the network had to be
constantly updated to reduce the errors in the start of the training process and as the
network was trained, the errors reduced and therefore minute updating was required
by the end. From the generated plots, it can be observed that the training process of the 10-layer plot was the smoothest, while the training error was least for the 5-layer plot on the training dataset, with a gradient of only 5.1162 by the end of training. From the Mu plot, which represents the learning rate, it can be noted that the 5- and 10-layer networks followed a similar trend and reached the same learning-rate value of 1 towards the end, implying that the weights were being updated and responses generated much more quickly towards the end of the process. The system with 30 layers was learning more slowly than the other two.

Fig:4.16 Training state plots of Cascade network for post-monsoon


The gradient values for the 5-, 10-, and 30-layer networks were 2.6013, 40.5493, and 15.4323 respectively. The gradient plot of the 5-layer network was the best: towards the end the weights were updated infrequently, a result of low error, implying that the network had been trained properly. The training of the 30-layer plot was smooth but still showed some error at the end, and the 10-layer network had the largest error by the end of the training process.

Fig:4.17 Regression plots of Cascade network for pre-monsoon
Through the regression plot, the variation between the actual and the predicted output can be determined, since this network type can predict the outputs. For precision purposes, regression values greater than 0.8 are considered suitable and accurate. In the regression plot, the x-axis is the sample number and the y-axis represents the WQI. The dotted line in the background of each plot represents the actual outputs and therefore has a regression value of 1, indicating perfect agreement. Each regression figure contains four plots: the blue line represents the performance and predictions for the training dataset, the green line those for the validation dataset, whose values are not familiar to the network, the red line the performance on the testing dataset, and the black line the overall performance. Regression values can range from -1 to 1 and should be close to 1, indicating that the variation between the predicted and actual outputs, and hence the error, is minimal. The results for the pre-monsoon data were highly accurate for all three networks, as all values were greater than 0.8 and the variations between the plots were very small.
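The regression value described here can be computed as a Pearson-type correlation between the actual and the predicted outputs; the following is an illustrative sketch under that assumption, with hypothetical WQI values.

```python
import math

def regression_value(actual, predicted):
    """Pearson-type correlation between actual and predicted values;
    a value near 1 means the predictions track the measurements closely
    (the thesis treats values above 0.8 as acceptable)."""
    n = len(actual)
    ma, mp = sum(actual) / n, sum(predicted) / n
    cov = sum((a - ma) * (p - mp) for a, p in zip(actual, predicted))
    sa = math.sqrt(sum((a - ma) ** 2 for a in actual))
    sp = math.sqrt(sum((p - mp) ** 2 for p in predicted))
    return cov / (sa * sp)

r = regression_value([10, 20, 30], [11, 19, 31])   # close to 1
```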

Fig:4.18 Regression plots of Cascade model for post-monsoon


For the post-monsoon data as well, the performance of all three networks on the training, validation, and testing datasets was greater than 0.8 and very close to 1, indicating near-perfect agreement between the predicted and actual outputs.

Feedforward Neural Network
This type of network also facilitates the prediction of the outputs and therefore the
precision of the model can be examined in a much more concrete manner. Networks
with 5, 10 and 30 layers were compared against each other. The plots generated here
were the performance plot, training state plot and the regression plot. Comparison for
each plot for the pre-monsoon and post-monsoon is depicted below:

Fig:4.19 Performance plots of Feed-forward network for pre-monsoon


The best performance, with the least MSE on the validation dataset, was for the network with 5 layers; the MSE at the best validation point was 17.6762.

Fig:4.20 Performance plots of Feed-forward network for post-monsoon


Upon the analysis of the post-monsoon data, it can be noted that the network with 10
layers proved to have the least error for the validation dataset after 3 epochs making it
the most precise among others.

Fig:4.21 Training state plots of feed-forward network for pre-monsoon


For the pre-monsoon data, the training error of the 30-layer network was the least, indicating that the network was trained accurately; the weight changes required at the end of the training process were smaller than those of the other networks, and its learning rate was the lowest at the end.

Fig:4.22 Training state plots of feed-forward network for post-monsoon
For the post-monsoon data as well, the network with the least error on the training dataset was the one with 30 layers, whereas the learning rate of the 5-layer network was the lowest towards the end.

Fig:4.23 Regression plots of Feed-forward network for pre-monsoon


From the regression plots of the pre-monsoon data, it can be observed that the accuracy of the 5- and 10-layer networks was high, as their regression values for each dataset were above 0.8, whereas the 30-layer network had regression values below 0.8 for all datasets, indicating a large variation between the predicted and the actual outputs. This could be because of the small sample size and the excess number of layers.

Fig:4.24 Regression plots of Feed-forward network for post-monsoon


For the post-monsoon dataset, the 10-layer network performed best, with the smallest deviation of the predicted values from the actual values of the output; its regression values were greater than 0.8 for each dataset. The precision of the 5- and 30-layer networks was much lower by comparison, indicating a large difference between the predicted and the actual outputs.

5. Variation of Ions in water samples
In places where access to clean water is a concern, the study of hydrogeochemical
variation of ions in water samples has grown in significance. The Hill Piper diagram,
which depicts the dominating ions in the water samples, can be used to analyze this
variation. Utilizing Geographic Information Systems (GIS) to spatially analyze and map
the change of ions in water samples is another method. In order to forecast the variation of ions in water samples, self-organizing maps and other artificial neural network (ANN) models have also been used. These models can help in comprehending the complex relationships between various ions and their sources, thereby supporting water resource management and conservation efforts.

5.1. Hydrogeochemical characterization of a water sample using Hill piper

Fig: 5.1 Illustration of Hill-Piper Diagram

The Hill-Piper diagram, often also known as the Piper trilinear diagram or the Piper plot, is a graphical representation of water chemistry data that illustrates the relative proportions of the major chemical constituents, namely cations and anions, in a water sample. The diagram was developed by Arthur M. Piper in 1944, building on earlier work by R. A. Hill, and is frequently used in hydrogeology, environmental science, and water resource management.
The diagram consists of two triangular fields, one showing the relative proportions of the major cations and the other those of the major anions, whose points are projected into a central diamond field that summarizes the overall water type. The major cations and anions, such as sodium (Na+), potassium (K+), calcium (Ca2+), magnesium (Mg2+), chloride (Cl-), sulfate (SO42-), and bicarbonate (HCO3-), are represented at the triangles' vertices.

● Magnesium bicarbonate + mixed + calcium chloride: alkaline earths exceed alkalis
● Sodium chloride + mixed + sodium bicarbonate: alkalis exceed alkaline earths
● Magnesium bicarbonate + sodium bicarbonate + mixed: weak acids exceed strong acids
● Calcium chloride + sodium chloride + mixed: strong acids exceed weak acids
These are not chemical reactions but hydrochemical facies: groupings of water types defined by the relative proportions of different chemical species. In the first grouping, the alkaline earths (calcium and magnesium) exceed the alkalis (sodium and potassium), while in the second the alkalis exceed the alkaline earths. The third grouping corresponds to weak acids (bicarbonate) exceeding strong acids (chloride and sulfate), and the fourth to strong acids exceeding weak acids. These facies characterize the overall chemistry of the water and can be used to classify water types based on their chemical composition.
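Before points can be plotted on the triangular fields, ion concentrations in mg/L are converted to milliequivalents per litre and expressed as a percentage of the total cations or anions. A sketch of this standard conversion follows; the sample concentrations below are hypothetical, not measurements from the study.

```python
# Equivalent weights (mg per meq) of the major ions on a Piper diagram:
# the atomic/formula weight divided by the ion's charge.
EQ_WEIGHT = {
    "Ca": 20.04, "Mg": 12.15, "Na": 22.99, "K": 39.10,   # cations
    "Cl": 35.45, "SO4": 48.03, "HCO3": 61.02,            # anions
}

def meq_percent(mg_per_l, ions):
    """Convert mg/L concentrations to meq/L and express each ion as a
    percentage of the listed group (cations or anions)."""
    meq = {i: mg_per_l[i] / EQ_WEIGHT[i] for i in ions}
    total = sum(meq.values())
    return {i: 100 * v / total for i, v in meq.items()}

# hypothetical sample, concentrations in mg/L
cations = meq_percent({"Ca": 40, "Mg": 12, "Na": 46, "K": 4},
                      ["Ca", "Mg", "Na", "K"])
```

The resulting percentages (which always sum to 100 within each group) are what is actually plotted on the cation and anion triangles.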

Fig: 5.2 Hill piper Diagram for Fig: 5.3 Hill piper Diagram for
Pre-monsoon 2019-2020 Post-monsoon 2019-2020

These diagrams demonstrate that sodium and potassium ions (Na+-K+) dominate over
other cations in the water samples analyzed. According to this, the concentration of
these two elements in the water is higher than that of any other cation.
The central diamond field of the Piper diagram is another graphical means of identifying the predominant water type. According to it, all of the analyzed water samples were of the sodium chloride type, indicating that their concentrations of sodium and chloride ions were higher than those of the other cations and anions.
Collectively, these results imply that the sodium and chloride concentrations in the analyzed water samples are quite high and that the relative proportions of the various cations and anions can change with the particular chemical makeup of the water. This information can be helpful in spotting possible water quality problems and in developing effective management plans. It is crucial to remember, however, that these results apply only to the water samples examined and might not generalize to other water sources. Additional study and analysis may be required to confirm these results and their wider implications for water resource management.

5.2. Differentiation based on a variation of ions (GIS)


Researchers follow various methods to evaluate water quality, and with advancements in technology, several advanced techniques can be used to characterize the variability of parameters in the data. Here, we used the QGIS software to analyze how the various ions in water samples collected at different times vary. Specifically, we compiled results for the pre-monsoon and post-monsoon seasons of 2019-2020; by comparing the data from the two periods, we can determine how ion levels fluctuate over time.
Our primary focus was on analyzing the spatial and temporal variations of ions. The
term spatial refers to the changes in the levels of ions across different locations,
whereas temporal variation pertains to the changes in the levels of ions over time. This
was the main area of interest that we aimed to investigate.

Fig: 5.4 Spatial variation of calcium Ions

Fig: 5.5 Temporal Variation of WQI

The spatial variation map we have generated is specifically for the calcium ion, and it
allows us to identify areas where the values are exceeding the permissible limit, as
indicated by the legend in the figure. By observing the map, we can see regions where
the ion levels are denoted by yellow, pink, and red, indicating that they are beyond the
desirable limit and potentially nearing the permissible limit. This information could
serve as a starting point for conducting further studies to understand the reasons for
these observations. Similarly, one can set limits for other parameters and analyze their
spatial variation using different colors on the map.

5.3. Differentiation using machine model (SOM)

Fig: 5.6 Architecture of SOM Fig: 5.7 Output of SOM

An artificial neural network that learns patterns and relationships in data on its own is
called a "self-organizing network." Self-organizing neural networks can learn the
underlying structure of the data on their own, as opposed to supervised neural
networks, which need labeled input to train from. A self-organizing network's primary
principle is to modify the weights of the network's neurons in search of the best
possible representation of the input data. A layer of input neurons and a layer of
output neurons make up the network. While the output neurons represent various
groups of related data points, the input neurons take in the raw data. A set of input
samples is presented to the network during training. The output-layer neurons' weights are initialized at random and subsequently adjusted according to how close they are to the input samples: neurons near an input sample are updated more strongly than neurons farther away.
As a result of this process, the neurons eventually organize into clusters that stand in
for various collections of related input data samples. These clusters can then be used to
group incoming data points according to how closely they resemble the clusters that
already exist.
Clustering, dimensionality reduction, and feature extraction are examples of
unsupervised learning tasks that frequently make use of self-organizing networks.
They have been applied to a variety of tasks, such as voice and picture recognition and
anomaly detection.
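The weight-update rule described above, in which the best-matching neuron and, more weakly, its grid neighbours are pulled toward each input sample, can be sketched as follows. The grid size, learning rate, neighbourhood schedule, and toy data are illustrative assumptions, not the settings used in the study.

```python
import numpy as np

def som_train(data, grid=(3, 3), epochs=50, lr=0.5, seed=0):
    """Minimal Self-Organizing Map: weights start random; for each sample
    the best-matching neuron and its grid neighbours are pulled toward
    the sample, with a neighbourhood that shrinks over time."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    w = rng.normal(size=(rows * cols, data.shape[1]))
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)])
    for epoch in range(epochs):
        radius = rows * (1 - epoch / epochs) + 0.5          # shrinking neighbourhood
        for x in data:
            bmu = np.argmin(((w - x) ** 2).sum(axis=1))     # best-matching unit
            d2 = ((coords - coords[bmu]) ** 2).sum(axis=1)  # squared grid distance
            influence = np.exp(-d2 / (2 * radius ** 2))
            w += lr * influence[:, None] * (x - w)          # pull toward sample
    return w

# two hypothetical clusters of samples
data = np.vstack([np.zeros((5, 2)), np.full((5, 2), 5.0)])
weights = som_train(data)
```

After training, different neurons specialize in the two clusters, which is exactly the clustering behaviour the Sample Hits and Weight Planes plots visualize.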
The data used for this research are from 2019-2020 and can be compared with future datasets to identify changes over time and variations of parameters across locations. The following results are for the pre-monsoon and post-monsoon periods respectively.

Fig: 5.8 Allocation of Neurons for Input samples


The Sample Hits plot is a visualization tool that displays the location of each neuron in
34
a two-dimensional grid and the number of observations connected to it in order to
evaluate the performance of a Self-Organizing Map (SOM). No neuron has more than
two observations associated with it in either the pre-monsoon plot or the post-
monsoon plot, which shows that the distribution of observations is fairly balanced
across the neurons.
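The hit counts underlying such a plot are simple to compute: find each sample's best-matching unit and increment that neuron's counter. A sketch, assuming a weight array of shape grid_h × grid_w × n_features:

```python
import numpy as np

def sample_hits(weights, data):
    """Count how many input samples have each neuron as their
    best-matching unit (the neuron with the closest weight vector)."""
    hits = np.zeros(weights.shape[:2], dtype=int)
    for x in data:
        d = np.linalg.norm(weights - x, axis=2)
        hits[np.unravel_index(np.argmin(d), d.shape)] += 1
    return hits
```

A balanced map, as in both seasonal plots here, is one where no single entry of this array dominates the total.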

Fig: 5.9 Neuron to neuron variation for each parameter

The Weight Planes plot displays a two-dimensional grid showing the weights of the
input parameters: TDS, pH, HCO3, Cl, SO4, Ca, Mg, Na, K, and NO3.
Neurons with similar weights are depicted using a light yellow color, while those with
dissimilar weights are shown using darker shades of red and orange. When the
weights are similar, it indicates a high correlation between the neurons. In the pre-
monsoon Weight Planes plot, the pH input parameter has similar weights across the
grid, while in the post-monsoon plot, the weights vary significantly. This plot is useful
for comparing the weights of different input parameters to identify seasonal and
temporal variations.
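Each weight plane is simply one slice of the trained weight array, and the spread of values within a plane shows whether that parameter's weights are similar across the map (as observed for pH pre-monsoon) or vary strongly. A sketch, with the parameter names passed in by the caller:

```python
import numpy as np

def weight_planes(weights, names):
    """One 2-D plane per input parameter: the slice weights[:, :, i]."""
    return {name: weights[:, :, i] for i, name in enumerate(names)}

def plane_spread(plane):
    """Range of weight values in a plane; near zero means the weights
    are similar everywhere on the map for that parameter."""
    return float(plane.max() - plane.min())
```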

Fig: 5.10 Distances between Neurons
The Neighbouring Weight Distances plot (often called a U-matrix) is a visualization
tool that shows the separation between neighbouring nodes in a SOM. It comprises
nodes, shown as blue circles, connected by red lines. The hexagons that the lines cross
are shaded in different hues: lighter hues denote nodes that are closer together in
weight space and more likely to influence one another, while darker hues denote
nodes that are farther apart and less likely to do so. The pre-monsoon plot shows
irregularities and greater distances between nodes, while in the post-monsoon figure
the weights are more uniformly distributed and closer together. Overall, the
Self-Organizing Map methodology works well for assessing groundwater quality.
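The quantity behind this plot is the distance between each neuron's weight vector and those of its grid neighbours, averaged per neuron. A sketch using 4-connected neighbours on a rectangular grid (hexagonal maps, as in the figure, use six neighbours instead):

```python
import numpy as np

def u_matrix(weights):
    """Mean weight-space distance from each neuron to its grid neighbours
    (up/down/left/right). Small values mean similar neighbouring nodes."""
    h, w, _ = weights.shape
    um = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            dists = []
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < h and 0 <= nj < w:
                    dists.append(np.linalg.norm(weights[i, j] - weights[ni, nj]))
            um[i, j] = np.mean(dists)
    return um
```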

The information offered by the presented data makes it easy to comprehend how the
input weights change over the pre-monsoon and post-monsoon periods, as well as the
spatial distribution and distances between samples. These findings are essential for
identifying and tracking water contamination, especially in light of the present global
water crisis. Monitoring changes in the distribution and make-up of water samples
over time and place can help focus future research and lead to the development of
workable remedies to decrease the effects of water-related disasters. These results
significantly advance our knowledge of the intricate dynamics governing water quality
and availability and have substantial implications for global sustainability and
resilience in the face of environmental issues.

5.4. Variable correlation Diagram

Fig: 5.11 Correlation Plot


By analyzing the correlation plot, it is possible to determine the sensitivity of the
water quality index to each parameter. The values on the plot range from -1 to 1:
a coefficient of 1 indicates a perfect positive (directly proportional) relationship,
-1 a perfect inverse relationship, and 0 no linear relationship between the variables.

Based on the results of the analysis, it can be observed that TDS, chloride, and sodium
exhibit high correlation coefficients, above 0.94, indicating a strong positive
correlation with the water quality index. This implies that changes in these parameters
are likely to have a significant impact on the overall quality of the water.
Furthermore, the correlation plot can be used to identify potential outliers and
relationships between variables that may not be immediately apparent through other
means of data analysis.
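The coefficients in such a plot are ordinary Pearson correlations. For instance, with made-up values that merely mimic the pattern above (not the project's measurements):

```python
import numpy as np

# illustrative values only, not the study's dataset
tds = np.array([410.0, 820, 1230, 1650, 2050])
na = np.array([55.0, 110, 168, 220, 279])
ph = np.array([7.8, 7.2, 7.5, 7.1, 7.6])

# np.corrcoef returns the matrix of pairwise Pearson coefficients in [-1, 1]
r = np.corrcoef([tds, na, ph])
print(f"TDS vs Na: {r[0, 1]:.3f}, TDS vs pH: {r[0, 2]:.3f}")
```

Here TDS and Na track each other almost perfectly while pH is only weakly related, mirroring the strong TDS/chloride/sodium cluster seen in the plot.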

6. Schedule of tasks and milestones


The tasks were initially listed along with a deadline for finishing each one, and each
task was completed on schedule or earlier. The total time for the completion of this
project, including the report, was 32 weeks. The time plan for the project work
follows:

Fig: 6.1 Time plan

7. Conclusion
Artificial Neural Networks are computer-based models that imitate how biological
neurons in the human brain work. These models can be trained on a set of input
parameters to forecast an output or classify an item. In the context of water quality
assessment, ANN models have been used to predict the Water Quality Index (WQI), a
single number that summarizes the overall quality of a water sample based on
numerous physicochemical factors. One advantage of employing ANN models for
water quality assessment is the capacity to conserve resources: traditionally,
measuring the numerous physicochemical properties of water samples requires
performing multiple laboratory experiments, a procedure that can be costly and
time-consuming. In contrast, ANN models can be trained using a relatively small
dataset and then used to predict WQI values for new water samples without
conducting additional tests.
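As an illustration of this idea only (not the networks used in this project), a minimal one-hidden-layer regressor can be trained to map normalized water parameters to a WQI-like target:

```python
import numpy as np

def train_wqi_net(X, y, hidden=8, epochs=3000, lr=0.1, seed=0):
    """Tiny feed-forward net (tanh hidden layer, linear output) trained by
    batch gradient descent on mean squared error. Returns a predict function.
    Illustrative only; inputs are assumed normalized to [0, 1]."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0, 0.5, (X.shape[1], hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.5, (hidden, 1)); b2 = np.zeros(1)
    y = y.reshape(-1, 1)
    for _ in range(epochs):
        h = np.tanh(X @ W1 + b1)   # hidden activations
        pred = h @ W2 + b2         # predicted WQI
        err = pred - y
        # gradients of the mean squared error, backpropagated layer by layer
        gW2 = h.T @ err / len(X); gb2 = err.mean(0)
        dh = (err @ W2.T) * (1 - h ** 2)
        gW1 = X.T @ dh / len(X); gb1 = dh.mean(0)
        W2 -= lr * gW2; b2 -= lr * gb2
        W1 -= lr * gW1; b1 -= lr * gb1
    return lambda Xn: (np.tanh(Xn @ W1 + b1) @ W2 + b2).ravel()
```

Once trained, the returned function predicts WQI-like values for new samples without any further laboratory measurements, which is the resource saving discussed above.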
In this case, ANN models have been trained using a set of 10 parameters specifically
chosen to determine the WQI in the Indian subcontinent (Nagapattinam district). These
parameters were selected based on their relevance to water quality and their
availability in water quality datasets. The models demonstrated their ability to predict
WQI values when WHO-defined parameters were used, indicating the potential for
global application. However, it is important to note that variations in network
parameters can affect the results: a larger training dataset with more members may
yield better regression values for a given learning rate and gradient method. In
addition, studies have shown that a 10-layer network model has the highest regression
when predicting WQI values. Despite these considerations, the use of ANN models in
water quality assessment has the potential to simplify the complexity associated with
interpreting the WQI. This can make the assessment process more efficient and
cost-effective, particularly in regions where access to water quality testing equipment
is limited.
8. Project demonstration
This study makes intensive use of soft computing techniques: different ANN models,
namely cascade-forward backpropagation, feed-forward, Elman backpropagation, and
NARX neural networks, together with self-organizing maps, were used to train and
predict the water quality index of the Nagapattinam district in Tamil Nadu. Other
tools included GIS for mapping spatial and temporal variations, a decision tree, a
correlation plot for determining the impact of each parameter on the WQI, cleaning
and normalization of the data, and Python code for classifying the samples from the
machine model.

9. References

● Arunprakash M., Krishnamurthy R.R., & Jayaprakash M. (2013). "Impact of
urbanization in groundwater of south Chennai City, Tamil Nadu, India",
Environmental Earth Sciences, 71(2).
● Saeedi M., Abessi O., Sharifi F., & Meraji H. (2010). "Development of
groundwater quality index", Environmental Monitoring and Assessment, 327-335.
● Ramakrishnaiah C.R., Sadashivaiah C., & Ranganna G. (2009). "Assessment of
Water Quality Index for the Groundwater in Tumkur Taluk, Karnataka State,
India".
● Kumar S.K., Bharani R., Magesh N.S., Godson P.S., & Chandrasekar N. (2014).
"Hydrogeochemistry and groundwater quality appraisal of part of south
Chennai coastal aquifers, Tamil Nadu, India using WQI and fuzzy logic
method", Applied Water Science, 341-350.
● Walczak S., & Cerpa N. (2003). "Artificial Neural Networks", Encyclopedia of
Physical Science and Technology (Third Edition).
● Ram A., Tiwari S.K., Pandey H.K., Chaurasia A.K., Singh S., & Singh Y.V.
(2020). "Groundwater quality assessment using water quality index (WQI)
under GIS framework".
● Malekian A., & Chitsaz N. (2021). "Concepts, procedures, and applications of
artificial neural network models in streamflow forecasting", Advances in
Streamflow Forecasting.
● Mostaza-Colado D., Carreño-Conde F., Rasines-Ladero R., & Iepure S. (2020).
Science of the Total Environment.

