
ASSESSMENT OF GROUNDWATER QUALITY USING SOFT

COMPUTING TECHNIQUES

Submitted in partial fulfillment of the requirements for the degree of

Bachelor of Technology

in

Civil Engineering
by
Ashish Malegaon - 19BCL0181
Pratham Kumar - 19BCL0085
Rishu Kumar Thakur - 19BCL0169

Under the guidance of


Prof. Amit Mahindrakar Baburao

School of Civil Engineering

VIT, Vellore.

April 2023
DECLARATION

I hereby declare that the thesis entitled “Assessment of Groundwater Quality Using
Soft Computing Techniques” submitted by me, for the award of the degree of
Bachelor of Technology in Civil Engineering to VIT, is a record of bonafide work
carried out by me under the supervision of Prof. Amit Mahindrakar Baburao.
I further declare that the work reported in this thesis has not been submitted
and will not be submitted, either in part or in full, for the award of any other degree or
diploma in this institute or any other institute or university.

Place: Vellore
Date:

MALEGAON ASHISH

RISHU KUMAR THAKUR

PRATHAM KUMAR

Signature of the candidate

CERTIFICATE
This is to certify that the thesis entitled “Assessment of Groundwater
Quality Using Soft Computing Techniques” submitted by Ashish Malegaon,
Pratham Kumar, and Rishu Kumar Thakur, School of Civil Engineering, VIT,
for the award of the degree of Bachelor of Technology in Civil Engineering, is a
record of bonafide work carried out by them under my supervision during the period
01.12.2022 to 30.04.2023, as per the VIT academic and research ethics code.

The contents of this report have not been submitted and will not be submitted,
either in part or in full, for the award of any other degree or diploma in this
institute or any other institute or university. The thesis fulfills the requirements and
regulations of the University and, in my opinion, meets the necessary standards for
submission.

Place: Vellore
Date: Signature of the Guide

Internal Examiner External Examiner

Head of the Department


Civil Engineering
ACKNOWLEDGEMENTS

I would like to express my deepest gratitude to my capstone project guide, Mr. Amit
Mahindrakar, for his unwavering support and guidance throughout the entire duration
of this project. His expertise, patience, and dedication were instrumental in shaping
and bringing the project to fruition. His insightful feedback and valuable suggestions
helped me overcome challenges and advance the project.

I would also like to extend my heartfelt thanks to Professor Uma Shankar for his
constant encouragement, motivation, and valuable input that enriched my
understanding of the project's scope and significance. His mentorship and wisdom
have been invaluable in shaping my project and my overall academic journey.

I would also like to acknowledge the lab assistants who provided me with timely and
relevant information whenever required. Their assistance in gathering and analyzing
data, conducting experiments, and managing the laboratory resources was invaluable
and greatly contributed to the success of this project.

Once again, I express my deepest gratitude to everyone who has contributed to the
successful completion of this capstone project. Your support and guidance have been
invaluable, and I am truly honored and privileged to have had the opportunity to work
with such amazing mentors and colleagues.

Student Name:
(i) MALEGAON ASHISH
(ii) RISHU KUMAR THAKUR
(iii) PRATHAM KUMAR

Executive Summary

In this report, the use of soft computing techniques, including Artificial Neural
Networks (ANN), Fuzzy Logic (FL), and Genetic Algorithms (GA), to assess
groundwater quality is explored, with the aim of demonstrating their effectiveness in
predicting water quality indicators, such as pH, turbidity, and Total Dissolved Solids
(TDS). It is emphasized that the assessment of groundwater quality is crucial for
ensuring the safety and sustainability of water resources, and soft computing
techniques provide a valuable tool for analyzing complex data sets and making
predictions about water quality.

Existing research papers were extensively reviewed, and it was identified that
although Geographic Information System (GIS) technology is making rapid
advancements, it is unable to accurately process data that contains missing
information, leading to inconsistencies. This poses a challenge when determining
water quality as it requires the measurement of several parameters to calculate a
Water Quality Index (WQI). To overcome this challenge, soft computing techniques
have been employed to predict WQI, simplifying the process. Various plots were used
to assess the effectiveness of four different soft computing models that were utilized to
predict WQI.

The results revealed that the Artificial Neural Network (ANN) model exhibited a high
level of agreement between predicted and actual values, with regression values above
0.8 (over 80% accuracy). This finding highlights the potential of soft computing techniques
as a useful tool for predicting water quality, which could greatly enhance our ability to
manage and protect our water resources. Furthermore, WQI is predicted with high
precision by measuring multiple hydro-chemical parameters and utilizing soft
computing techniques, making it a viable alternative to costly water quality
measurement stations.

TABLE OF CONTENTS

SR NO.   DESCRIPTION                                                  PAGE NO.

i.       Acknowledgment                                               I
ii.      Executive Summary                                            II
iii.     Table of Contents                                            III-IV
iv.      List of Figures                                              V-VI
v.       List of Tables                                               VII
vi.      Abbreviations                                                VIII
1.       Introduction                                                 1
         1.1 Literature Review                                        2
         1.2 Objectives                                               5
         1.3 Motivation                                               5
         1.4 Background                                               6
2.       Project Description and Goals                                7
         2.1 Methodology                                              8
         2.2 Primary Goals                                            8
3.       Technical Specification                                      9
         3.1 Collection of Data                                       9
             3.1.1 Pre-processing of Collected Data                   10
             3.1.2 Detection of Outliers                              10
         3.2 The Conventional Method of Finding the Water
             Quality Index                                            12
4.       Advancements in Water Quality Prediction Using ANN           15
         4.1 Advantages of Machine Learning (ANN)                     17
         4.2 Working of Different Models                              19
         4.3 Result and Discussion                                    21
5.       Variation of Ions in Water Sample                            31
         5.1 Hydrogeochemical Characterization of a Water Sample
             Using Hill-Piper                                         31
         5.2 Differentiation Based on a Variation of Ions (QGIS)      33
         5.3 Differentiation Using a Machine Model (SOM)              34
         5.4 Variable Correlation Diagram                             38
6.       Schedule, Tasks and Milestones                               38
7.       Conclusion                                                   39
8.       Project Demonstration                                        39
9.       References                                                   40
List of Figures

Fig. No   Caption                                                         Page No.

Fig 1.1   Study area                                                      7
Fig 2.1   Methodology                                                     8
Fig 3.1   Detection of outliers                                           11
Fig 4.1   Architecture of the NARX neural network                         19
Fig 4.2   Architecture of the Elman Backpropagation Neural Network        20
Fig 4.3   Architecture of the Cascade Forward Backpropagation
          Neural Network                                                  20
Fig 4.4   Architecture of the Feed-forward Neural Network                 21
Fig 4.5   Performance plots of NARX network for Pre-monsoon               22
Fig 4.6   Performance plots of NARX network for Post-monsoon              23
Fig 4.7   Training state plots of NARX network for Pre-monsoon            23
Fig 4.8   Training state plots of NARX network for Post-monsoon           24
Fig 4.9   Performance plots of Elman network for Pre-monsoon              24
Fig 4.10  Performance plots of Elman network for Post-monsoon             25
Fig 4.11  Training state plots of Elman network for Pre-monsoon           25
Fig 4.12  Training state plots of Elman network for Post-monsoon          25
Fig 4.13  Performance plots of Cascade network for Pre-monsoon            26
Fig 4.14  Performance plots of Cascade network for Post-monsoon           26
Fig 4.15  Training state plots of Cascade network for Pre-monsoon         27
Fig 4.16  Training state plots of Cascade network for Post-monsoon        27
Fig 4.17  Regression plots of Cascade network for Pre-monsoon             28
Fig 4.18  Regression plots of Cascade network for Post-monsoon            28
Fig 4.19  Performance plots of Feed-forward network for Pre-monsoon       29
Fig 4.20  Performance plots of Feed-forward network for Post-monsoon      29
Fig 4.21  Training state plots of Feed-forward network for Pre-monsoon    29
Fig 4.22  Training state plots of Feed-forward network for Post-monsoon   30
Fig 4.23  Regression plots of Feed-forward network for Pre-monsoon        30
Fig 4.24  Regression plots of Feed-forward network for Post-monsoon       30
Fig 5.1   Illustration of Hill-Piper diagram                              31
Fig 5.2   Hill-Piper diagram for Pre-monsoon 2019-2020                    32
Fig 5.3   Hill-Piper diagram for Post-monsoon 2019-2020                   32
Fig 5.4   Spatial variation of Calcium ions                               33
Fig 5.5   Temporal variation of WQI                                       34
Fig 5.6   Architecture of SOM                                             34
Fig 5.7   Output of SOM                                                   34
Fig 5.8   Allocation of neurons for input samples                         35
Fig 5.9   Neuron-to-neuron variation for each parameter                   36
Fig 5.10  Distances between neurons                                       37
Fig 5.11  Correlation plot                                                38
Fig 6.1   Time plan                                                       38
List of Tables

Tab. No   Caption                                                         Page No.

Tab 3.1   Raw data collected for the pre-monsoon period of 2019-2020      9
Tab 3.2   Raw data collected for the post-monsoon period of 2019-2020     10
Tab 3.3   Statistical measures of input parameters                        11
Tab 3.4   Data normalization                                              12
Tab 3.5   Assigned weight for input parameters                            13
Tab 3.6   Calculated WQI for the pre-monsoon period of 2019-2020          14
Tab 3.7   Calculated WQI for the post-monsoon period of 2019-2020         14
Tab 3.8   WQI range                                                       15
Tab 3.9   WQI Indian standards                                            15
List of Abbreviations

ANN Artificial Neural Network

WQI Water Quality Index

TDS Total Dissolved Solids

Wi Relative weight

QGIS Quantum Geographic Information System

HHR Human Health Risk

ML Machine Learning

NARX Nonlinear AutoRegressive with eXogenous inputs

SOM Self-Organizing Map

MSE Mean Squared Error

GA Genetic Algorithm

FL Fuzzy Logic

1. INTRODUCTION

The assessment of groundwater quality is a crucial aspect of water resource


management, as groundwater is a major source of freshwater for human consumption,
agricultural use, and industrial applications. Traditional methods of water quality
assessment often involve collecting water samples and analyzing them in laboratories
for various parameters such as pH, turbidity, Total Dissolved Solids (TDS), and
chemical composition. However, these methods can prove time-consuming, costly,
and sometimes challenging to implement in large-scale studies. Furthermore, they do
not always provide accurate results due to various environmental factors, such as
fluctuations in water quality over time and the impact of external factors such as
human activities.
With the advent of technological advancements, innovative approaches to water
quality assessment have emerged, such as Soft Computing techniques. Soft Computing
techniques are computational methods that utilize artificial intelligence algorithms to
make predictions based on incomplete, uncertain, or imprecise information. These
methods have shown great potential for water quality assessment, as they can process
large amounts of data quickly and accurately, without requiring a large amount of
prior knowledge about the system being analyzed. Moreover, these techniques have
been found to be more efficient and cost-effective than traditional laboratory-based
methods.
This capstone project was mainly focused on evaluating the effectiveness of Soft
Computing techniques, including different models in Artificial Neural Networks (ANN)
for the assessment of groundwater quality. The primary aim of this study was to
compare the performance of these Soft Computing techniques with traditional
laboratory-based methods of water quality assessment. Specifically, the project also
investigated the accuracy of different models in predicting the Water Quality Index (WQI).
The findings of this study provide valuable insights for researchers and policymakers
in the development of effective strategies for water resource management and help
ensure the safety of the drinking water supply. The use of Soft Computing techniques
in water quality assessment could revolutionize the monitoring and management of
water resources, making them more efficient, cost-effective, and accurate.

1.1. Literature Review
(Saeedi M. et al.), in their article "Development of Groundwater Quality Index", present a
study focused on formulating a water quality index that specifically assesses the suitability
of drinking water. The authors recognize the significance of reliable assessments of water
quality, especially for groundwater sources that are widely used for drinking water supply.
The article outlines the selection of 13 water quality parameters, including pH, total
dissolved solids, and heavy metal concentrations, that have a significant impact on the
quality of water. These parameters are then weighted and aggregated to create a single index
value that represents the overall quality of the water. The development of such an index is
crucial in ensuring that the public is provided with safe and potable water, as well as
identifying and addressing any potential issues with the water supply. [1]

(Arunprakash M. et al.), in their article "Impact of Urbanization in Groundwater of South
Chennai City", discuss the impact of urbanization on groundwater quality in the South
Chennai City area. The authors analyze the seasonal variation of groundwater quality and
compare it to the city's growing urbanization. They investigate the presence of various
pollutants such as nitrates, chlorides, and total dissolved solids (TDS) and how they are
affected by urbanization. The study shows that there is a significant impact of urbanization
on groundwater quality, with increased levels of pollutants found in areas of high
urbanization. The article highlights the need for sustainable urban development and the
implementation of effective water management strategies to ensure the preservation of
groundwater quality in the region. [2]

(Qian H. and Dimalla N.), in their article "Groundwater quality evaluation using (WQI) for
drinking purposes and human health risk (HHR) assessment in an agricultural region of
Nanganur, South India," present a research study focused on evaluating the quality of
groundwater for drinking purposes in an agricultural region of Nanganur, South India. The
authors conducted a Human Health Risk (HHR) evaluation to determine the potential health
risks related to consuming contaminated water, as well as a Water Quality Index (WQI)
assessment to determine whether the groundwater was suitable for drinking. The
investigation discovered that the area's groundwater was of poor quality, with high
concentrations of heavy metals and other pollutants. The authors emphasize the
significance of effective water management and treatment. [3]

(Lakshmipriya A.R et al.) in their article "Groundwater Quality Analysis", present a study
focused on analyzing the quality of groundwater by testing samples for various chemical
parameters. The authors compared the values of these parameters to the desirable limits set
by the Indian Standards (IS) to assess the suitability of groundwater for different purposes.
The study found that the groundwater in the area was contaminated with high levels of
pollutants such as nitrates, fluoride, and iron. The authors suggest the implementation of
appropriate water management strategies and regular monitoring of water sources to ensure
that the local population has access to safe and potable water. The study highlights the
importance of conducting regular assessments of groundwater quality and emphasizes the
need for adherence to national standards and guidelines to prevent potential health risks
associated with contaminated water. [4]

(Varadarajan N. and Purandara B.K.) present a case study titled "Groundwater Quality
Investigations," where they collected water samples from Belgaum and Bijapur districts of
Karnataka and analyzed them using standard chemical testing methods. The authors then
used the Hill-Piper diagram to classify the water samples based on their geochemical
composition and determine their suitability for various purposes, such as irrigation and
drinking. Additionally, they calculated the salinity levels of the water samples using the US
salinity diagram. The study provides insight into the quality of groundwater in these regions
and can aid in the development of effective water management strategies to ensure access to
safe and clean water for the local population. [5]

(Prasanna M.V. et al.) investigated the quality of groundwater in the Gadilam basin by
gathering water samples throughout the year for their study titled "Study of Evaluation of
Groundwater in Gadilam Basin Using Hydrogeochemical and Isotope Data". In the study,
test results were compared using correlation plots after the hydrogeochemical and isotope
data of the samples were analyzed. In order to find any potential causes of pollution, the
authors examined the chemical makeup of water from various sources and environments.
The study offers insightful data on the Gadilam basin's groundwater quality and can aid in
the creation of long-term water management strategies to guarantee the community's access
to clean, safe water. [6]

(Bharani R. et al.) published an article titled "Hydrogeochemistry and groundwater quality


appraisal of part of south Chennai coastal aquifers, Tamil Nadu, India, using WQI and the
fuzzy logic method" in the journal Applied Water Sciences. The study involved evaluating
the hydrogeochemistry and quality of groundwater in the coastal aquifers of South Chennai
using a combination of methods, including fuzzy logic and the Water Quality Index (WQI).
The authors also employed tools such as ArcGIS and the Hill-Piper trilinear diagram to map
the distribution and characteristics of the groundwater samples. The study provides valuable
information on the quality of groundwater in this region and can assist in the development
of effective water management strategies to ensure access to clean and safe water for the
local population. [7]

(Ramakrishnaiah C.R. et al.) published an article titled "Assessment of Water Quality


Index for the Groundwater in Tumkur Taluk, Karnataka State, India" in the E-Journal of
Chemistry, which focused on assessing the quality of groundwater in Tumkur Taluk. The
authors considered 17 different parameters and used regression analysis to estimate the
Water Quality Index (WQI) values for the groundwater samples. The study found that the
groundwater in the region was moderately to severely polluted, with high concentrations of
pollutants such as nitrates, fluoride, and iron. The article highlights the need for regular
monitoring of water quality to ensure the safety and health of the local population. [8]

(Tiwari S.K. et al.) conducted a study titled "Groundwater quality assessment using water
quality index (WQI) under a GIS framework," published in the journal Applied Water
Science in 2021. The study focused on assessing the quality of groundwater using the Water
Quality Index (WQI) by considering 20 different parameters. The authors used GIS
(Geographic Information System) software to analyze the spatial distribution of water
quality and identify areas that were most affected by poor water quality. The study found
that the groundwater quality in the study area was generally poor, with high levels of
pollutants such as nitrates, fluoride, and iron. The article highlights the need for regular
monitoring of groundwater quality and the use of advanced tools such as GIS to facilitate
more efficient and effective water management strategies. [9]

(Balraj Singh et al.), in their article "Soft Computing Technique-Based Prediction of
Water Quality Index," focus on predicting water quality index (WQI) values
using three soft computing techniques. The authors considered ten parameters to calculate
WQI values and compared the results based on six fitness criteria. The study highlights the
potential of using soft computing techniques as an efficient and reliable method for
predicting water quality index values, which can aid in making informed decisions
regarding water management and preservation. [10]

These research papers made it evident that inaccuracies and omissions in field data
compromise the quality of the results. Although GIS has been developed continuously,
inconsistent data remain challenging to handle using traditional computing techniques.
Hence, a mechanism that could handle such conflicting data was needed.
Soft computing is made up of approaches that complement one another and offer a
flexible information-processing capability for dealing with the ambiguous scenarios that
arise in everyday life. These models tolerate inconsistent, error-filled, noisy, and
missing-value data. Thus, soft computing may offer a potent tool for GIS to solve the
inconsistent-data issue.

1.2. OBJECTIVES

● To examine the suitability of groundwater for drinking purposes using
conventional methods of obtaining the Water Quality Index (WQI), based on
44 samples from open wells in the Nagapattinam district.

● To map the spatial and temporal variations of groundwater ions,
including pH, Total Dissolved Solids (TDS), Bicarbonate, Chloride,
Sulphate, Calcium, Magnesium, Sodium, Potassium, and Nitrate, using
QGIS.

● To predict groundwater quality using the most suitable model from
among Artificial Neural Network techniques.

1.3. Motivation
A key component in managing water resources and ensuring people have access to
clean drinking water is groundwater quality evaluation. However, the traditional
procedures used to assess groundwater purity proved difficult because of noisy data.
Additionally, those approaches took a long time to produce correct results and could
not handle all the variables linked to water systems, whereas machine learning models
offered compelling ways of dealing with the complexity and variability of water
bodies. The choice to use soft computing strategies to evaluate the quality of
groundwater for the capstone project was motivated by two major factors.

First, soft computing approaches prove helpful for determining water quality because
they can work with ambiguous and fuzzy data. They can rapidly examine such data,
simulate nonlinear relationships, and generate accurate projections. Hence, there was
curiosity to learn how they could be applied to actual data.

Furthermore, this capstone project allowed us to practice using cutting-edge
computational tools and techniques. Completing this project deepened our technical
expertise and understanding of water resources engineering, particularly the
implementation of soft computing methods.

1.4. Background
Tamil Nadu, a state in southern India, includes the coastal district of Nagapattinam.
Located at 10.7668° N latitude and 79.8447° E longitude, it stands on the Bay of
Bengal. The district is rich in farmland and grows products like rice, sugar cane, and
coconuts over an area of around 2715 square kilometers. The Nagapattinam district
receives moderate to substantial rainfall from October to December during the
monsoon season. The district is primarily flat and low-lying, rising on average 5
meters above sea level. It has a tropical climate with hot, muggy weather all year.

Roughly 1248 mm of precipitation falls on average every year in the district, with
periodic variations due to factors like El Niño or La Niña. Over the year, relative
humidity fluctuates from 70 to 90 percent, so conditions are generally humid. Owing
to the monsoon rains, the district consistently records 90% or greater humidity from
July to September. Nagapattinam district's median temperature oscillates between
27 °C and 32 °C. The two warmest months are typically April and May, with midday
peaks often reaching 35 °C.

Fig: 1.1 Study Area

2. Project Description and Goals

The capstone project implemented soft computing algorithms to assess groundwater
quality. To derive predictive models for the water's features, the project combined
inputs obtained from numerous groundwater sources in the selected region with soft
computing techniques such as fuzzy logic and neural networks. Quantitative criteria
such as the mean absolute error and the regression value of the prediction were
employed to validate the models. The project also used sensitivity testing to
investigate the components that trigger groundwater contamination and their
respective importance. The research led to the formulation of productive groundwater
administration strategies and revealed insights into the prospective use of
computer-based tools for groundwater quality assessment.

2.1. Methodology

Fig: 2.1 Methodology


A well-structured plan is essential for the successful execution of the project. The
initial step involved obtaining sample data specific to the chosen region. The collected
data was then subjected to different analytical techniques to eliminate any errors. Once
the data was cleansed, the conventional weighted-average method was employed
to calculate the water quality index. In addition, QGIS was utilized to generate
maps depicting the variation of ions. Finally, various Artificial Neural Network models
were employed to predict the water quality index based on the collected data.

2.2. Primary goals


The public water supply and drainage board of Tamil Nadu provided the information
for this study, and the weighted-average approach was applied to calculate the water
quality index. The information was then thoroughly examined using a variety of
methodologies to discover more about the different physicochemical traits. Listed
below were the project's fundamental targets:
● The Hill-Piper diagram was chosen to illustrate the hydrochemical attributes
present in water samples.
● Based on the input criteria, an ANN (artificial neural network) model was
designed to predict the water quality index. Multiple algorithms were
examined with varying layer sizes to analyze performance outcomes.
● Correlation plots and machine models made it feasible to determine which
elements were most sensitive to fluctuations in the water quality score.
● Based on the threshold values of permissible and desirable limits, the
classification of water samples was accomplished.

3. Technical Specifications

Examining an immense amount of data was a challenging but essential step in
determining the quality of groundwater. Whether standard quantitative metrics or
machine learning models were used to predict results, the work followed similar
steps:

3.1. Collection of Data


The study was conducted in Nagapattinam district, Tamil Nadu, which comprises 44
open wells, and the data for the years 2019-2020 were analyzed. The raw data
was sourced from the Tamil Nadu Water Supply Board and included 10 different
parameters: pH, TDS, bicarbonate, chloride, sulfate, calcium, magnesium,
sodium, potassium, and nitrate ions. Data were collected for both the pre-monsoon
and post-monsoon periods of the specified years.
Table: 3.1 Raw Data Collected for the pre-monsoon period of 2019-2020

Table: 3.2 Raw Data Collected for the post-monsoon period of 2019-2020

3.1.1. Pre-processing of collected data


The data collected from a third-party source had some discrepancies, and since it
involved a large number of samples, there was a possibility of missing data. It would be
cumbersome to manually inspect for errors and calculate the values. Therefore, Python
programming was employed to process and clean the data and to remove any
anomalies present in the dataset. Additionally, the statistical measures of mean and
standard deviation were utilized to normalize the information. This approach enabled
the efficient processing of the data and the derivation of meaningful insights.
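As a minimal sketch of this step (the filename groundwater_premonsoon.csv is
hypothetical; the actual data came from the Tamil Nadu Water Supply Board), the
dataset could be loaded and inspected with pandas:

```python
import pandas as pd

# Hypothetical filename; the actual raw data came from the
# Tamil Nadu Water Supply Board, as described above.
df = pd.read_csv("groundwater_premonsoon.csv")

# Count missing entries per parameter before any cleaning.
print(df.isna().sum())

# Mean, standard deviation, minimum, and maximum per parameter,
# which are the statistical measures used in the following steps.
print(df.describe())
```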

3.1.2. Detection of Outliers


One of the simplest approaches to tackling the problem of errors and missing data in
the dataset was to substitute the missing values with the mean of the input values. To
achieve this, the basic statistical measures of mean, standard deviation, and
minimum and maximum values for the dataset were first computed. Following this,
any outliers present in the dataset, which are values that deviate significantly from
the central tendency of the data, were identified. This process of outlier detection and
mean substitution is a commonly used technique in data cleaning and preparation. By
leveraging these statistical methods, the inconsistencies in the data were addressed,
which helped in obtaining more reliable and consistent results.

Tab: 3.3 Statistical measures of input parameters

Fig: 3.1 Detection of outliers

The identification of outliers in the dataset was accomplished by computing the
z-score, z = (x − mean)/standard deviation, for every value. Any values more than 3
standard deviations above or below the mean were flagged for review and possible
correction. As depicted in Fig. 3.1, some outliers were identified that deviated
significantly from the mean values, which was an accepted occurrence since it is
impossible to obtain completely pure water in the natural environment. These outliers
might represent the presence of contaminants in the water sample, and their
identification was crucial in the assessment of the water's quality. Through this
process of outlier detection and analysis, potential sources of contamination were
identified, which helped in making informed decisions regarding the remediation of
the affected water sources.
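A minimal sketch of this outlier-detection and mean-substitution step, assuming the
parameters are columns of a pandas DataFrame as loaded above (the function name
is illustrative):

```python
import pandas as pd

def impute_and_flag_outliers(df: pd.DataFrame, threshold: float = 3.0) -> pd.DataFrame:
    """Fill missing values with the column mean and flag z-score outliers."""
    cleaned = df.copy()
    for col in cleaned.columns:
        mean, std = cleaned[col].mean(), cleaned[col].std()
        # Substitute missing values with the mean of the input values.
        cleaned[col] = cleaned[col].fillna(mean)
        # Standard z-score: z = (x - mean) / std.
        z = (cleaned[col] - mean) / std
        n_outliers = (z.abs() > threshold).sum()
        # Values beyond +/- 3 standard deviations are flagged for review.
        print(f"{col}: {n_outliers} value(s) beyond ±{threshold} std dev")
    return cleaned
```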

Following the removal of outliers and the correction of erroneous values, the next step
involved data normalization, a crucial technique used to bring the data values to a
common scale. This was achieved by calculating the mean and standard deviation of
each parameter and scaling the data accordingly. Normalization is essential when
dealing with data that have different ranges and units, as it facilitates effective
comparison and analysis. By utilizing the statistical measures of mean and standard
deviation, the data were normalized, yielding reliable and consistent results that could
be used for further analysis and evaluation.
Tab: 3.4 Data Normalization
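In code, this normalization step reduces to a one-line z-score transformation; a
minimal sketch, under the same pandas assumptions as above:

```python
import pandas as pd

def normalize(df: pd.DataFrame) -> pd.DataFrame:
    # z-score scaling: subtract each column's mean and divide by its
    # standard deviation, bringing parameters with different ranges
    # and units onto a common, comparable scale.
    return (df - df.mean()) / df.std()
```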

3.2. The conventional method of finding the water quality index

Upon completion of the data cleaning process for all 44 samples comprising 10
parameters, the next step involved assigning weights to each parameter based on its
relative importance and determining the permissible and desirable limits for each.
From these assigned weights, the relative weight Wi was computed for each parameter
as its assigned weight divided by the sum of all assigned weights, so that each
parameter's individual significance is reflected in the overall assessment of
groundwater quality. This approach enabled a more accurate and comprehensive
evaluation of the water quality parameters and helped in identifying potential sources
of contamination so that appropriate measures could be taken to address them.

Tab: 3.5 Assigned weight for input parameters

Parameter   Desirable   Highest Permissible   Assigned   Relative
            Limit       Limit                 Weight     Weight Wi

pH          6.5         8.5                   1          0.029411765
TDS         500         2000                  5          0.147058824
HCO3        200         600                   1          0.029411765
Cl          250         1000                  5          0.147058824
SO4         200         400                   5          0.147058824
Ca          75          200                   3          0.088235294
Mg          30          100                   3          0.088235294
Na          0           200                   5          0.147058824
K           0           12                    2          0.058823529
NO3         0           45                    4          0.117647059

Total assigned weight                         34

Using the relative weight and the highest permissible value of each parameter, the
observed sample values (in mg per liter) were converted into dimensionless sub-index
values:

Sub-index SI = (Ov × 100 × Wi) / Sn

where Ov = observed value of the ith parameter of the sample and Sn = standard
permissible value of the ith parameter (refer Tab: 3.5).
Adding the resultant sub-index values of all 10 parameters gives the water quality index.
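This calculation can be expressed compactly in code. The following sketch reproduces
the weighted-average WQI described above, with the permissible limits Sn and relative
weights Wi taken from Tab: 3.5 (the dictionary layout and function name are
illustrative):

```python
# Highest permissible limit (Sn) and relative weight (Wi) per parameter,
# from Tab: 3.5; Wi = assigned weight / 34.
STANDARDS = {  # parameter: (Sn, Wi)
    "pH":  (8.5, 1 / 34),  "TDS": (2000, 5 / 34), "HCO3": (600, 1 / 34),
    "Cl":  (1000, 5 / 34), "SO4": (400, 5 / 34),  "Ca":   (200, 3 / 34),
    "Mg":  (100, 3 / 34),  "Na":  (200, 5 / 34),  "K":    (12, 2 / 34),
    "NO3": (45, 4 / 34),
}

def wqi(sample: dict) -> float:
    """WQI = sum over parameters of the sub-index SI = Ov * 100 * Wi / Sn."""
    return sum(sample[p] * 100 * wi / sn for p, (sn, wi) in STANDARDS.items())

# Example call for one (hypothetical) well sample:
# wqi({"pH": 7.2, "TDS": 450, "HCO3": 180, "Cl": 210, "SO4": 90,
#      "Ca": 60, "Mg": 25, "Na": 110, "K": 6, "NO3": 12})
```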

Tab: 3.6 Calculated WQI for the pre-monsoon period of 2019-2020

Tab: 3.7 Water Quality Index for the post-monsoon period of 2019-2020

After verifying the results, the water quality was determined based on the calculated
index, and its suitability for drinking was assessed. To gain deeper insights, a correlation
plot was created and analyzed to see which parameters had a significant impact on the
water quality index. Based on this information, appropriate measures can be
implemented to improve the quality of the water and make it cleaner. It is important to
take the necessary actions to ensure that the water is safe for consumption.
Tab:3.8 WQI Range

WQI Range Wells

0-25 19 samples

26-50 18 samples

51-75 6 samples

76-100 NIL

>100 1 sample

Tab:3.9 WQI Indian Standards

>100 unsuitable for Drinking

76-100 Very Poor

51-75 Poor

26-50 Good

0-25 Excellent

4. Advancements in Water Quality Prediction Using ANN


In order to address complicated problems, soft computing techniques that leverage
principles drawn from the natural world were used. Conventional approaches are not
always competent to tackle key problems because they require complete information
and precise mathematical frameworks, while real-life data contain inconsistencies and
errors. To solve this issue, soft computing methods that can deal with such partial and
faulty data were implemented. Some examples of soft computing techniques are fuzzy
logic, neural networks, genetic algorithms, swarm intelligence, etc.

Fuzzy logic- Because it uses the notions of partial truth and degrees of membership,
fuzzy logic can deal with unclear and unreliable data. This makes it a powerful tool for
decision-making problems and for identifying trends where specific information is
necessary but not readily available.
Neural networks- This method is composed of multiple layers of interconnected nodes
that work together to generate the solution to a problem, analogous to the human
brain. It takes inputs in the input layer, processes them in one or more hidden layers,
and finally produces the result in the output layer. Typically, 70% of the dataset,
referred to as the training dataset, is used during training, where input-output
pairings are used to adjust the connections among the neurons; the primary goal is to
train the network to generalize the patterns. The remaining portion of the dataset is
divided into testing and validation datasets, where a fresh batch of data is presented
and the network's performance is evaluated based on its accuracy. Neural networks
are primarily employed in forecasting problems owing to their capacity to capture
patterns and correlations between variables.
Genetic algorithm- The theory of genetics and natural selection underlies the genetic
algorithm, a method of optimization. In this approach, an assortment of potential
solutions to a problem evolves in the form of chromosomes, which are strings of genes.
Selection (in which a set of parent chromosomes with desirable characteristics is
chosen), crossover (in which the genetic code of pairs of parents is exchanged to
produce offspring combining the most effective genes), and mutation (in which the
values of genes are modified at random) are used to evolve these solutions based on
how well they perform. This iterative process is carried out until a population of the
best solutions is reached; a minimal sketch is given below.
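The sketch below illustrates these three operators on real-valued chromosomes; it is a
generic toy example, with all names and parameter values chosen for illustration only:

```python
import random

def genetic_algorithm(fitness, n_genes, pop_size=50, generations=100,
                      mutation_rate=0.1):
    """Toy GA: truncation selection, one-point crossover, random mutation."""
    pop = [[random.random() for _ in range(n_genes)] for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: the fitter half of the population becomes the parents.
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            p1, p2 = random.sample(parents, 2)
            cut = random.randrange(1, n_genes)   # one-point crossover
            child = p1[:cut] + p2[cut:]
            if random.random() < mutation_rate:  # random mutation of one gene
                child[random.randrange(n_genes)] = random.random()
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

# Example: maximize the sum of genes (the best chromosome drifts toward all 1s).
print(genetic_algorithm(sum, n_genes=5))
```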
Swarm intelligence- This technique utilizes the notion of collective behavior observed
in ecological systems such as ants, bees, and fish, where simple agents following basic
rules cooperate. Extremely intricate problems can be solved through these synergistic
interactions. The approach is also capable of adjusting to changing situations and
responding via local interactions. It is mostly employed for problems such as routing,
scheduling, and classification that call for optimization.

4.1. Advantages of Machine Learning (ANN)

Accurate prediction: One of machine learning's biggest benefits is its capacity to
correctly forecast water quality measurements based on prior data. Regression
models, decision trees, and neural networks are a few examples of machine learning
algorithms that can analyze data patterns and predict outcomes with high levels of
accuracy. This could assist in identifying potential problems with water quality before
they become severe.
For instance, using information gathered from numerous sources, including satellite
imaging, water quality sensors, and weather data, machine learning models can
forecast the concentration of pollutants in water bodies. Machine learning algorithms
can use these datasets to analyze the possibility of water pollution and aid in stopping
the development of pollutants in the water.
Efficient Analysis: Monitoring water quality involves collecting an extensive volume
of information on multiple parameters, including pH, temperature, turbidity, dissolved
oxygen, and contaminants, from diverse sources, including rivers, lakes, and
groundwater. Data on water quality are often analyzed manually, which can be
laborious and prone to inaccuracy.
However, machine learning algorithms are an effective tool for monitoring water
quality because they can quickly and effectively analyze huge datasets. In large data
sets, machine learning algorithms can identify patterns and trends that could take
humans a long time or a lot of effort to notice. These algorithms can also determine
correlations between various aspects of water quality, which can be useful in locating
probable sources of contamination and forecasting problems with water quality.
Improved decision-making: Machine learning algorithms' insights can help decision-
makers find the best measures for maintaining or improving water quality. For
instance, information on the causes and sources of water contamination can be
provided by machine learning algorithms, which could help policymakers create
focused solutions to these problems. Machine learning may additionally provide
insight into the effects of various water quality management measures, such as the
efficiency of various treatment methods or the effects of changing land use on water
quality.
Cost-effective: Machine learning could help in improving resource utilization and
lowering the cost of managing and monitoring water quality. Machine learning, for
instance, can assist in lowering the need for expensive laboratory testing, which can be
time- and resource-intensive. Machine learning can also assist in lowering the need for
regular manual monitoring of water quality parameters, which can be expensive, by

offering accurate predictions and real-time monitoring.
The ability of ANNs to represent both linear and non-linear relationships is one of their
key advantages. This means that even when such relations are not evident or clear to
describe using conventional statistical approaches, ANNs may uncover complicated
patterns and relationships in data.
In order to find patterns and correlations between various groundwater quality
statistics, ANNs can be trained to analyze data from a variety of sources, such as water
quality monitoring stations, geological data, and land-use information. ANNs can offer
a simple technique to model groundwater quality and produce precise predictions
about the quality of groundwater at specific locations by recognizing these
relationships directly from data. An additional benefit of ANNs is their ability to
produce simulated values for locations of interest where measured data, though
needed for water quality estimates, are unavailable. This is especially helpful for
assessing the quality of groundwater because it can be difficult to get information from
all relevant areas. Models of the water quality in these areas can be made more
thorough and precise by simulating data using ANNs.
Additionally, ANNs learn from the data on their own and do not require relationships
to be specified in advance. This means that even when patterns are not obvious or
widely known, ANNs can nevertheless find hidden patterns and relationships in the
data. As a result, ANNs are able to recognize complex relationships between
groundwater quality metrics and make more precise predictions about that quality.
Finally, since ANNs store what they learn in their own weights rather than in a
database, their operation is robust to data loss. As a result, ANNs are particularly
helpful in situations where data collection is challenging or costly, as they may
continue to produce precise predictions even in the presence of missing or insufficient
data. Being able to represent both linear and non-linear relationships, generate
simulated values for areas of interest, learn automatically, and perform well even in
the absence of complete data are just a few of the benefits that ANNs offer for
evaluating groundwater quality. Utilizing these benefits, ANNs can assist water quality
professionals in making better decisions that better safeguard environmental and
human health.

4.2. Working of Different Models

The soft computing technique used for this study was the Artificial Neural Network
(ANN), which comprises numerous models. The processing was done in MATLAB, and
the neural networks chosen were Cascade forward backpropagation, Feed-forward,
Elman backpropagation, the NARX (Nonlinear AutoRegressive with eXogenous
inputs) neural network, and Self-Organizing Maps, for the purpose of training and
predicting the water quality index of the Nagapattinam district located in Tamil Nadu.
NARX neural network: This network is primarily used for nonlinear system
simulation. It has an input layer that houses the network's input; to fully capture the
dynamics of the system, the input for this network consists of the prior inputs and
outputs. The input data is transformed through the hidden layer, and the outcome,
which is the prediction for the current time step, is produced in the output layer. To
train the network, input-output pairs are fed into it, and backpropagation weight
adjustments are then made to optimize the network by minimizing the discrepancy
between estimated and actual output. After the completion of the training process, a
new dataset is introduced by feeding the relevant historical input-output pairs, and
the output is then predicted and reported. Time-series and signal-processing
problems are the principal applications of this network.

Fig:4.1 Architecture of the NARX neural network
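In this project, the NARX network was run in MATLAB's neural network toolbox.
Purely as an illustrative analogue, the defining idea — regressing the current output
on lagged inputs and lagged outputs — can be sketched in Python, with scikit-learn's
MLPRegressor standing in for the hidden layer (the array shapes and lag count are
assumptions):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def make_narx_features(u: np.ndarray, y: np.ndarray, lags: int = 2):
    """Build NARX regressors: the target at time t is predicted from the
    previous `lags` exogenous inputs u and the previous `lags` outputs y."""
    X, target = [], []
    for t in range(lags, len(y)):
        X.append(np.concatenate([u[t - lags:t].ravel(), y[t - lags:t]]))
        target.append(y[t])
    return np.array(X), np.array(target)

# u: (n_samples, n_parameters) hydro-chemical inputs; y: (n_samples,) WQI.
# X, t = make_narx_features(u, y)
# model = MLPRegressor(hidden_layer_sizes=(10,), max_iter=5000).fit(X, t)
```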

Elman backpropagation: This particular recurrent neural network has applications in
the area of soft computing and includes three interconnected types of layers: input,
hidden, and output. The quantity of input values in the input layer varies depending on
how intricate the problem being studied is; for tasks of high complexity, there may be
more than one input layer. The difficulty of the problem also determines the number
of hidden layers, which are fed information from the output of the input layer and the
previous hidden state. The output layer produces the system's final output after
receiving the output of the hidden layers. Additionally, this approach makes use of the
backpropagation technique, which minimizes the discrepancy between the expected
and actual output values by modifying the network weights. The backpropagation
methodology is used to eliminate errors as input and output values are fed into the
system during training. After the training phase has concluded, the system can be
tested by feeding it new input data to anticipate outputs, known as the "testing phase."

Fig:4.2 Architecture of the Elman backpropagation neural network

Cascade forward backpropagation: Since this operates as a feedforward network, data
flows only from input to output. The input layer receives the data, which is then
transmitted to the hidden layer. Each neuron in the system is connected using weights,
and because the output of one layer functions as the input of the next, the weights are
allocated appropriately, limiting overfitting. Each neuron calculates the weighted total
after receiving its inputs and then applies an activation function to generate the
output. Given the variety of activation functions available, including sigmoid, tanh,
and ReLU, among others, the activation function is selected depending on the
objective; for example, the sigmoid function is suited to binary classification. Each
projected output is compared to the actual output after its generation in order to
identify discrepancies. The weights of the connections are modified to minimize the
difference between the actual and predicted output as these errors propagate in a
backward pass. This iterative procedure is maintained until precise outcomes are
obtained.

Fig:4.3 Architecture of a Cascade Forward Backpropagation Neural Network

Feed-forward neural network- Also known as the multilayer perceptron, this network
is similar to the cascade forward backpropagation neural network and functions
comparably. It has an output layer, several hidden layers, and one input layer
comprising the input values. All of these layers are linked by weights, and each neuron
in the following layer uses the output from the prior layer as its input before an
activation function is applied to produce the output. First, the weights are initialized
with random values, and the forward pass procedure is carried out: the inputs are
provided, and the current weights are used to construct the output. The generated and
the actual outputs are compared, and the error is conveyed back by a backward pass,
just as in the cascade forward approach. The weights are then changed as necessary to
reduce inaccuracies until precise results are generated. The main distinction between a
feed-forward neural network and a cascade-forward neural network lies in the
learning algorithm and network design: feed-forward has a set number of layers
chosen before the training phase, unlike cascade forward, which has an adaptive
architecture. This method is faster and computationally cheaper than the cascade
forward method since it uses fewer computing resources.

Fig:4.4 Architecture of a Feedforward Neural Network
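To make the weighted-sum-plus-activation mechanics described above concrete, here
is a minimal forward pass in plain NumPy; the layer sizes and random weights are
illustrative, and in training a backward pass would adjust these weights to reduce the
error:

```python
import numpy as np

def forward(x, weights, biases):
    """Feed-forward pass: each layer computes a weighted sum of its
    inputs plus a bias and applies an activation function."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = np.tanh(W @ a + b)               # hidden layers: tanh activation
    return weights[-1] @ a + biases[-1]      # linear output layer (regression)

# Example: 10 input parameters -> 5 hidden neurons -> 1 output (WQI).
rng = np.random.default_rng(0)
weights = [rng.standard_normal((5, 10)), rng.standard_normal((1, 5))]
biases = [rng.standard_normal(5), rng.standard_normal(1)]
print(forward(rng.standard_normal(10), weights, biases))
```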

4.3. Results
The soft computing technique used here was the ANN (Artificial Neural Network),
owing to the advantages discussed in section 4.1. A total of five neural networks,
namely Cascade forward backpropagation, Feed-forward, Elman backpropagation, the
NARX (Nonlinear AutoRegressive with eXogenous inputs) neural network, and
Self-Organizing Maps, were compared, and the most suitable network was selected
based on accuracy and performance. The training function used was TRAINLM, and
the adaptation learning function used was LEARNGDA. The mean squared error
(MSE) was used as the performance function, and the training, validation, and testing
were done for 5, 10, and 30 layers in order to obtain the best results; the network was
trained only once since the sample size is small. As mentioned, 70% of the data was
used as the training dataset, 15% was used as the validation dataset, with which the
network was not familiar, and 15% was used for the testing phase. Out of the 44
output values (the WQI), only 5 values were made known to the network, so that it
could predict the remaining values, which helped in determining the level of precision
of the network. The three figures shown in the results for each model represent the
5-layer, 10-layer, and 30-layer configurations respectively. A Python analogue of this
experimental setup is sketched below.
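The experiments themselves were run in MATLAB (TRAINLM with LEARNGDA). A
rough Python analogue of the protocol — a 70/15/15 split with MSE compared across
hidden-layer sizes of 5, 10, and 30 — could look like the sketch below; scikit-learn's
solvers differ from Levenberg-Marquardt, so the numbers would not match the
MATLAB runs:

```python
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

def compare_hidden_sizes(X, y, sizes=(5, 10, 30), seed=0):
    """70 % training, 15 % validation, 15 % testing; MSE as the measure."""
    X_tr, X_rest, y_tr, y_rest = train_test_split(
        X, y, train_size=0.70, random_state=seed)
    X_val, X_te, y_val, y_te = train_test_split(
        X_rest, y_rest, test_size=0.50, random_state=seed)
    for n in sizes:
        model = MLPRegressor(hidden_layer_sizes=(n,), max_iter=5000,
                             random_state=seed).fit(X_tr, y_tr)
        print(f"{n:>2} neurons | "
              f"val MSE {mean_squared_error(y_val, model.predict(X_val)):.3f} | "
              f"test MSE {mean_squared_error(y_te, model.predict(X_te)):.3f}")
```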

NARX neural network
Upon executing the network, two plots were generated to understand the accuracy of
the network for the given input and output: the performance plot and the training
state plot. A comparison of 5, 10, and 30 layers for both pre-monsoon and
post-monsoon is depicted below:

Fig:4.5 Performance plots of NARX network for the pre-monsoon


The y-axis in this graph is the mean squared error (MSE), one of the measures used to
analyze the performance of the system, and the x-axis is epochs, the number of
iterations the network runs. The blue line represents the training dataset, which is
70% of the main dataset; the green line represents the validation dataset, which
comprises 15% of the data; the red line depicts the testing dataset, which uses the
remaining proportion of the data; and the dotted line marks the best value, showing
the point where the network attains the least error. As the network is trained, the
errors on the training dataset continuously decrease, so for the analysis of
performance the validation dataset, a new dataset the system isn't familiar with, is
examined. Excess training of the network might cause overfitting, where training
accuracy is high but validation accuracy is low; this happens as the network begins
memorizing the values instead of generalizing the pattern. Here it can be noted that
the least error while using 5, 10, and 30 layers is 212.3752, 131.7722, and 90.4549
respectively, attained after 2, 3, and 40 epochs respectively. When using 10 layers, the
error generated was far less than with 5 layers but slightly higher than with 30 layers,
which might be because the latter underwent more iterations.

Fig:4.6 Performance plots of NARX network for post-monsoon
Here the performance plots for different numbers of layers on the post-monsoon data
are compared. The least errors obtained while using 5, 10, and 30 layers are 102.8155,
8.6166, and 3660.5048 respectively. It is clear that 10 layers produced the least error
after only 24 epochs, and hence it was the most effective and precise arrangement.

Fig:4.7 Training state plots of NARX network for pre-monsoon


The training state plot consists of three main graphs, namely the gradient plot, the Mu
plot, and the validation check, which together show how well the network is
performing. The gradient plot makes it possible to know how well the network is being
trained and what changes could improve the accuracy. A higher gradient implies that
the training proceeds very fast and the generated solutions might not be precise,
whereas a low gradient shows that the network is being trained very slowly, in which
case increasing the learning rate might improve the process. The Mu plot deals with
the learning rate of the network, i.e., the speed of the training process. A higher
learning rate implies that the network is learning very quickly, which can initially
produce an overabundance of seemingly favorable solutions; with a lower learning
rate, on the other hand, the network can produce proper output. The validation check
is done on the validation dataset. From the above graph, it can be seen that the 5-layer
plot, with a gradient of 2086.7484 after the final iteration, had more accurate training
than the 10- and 30-layer plots, with their gradients of 9124.4382 and 11847.3048
respectively, as its gradient curve was also comparatively smooth. The 10-layer plot
had the highest Mu overall, indicating that the network was learning rapidly and
generating quicker results compared to the other two plots.

Fig:4.8 Training state plots of NARX network for post-monsoon


In the case of the post-monsoon groundwater data, the gradient curve is very irregular
across the three networks, which implies that the training wasn't smooth. However, a
decreasing trend is roughly visible in the 10-layer network, and the 10-layer plot can
be seen to reach a minimum gradient of 7.5361 at the end of the training phase. The
network with 30 layers had the fastest learning rate, updating its weights and biases
more quickly than the other two tested networks.
Elman backpropagation neural network
In this network, two governing plots were generated as well: the training state plot
and the performance plot. The comparison between the plots for different layers for
both pre-monsoon and post-monsoon is illustrated below:

Fig:4.9 Performance plots of Elman network for pre-monsoon


Similar to the NARX neural network, the y-axis in this plot represents the mean
squared error, which is the measure of performance, and the x-axis represents the
number of times the data was passed through the whole network, called epochs.
For the pre-monsoon data, the least MSE for 5, 10, and 30 layers was 33.6709,
80.4438, and 1923.9966 respectively. When using 10 layers, the errors were
significantly lower than with 30 layers, which might be due to the small sample size
and the excess layers in the 30-layer network. The 5-layer plot performed better than
the 10-layer plot, with fewer iterations in this case.

Fig:4.10 Performance plots of Elman network for post-monsoon
In the post-monsoon dataset, it is clear that the 5-layer network, with an MSE of only
10.6312, performed better than the 10- and 30-layer networks, whose MSE values
were 178.0775 and 215.572 at 11 and 7 epochs respectively. From the performance
plots of both pre-monsoon and post-monsoon, it can be noted that the network with 5
layers was the best, with the least errors.

Fig: 4.11 Training state plots of Elman network for pre-monsoon


For the pre-monsoon dataset, it can be observed that the smoothest training was that
of the network with 30 layers, which also has the least gradient, implying that the
training process was precise. The 5- and 10-layer plots had a similar trend in the Mu
graph, indicating that their rates of learning did not vary much from each other,
whereas the Mu graph of the 30-layer plot decreased in a linear manner.

Fig:4.12 Training state plots of Elman network for post-monsoon

Cascade forward backpropagation neural network

This type of network produces three plots in total: the performance plot, the training
state plot, and the regression plot. The regression plot is an additional plot here, and it
is a useful tool for properly understanding the accuracy of the network, as it facilitates
predicting the output as well, unlike the Elman and NARX neural networks. This was
the network finally selected from among all the candidates due to its accuracy. The
comparison of 5, 10, and 30 layers is depicted below for all types of plots for the
pre-monsoon and post-monsoon datasets:

Fig:4.13 Performance plots of Cascade network for pre-monsoon


Just like the previous performance plots, the network is analyzed based on the
validation dataset (green line), as the errors on the training dataset, depicted by the
blue line, will continue plummeting with more training. This might be ideal for the
training data but might cause overfitting, as the network memorizes the data and
lowers the validation accuracy rather than generalizing the pattern, which is its main
purpose. Here it can be observed that the best performance on the validation dataset
for the 5-, 10-, and 30-layer networks was 17.634, 49.0322, and 4.8084 at 4, 3, and
14 epochs respectively. In any case, if the network is trained beyond the epochs given,
the MSE starts increasing for the validation dataset as a result of overfitting. From the
given plots it can be identified that the 30-layer network had the least error after 14
iterations, making it more precise than the others. However, it should be noted that
the performance plot is a vaguer measure of accuracy than the regression plot.

Fig:4.14 Performance plots of Cascade network for post-monsoon


When analyzing post-monsoon data, it can be recognized that the 30-layer network
had an MSE value of 143.7315, which is higher than the 5- and 10-layer networks,
whose best performances on the validation dataset were 4.2066 and 88.7382
respectively, making the network with 5 layers the one with the least MSE.

Fig:4.15 Training state plots of Cascade network for pre-monsoon
The declining trend of the plots indicates that the speed of the training of the network
gradually decreased with time. This means that the weights of the network had to be
updated constantly to reduce the errors at the start of the training process, and as the
network was trained, the errors reduced, so only minute updates were required by the
end. From the generated plots, it can be observed that the training process of the
10-layer plot was the smoothest, and the training error was the least for the 5-layer
plot, with a gradient of only 5.1162 by the end of the training. From the Mu plot, which
represents the learning rate, it can be noted that the 5- and 10-layer networks had a
similar trend as well as the same learning rate value of 1 towards the end, implying
that the weights were being updated and the response was generated much quicker
towards the end of the process. The system with 30 layers was learning more slowly
than the other two.

Fig:4.16 Training state plots of Cascade network for post-monsoon


The gradient values for the 5-, 10-, and 30-layer networks were 2.6013, 40.5493, and 15.4323 respectively. The gradient plot of the 5-layer network was the best: towards the end, the weights were no longer updated frequently because the errors were small, implying that the network had been trained properly. The training of the 30-layer network was smooth but still showed some error at the end, and the 10-layer network had the largest error at the end of the training process.

Fig:4.17 Regression plots of Cascade network for pre-monsoon
Through the regression plot, the variation between the actual and predicted outputs can be determined, since this network is able to predict the outputs. For precision purposes, regression values greater than 0.8 are considered suitable and accurate. In the regression plot, the x-axis is the sample number and the y-axis represents the WQI. The dotted line in the background of each plot represents the actual outputs and therefore corresponds to a regression value of 1, i.e., a perfect fit. Each figure contains four plots: the blue line shows the performance and prediction for the training dataset; the green line for the validation dataset, whose values are unfamiliar to the network; the red line the performance on the testing dataset; and the black line the overall performance. Regression values can range from -1 to 1 and should be close to 1, indicating that the variation between the predicted and actual outputs, and hence the error, is minimal. The results on the pre-monsoon data for all three networks were highly accurate, as all values were greater than 0.8 and the variations between the plots were very small.
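The regression value itself is simply the correlation between targets and predictions. A minimal sketch of how such an R value can be computed and screened against the 0.8 threshold, with hypothetical numbers:

```python
# Sketch: Pearson R between actual and predicted WQI, analogous to the R
# value reported on the regression plots; the sample values are made up.
import numpy as np

actual = np.array([52.1, 74.3, 61.0, 88.5, 45.2, 67.8])
predicted = np.array([50.8, 76.0, 59.4, 90.1, 47.0, 66.2])

r = np.corrcoef(actual, predicted)[0, 1]
print(f"R = {r:.3f} ->", "acceptable" if r > 0.8 else "poor")
```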

Fig:4.18 Regression plots of Cascade model for post-monsoon


For the post-monsoon data as well, the performance of all three networks on the training, validation, and testing datasets was greater than 0.8 and very close to 1, indicating near-perfect agreement between the predicted and actual outputs.

Feedforward Neural Network
This type of network also facilitates prediction of the outputs, so the precision of the model can be examined in a much more concrete manner. Networks with 5, 10, and 30 layers were compared against each other. The plots generated were the performance plot, the training state plot, and the regression plot. The comparison of each plot for the pre-monsoon and post-monsoon datasets is depicted below:

Fig:4.19 Performance plots of Feed-forward network for pre-monsoon


The best performance, with the least MSE on the validation dataset, was achieved by the network with 5 layers, whose MSE at the best validation point was 17.6762.

Fig:4.20 Performance plots of Feed-forward network for post-monsoon


Upon analysis of the post-monsoon data, the network with 10 layers had the least validation error after 3 epochs, making it the most precise of the three.

Fig:4.21 Training state plots of feed-forward network for pre-monsoon


For the pre-monsoon data, the training error of the 30-layer network was the smallest, indicating that the network was trained accurately; the adjustments needed at the end of the training process were smaller than for the other networks, and its rate of learning was the lowest at the end.

Fig:4.22 Training state plots of feed-forward network for post-monsoon
For the post-monsoon data as well, the network with the least training error was the 30-layer one, whereas the learning rate of the 5-layer network was the lowest towards the end.

Fig:4.23 Regression plots of Feed-forward network for pre-monsoon


From the regression plots of the pre-monsoon data, the networks with 5 and 10 layers were highly accurate, as their values for each dataset were above 0.8, whereas the 30-layer network had regression values below 0.8 for all datasets, indicating large variation between the predicted and actual outputs. This could be due to the small sample size and an excessive number of layers.

Fig:4.24 Regression plots of Feed-forward network for post-monsoon


For the post-monsoon dataset, the 10-layer network performed best, with minimum deviation of the predicted values from the actual outputs. Its regression values were greater than 0.8 for each dataset, whereas the 5- and 30-layer networks were far less precise, indicating a large difference between the predicted and actual outputs.

5. Variation of Ions in water samples
In places where access to clean water is a concern, the study of the hydrogeochemical variation of ions in water samples has grown in significance. This variation can be analyzed with the Hill-Piper diagram, which depicts the dominant ions in the water samples. Another method is to utilize Geographic Information Systems (GIS) to spatially analyze and map the variation of ions in water samples. Self-organizing maps (SOM) and artificial neural network (ANN) models have also been used to forecast the variation of ions in water samples. These models can help comprehend the complex relationships between various ions and their sources, supporting water resource management and conservation efforts.

5.1. Hydrogeochemical characterization of a water sample using the Hill-Piper diagram

Fig: 5.1 Illustration of Hill-Piper Diagram

The Hill-Piper diagram, often also known as the Piper trilinear diagram or the Piper plot, is a graphical representation of water chemistry data that illustrates the relative proportions of the major chemical components, the cations and anions, in a water sample. It builds on the trilinear plot proposed by R.A. Hill (1940) and was formalized by A.M. Piper (1944), and it is frequently used in hydrogeology, environmental science, and water resource management.
The diagram is made up of two triangular fields, one showing the relative proportions of the major cations and the other of the major anions, which project into a central diamond that summarizes the overall composition of the sample. The major cations and anions, such as sodium (Na+), potassium (K+), calcium (Ca2+), magnesium (Mg2+), chloride (Cl-), sulfate (SO42-), and bicarbonate (HCO3-), are represented at the triangles' vertices.
● Magnesium bicarbonate + Mixed + Calcium chloride = alkaline earths exceed alkalis
● Sodium chloride + Mixed + Sodium bicarbonate = alkalis exceed alkaline earths
● Magnesium bicarbonate + Sodium bicarbonate + Mixed = weak acids exceed strong acids
● Calcium chloride + Sodium chloride + Mixed = strong acids exceed weak acids
These combinations are not chemical reactions but groupings of the diamond's subfields into hydrochemical facies, each marking where one group of ions dominates. In the first grouping, alkaline earths exceed alkalis; in the second, alkalis exceed alkaline earths. The third corresponds to weak acids exceeding strong acids, and the fourth to strong acids exceeding weak acids. These facies summarize the overall chemistry of the water and can be used to classify water types based on their chemical composition.
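To make the underlying arithmetic concrete: a Piper plot positions each sample by its ionic percentages in milliequivalents per litre. The sketch below shows that conversion; the concentrations are hypothetical, while the equivalent weights are standard values.

```python
# Sketch: mg/L -> meq/L -> percentage normalisation behind a Piper plot.
# The sample concentrations below are hypothetical.
EQ_WEIGHT = {"Ca": 20.04, "Mg": 12.15, "Na": 22.99, "K": 39.10,
             "Cl": 35.45, "SO4": 48.03, "HCO3": 61.02}

sample_mg_l = {"Ca": 48, "Mg": 22, "Na": 160, "K": 6,
               "Cl": 210, "SO4": 55, "HCO3": 240}

meq = {ion: conc / EQ_WEIGHT[ion] for ion, conc in sample_mg_l.items()}

cations = {i: meq[i] for i in ("Ca", "Mg", "Na", "K")}
anions = {i: meq[i] for i in ("Cl", "SO4", "HCO3")}

cat_pct = {i: 100 * v / sum(cations.values()) for i, v in cations.items()}
an_pct = {i: 100 * v / sum(anions.values()) for i, v in anions.items()}

# these percentages position the sample within the two triangles,
# and their projection determines its field in the central diamond
print(cat_pct)
print(an_pct)
```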

Fig: 5.2 Hill piper Diagram for Pre-monsoon 2019-2020
Fig: 5.3 Hill piper Diagram for Post-monsoon 2019-2020

These diagrams demonstrate that sodium and potassium ions (Na+ + K+) dominate over the other cations in the water samples analyzed; that is, the combined concentration of these two elements in the water is higher than that of any other cation.
The central diamond of the diagram identifies the predominant water type. According to it, all of the analyzed water samples were of the sodium chloride type, indicating that their concentrations of sodium and chloride ions were higher than those of the other cations and anions.
These results collectively imply that the sodium and chloride concentrations in the analyzed water samples are quite high and that the relative proportions of the various cations and anions can change with the particular chemical makeup of the water. This information can be helpful in spotting possible water quality problems and developing effective management plans. It is crucial to remember, however, that these results apply only to the water samples that were examined and might not apply to other water sources. Additional study and analysis may be required to confirm these results and their wider implications for the management of water resources.

5.2. Differentiation based on a variation of ions (GIS)


Researchers follow various methods to evaluate water quality, and with advancements in technology several advanced techniques can be used to gather information about the variability of parameters in the data. To address this challenge, the QGIS software was used to analyze how the ions in water samples collected at different times vary. Specifically, results for the pre-monsoon and post-monsoon seasons of 2019-2020 were collected, and by comparing the data from both periods the ion levels were found to fluctuate over time. The primary focus was on analyzing the spatial and temporal variations of ions: spatial variation refers to changes in ion levels across different locations, whereas temporal variation refers to changes in ion levels over time. This was the main area of interest to be investigated.
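For readers unfamiliar with what the GIS interpolation step does internally, the following is a minimal inverse-distance-weighting sketch of the kind of point-to-surface estimation such maps rest on. The well coordinates and concentrations are hypothetical, and QGIS provides this through its built-in interpolation tools rather than hand-written code.

```python
# Minimal inverse-distance-weighting (IDW) sketch; coordinates are in
# degrees for simplicity, and all values below are hypothetical.
import numpy as np

def idw(xy_known, values, xy_query, power=2.0):
    d = np.linalg.norm(xy_known - xy_query, axis=1)
    if np.any(d == 0):                 # query falls exactly on a sample point
        return values[np.argmin(d)]
    w = 1.0 / d ** power               # nearer wells get larger weights
    return np.sum(w * values) / np.sum(w)

wells = np.array([[79.80, 10.77], [79.84, 10.75], [79.76, 10.79]])  # lon, lat
calcium = np.array([62.0, 85.0, 40.0])                              # mg/L
print(idw(wells, calcium, np.array([79.81, 10.76])))
```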

Fig: 5.4 Spatial variation of calcium Ions

Fig: 5.5 Temporal Variation of WQI

The generated map focuses on the spatial variation of calcium ions, and it enables the identification of areas where the permissible limit is being exceeded, as shown by the legend in the figure. Yellow, pink, and red colors denote regions where ion levels are beyond the desirable limit and possibly nearing the permissible limit. These observations can serve as a starting point for further research into the underlying causes of the findings. Limits can also be set for other parameters, and their spatial variation analyzed by using distinct colors on the map in a similar manner.

5.3. Differentiation using machine model (SOM)

Fig: 5.6 Architecture of SOM
Fig: 5.7 Output of SOM

A self-organizing map (SOM) is an artificial neural network that learns patterns and relationships in data on its own. Unlike supervised neural networks, which need labeled data to train from, self-organizing networks can learn the underlying structure of the data without labels. The primary principle of a self-organizing network is to modify the weights of its neurons in search of the best possible representation of the input data. The network consists of a layer of input neurons and a layer of output neurons: the input neurons take in the raw data, while the output neurons represent different groups of related data points. During training, a set of input samples is presented to the network. The weights of the output-layer neurons are initialized at random and then adjusted according to how close they are to the input samples; neurons near the best-matching neuron are updated more strongly than neurons farther away.
As a result of this process, the neurons eventually organize into clusters that stand for different collections of related input samples. These clusters can then be used to group incoming data points according to how closely they resemble the existing clusters.
Self-organizing networks are frequently used for unsupervised learning tasks such as clustering, dimensionality reduction, and feature extraction, and they have been applied to a variety of tasks, including voice and image recognition and anomaly detection.
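The update rule described above can be written down compactly. The following is a minimal NumPy sketch of SOM training, assuming a small rectangular grid and illustrative decay schedules; it is not the configuration used in this study.

```python
# Minimal SOM training sketch: each sample pulls its best-matching unit
# (BMU) and that unit's grid neighbours toward itself, with both the
# learning rate and the neighbourhood width decaying over time.
import numpy as np

def train_som(data, grid=(4, 4), epochs=100, lr0=0.5, sigma0=1.5):
    rng = np.random.default_rng(0)
    w = rng.random((grid[0] * grid[1], data.shape[1]))   # random init
    gy, gx = np.divmod(np.arange(len(w)), grid[1])       # neuron grid coords
    for t in range(epochs):
        lr = lr0 * np.exp(-t / epochs)                   # decaying rate
        sigma = sigma0 * np.exp(-t / epochs)             # shrinking radius
        for x in data[rng.permutation(len(data))]:
            bmu = np.argmin(np.linalg.norm(w - x, axis=1))
            # neighbours of the BMU are pulled more strongly toward x
            gdist2 = (gy - gy[bmu]) ** 2 + (gx - gx[bmu]) ** 2
            h = np.exp(-gdist2 / (2 * sigma ** 2))
            w += lr * h[:, None] * (x - w)
    return w

rng = np.random.default_rng(1)
samples = rng.random((45, 10))   # placeholder: 45 samples, 10 parameters
weights = train_som(samples)
```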
The data used for this research is from 2019-2020, and it can be compared with future datasets to identify changes over time and variations of parameters across locations. The following results are for pre-monsoon and post-monsoon respectively.

Fig: 5.8 Allocation of Neurons for Input samples


The Sample Hits plot is a visualization tool for evaluating the performance of a Self-Organizing Map (SOM): it displays the location of each neuron on a two-dimensional grid along with the number of observations assigned to it. In both the pre-monsoon and the post-monsoon plots, no neuron has more than two observations associated with it, which shows that the distribution of observations is fairly balanced across the neurons.
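Counting sample hits amounts to finding each observation's best-matching unit and tallying them; a short sketch, reusing `train_som` from the previous sketch and the same placeholder data:

```python
# Sketch: tally how many observations map to each neuron ("sample hits").
# Relies on train_som from the sketch above; the data is a placeholder.
import numpy as np

def sample_hits(weights, data):
    bmus = np.array([np.argmin(np.linalg.norm(weights - x, axis=1))
                     for x in data])
    return np.bincount(bmus, minlength=len(weights))

rng = np.random.default_rng(1)
samples = rng.random((45, 10))
hits = sample_hits(train_som(samples), samples)
print(hits)   # a balanced map has no single neuron dominating the counts
```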

Fig: 5.9 Neuron to neuron variation for each parameter

The Weight Planes plot displays, on a two-dimensional grid, the weights associated with each input parameter: TDS, pH, HCO3, Cl, SO4, Ca, Mg, Na, K, and NO3. Neurons with similar weights are depicted in light yellow, while those with dissimilar weights are shown in darker shades of red and orange; similar weights indicate a high correlation between the neurons. In the pre-monsoon Weight Planes plot, the pH input parameter has similar weights across the grid, while in the post-monsoon plot the weights vary significantly. This plot is useful for comparing the weights of the different input parameters to identify seasonal and temporal variations.

Fig: 5.10 Distances between Neurons
The Neighboring Weight Distances plot is a visualization tool that shows the separations between neighboring nodes in a SOM. It comprises nodes, shown as blue circles, and the red lines that connect them. The hexagons that the lines cross take on different hues: lighter hues denote nodes that are closer together and more likely to influence one another, while darker hues denote nodes that are farther apart and less likely to affect one another. While the pre-monsoon plot shows irregularities and greater distances between nodes, the weights in the post-monsoon figure are more uniformly distributed and closer together. In general, the Self-Organizing Map methodology works well for assessing the quality of groundwater.

The presented data makes it easy to comprehend how the input weights change between the pre-monsoon and post-monsoon periods, as well as the spatial distribution of, and distances between, samples. These findings are essential for identifying and tracking water contamination, especially in light of the present global water crisis. Monitoring changes in the distribution and composition of water samples over time and place can help focus future research and lead to workable remedies that reduce the effects of water-related disasters. These results advance our knowledge of the intricate dynamics governing water quality and availability and have substantial implications for global sustainability and resilience in the face of environmental challenges.

5.4. Variable correlation Diagram

Fig: 5.11 Correlation Plot


By analyzing the correlation plot, it is possible to determine the sensitivity of each
element toward the water quality index. The values on the plot, which range from -1 to
1, provide insights into the degree of correlation between variables. A correlation
coefficient of 1 indicates a direct and proportional relationship between the variables.

Based on the results of the analysis, TDS, chloride, and sodium exhibit correlation coefficients above 0.94, indicating a strong positive correlation with the water quality index. This implies that changes in these variables are likely to have a significant impact on the overall quality of the water. Furthermore, the correlation plot can be used to identify potential outliers and relationships between variables that may not be immediately apparent through other means of data analysis.
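Such a correlation matrix is straightforward to reproduce. A minimal pandas sketch, assuming a CSV file and column names matching the ten parameters used here (both the file name and column names are hypothetical):

```python
# Sketch: correlation of each parameter with the WQI using pandas.
# "groundwater_2019_20.csv" and the column names are assumed, not real files.
import pandas as pd

df = pd.read_csv("groundwater_2019_20.csv")
cols = ["TDS", "pH", "HCO3", "Cl", "SO4", "Ca", "Mg", "Na", "K", "NO3", "WQI"]
corr = df[cols].corr()

# parameters with |r| close to 1 against WQI drive the index the most
print(corr["WQI"].drop("WQI").sort_values(ascending=False))
```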

6. Schedule of tasks and milestones


The tasks were listed at the outset, along with a deadline for finishing each one, and every task was completed on schedule or earlier. The total time for the completion of this project, including the report, was 32 weeks. The time plan for the project work follows:

Fig: 6.1 Time plan

7. Conclusion
Artificial Neural Networks are computer-based models that imitate how biological neurons in the human brain work. These models can be trained on a set of input parameters to forecast an output or categorize an item. In the context of water quality assessment, ANN models have been used to predict the Water Quality Index (WQI), a single number that measures the overall quality of a water sample based on numerous physicochemical factors. One advantage of employing ANN models for water quality assessment is the capacity to conserve resources. Traditionally, measuring the numerous physicochemical properties of water samples requires performing multiple experiments, a procedure that can be costly and time-consuming. In contrast, ANN models can be trained on a relatively small dataset and then used to predict WQI values for new water samples without conducting additional tests.
In this study, ANN models were trained on a set of 10 parameters specifically chosen to determine the WQI in the Indian subcontinent (Nagapattinam). These parameters were selected based on their relevance to water quality and their availability in water quality datasets. The models demonstrated the ability to predict WQI values when WHO-defined parameters were used, indicating potential for application elsewhere. However, it is important to note that variations in network parameters can affect the results: a larger training dataset with more members may yield better regression values, together with improved learning-rate and gradient behavior. In addition, the comparisons in this study showed the 10-layer network model achieving the highest regression when predicting WQI values. Despite these considerations, the use of ANN models in water quality assessment can simplify the complexity associated with interpreting the WQI, making the assessment process more efficient and cost-effective, particularly in regions where access to water quality testing equipment is limited.
8. Project demonstration
This study makes intensive use of soft computing techniques: several ANN models, namely Cascade forward backpropagation, feed-forward, Elman backpropagation, NARX neural networks, and Self-Organizing Maps, were used for training and predicting the water quality index of the Nagapattinam district in Tamil Nadu. Other tools included GIS for mapping spatial and temporal variations, a decision tree, a correlation plot for determining the impact of each parameter on the WQI, cleaning and normalization of the data, and Python code for classifying the samples from the machine model.

