
AI4ALL Special Interest Group: AI and Sustainability - Group 1

Utilizing Machine Learning Algorithms to Model Water Quality Based on Indian Environmental Data
By Isita Talukdar, Aroshi Ghosh, Hiya Shah, and Ayush Raj

With Contributions from Anakha Ganesh, Audrey Kim, Archika Dogra, and Hannah Zhou

Mentored By: Sam Shah, Software Engineer

Abstract

Poor water quality is a critically important issue that affects many aspects of life. Clean water is
needed for humans to drink and for those involved in aquaculture to raise livestock, and it also serves as a
metric of environmental pollution and degradation. The ability to predict drops in water quality can enable
farmers and other consumers of that water to mitigate outbreaks of disease and protect themselves, both in
health and from financial loss. Certain environmental factors can be utilized as indicators of water
quality: dissolved oxygen, pH, conductivity, biochemical oxygen demand, nitrate amount, and total
Coliform. These are the results of many complex biological processes of the living organisms in the water
and the external environment around the water. Machine Learning has shown considerable promise in the
area of classification and prediction tasks. Thus, this project experiments with multiple Machine
Learning-based techniques to create a model that can accurately predict water quality based on
biochemical and environmental factors. The experimental results show a high accuracy in predicting
water quality based on the algorithms with real datasets.

Introduction

The goal of our small group’s research was to analyze Indian water quality data and develop a machine
learning model that predicts water quality in India efficiently and accurately. First, we chose the
public “Indian Water Quality Data” dataset, which contains historical data on pollution levels from various
monitoring stations across India. This project explored and cleaned the data; tested and optimized linear
regression, regression tree, and Bayesian ridge regression models; and evaluated them for error and
accuracy. According to the WHO, at least 2 billion people globally use a contaminated drinking water source.
Contaminated water and poor sanitation are linked to transmission of diseases such as cholera, diarrhoea,
dysentery, hepatitis A, typhoid, and polio. Absent, inadequate, or inappropriately managed water and
sanitation services expose individuals to preventable health risks. Water pollution is a major crisis in
India, with around 70 percent of its water contaminated. This study uses machine learning techniques to
run analyses on Indian water quality. We seek to develop and evaluate the machine learning model with
the greatest efficacy at predicting the Water Quality Index, a continuous value, and other constituent values
based on environmental factors.

Background Research

Almost 70 percent of India’s surface water resources and groundwater reserves are polluted by organic,
biological, and inorganic contaminants. In some areas, water sources rendered unsafe for consumption,
irrigation, and industrial needs contribute to the growing water scarcity. The Central Pollution Control
Board of India (CPCB) identified extremely polluted areas of 18 major Indian rivers. They found that
several sectors contributed to severe water contamination, including urban, industrial, and agricultural
activities. These include residues of household effluents, pesticides, and fertilizers that may allow
algae to proliferate in drinking water sources. This lack of water sanitation and hygiene contributes to more
than 0.4 million Indian deaths annually. A contributor to the overall lack of awareness of water pollution
in India is inadequate infrastructure, including deteriorating monitoring stations that only infrequently test
for bacterial risks and discharge into water bodies. In our study, we used a dataset containing water
quality constituent data from monitoring stations throughout India. Using measures of dissolved oxygen,
pH, temperature, conductivity, nitrate, and biochemical oxygen demand, we were able to develop machine
learning models to predict the water quality index.

Our Data

This section outlines the dataset that was chosen for analysis as well as other datasets that were
considered in the project.

Final Dataset

The dataset consists of information about properties of water from stations all over India. The temperature
of the water is provided in Celsius. D.O. (i.e., Dissolved Oxygen), which represents the level of free,
non-compound oxygen contained in the water, is also recorded. This can be used as a measure of water
quality, as more oxygen allows species to thrive in the water. The pH column indicates how acidic or basic
the water is. The pH controls the solubility and biological availability of impurities such as
nutrients and heavy metals.

With heavy metals, toxicity is measured by their solubility. Conductivity measures the capability of
electric current to pass through water. Biochemical Oxygen Demand (BOD) is a metric for the dissolved
oxygen consumed by aerobic microorganisms when organic matter is decomposed. It is an index for
evaluating the impact of wastewater discharged on the area. High BOD values indicate a high amount of
available organic compounds for bacteria that consume oxygen.

Nitrate concentration is also measured, as higher levels of nitrate are harmful to underwater life. Total
coliform, which indicates bacteria found in the digestive tracts of animals, is also analyzed. Most
coliforms do not cause disease, but some, such as certain strains of E. coli, can cause serious harm.
Together, the following six variables are used as the features (inputs) that influence Water Quality (the
output): Dissolved Oxygen (DO), pH, Conductivity, Biochemical Oxygen Demand (BOD), Nitrate
Amount (NA), and Total Coliform (CO).

Water Quality is calculated in a separate column by the following equation, applied after scaling the above
input variables (from Al-Akhir, 2020):

wqi = w₁·ph + w₂·do + w₃·bdo + w₄·ec + w₅·na + w₆·co

(with the weights w₁, …, w₆ as given in Al-Akhir, 2020)

Here ‘ph’ refers to pH values, ‘do’ to Dissolved Oxygen, ‘bdo’ to Biochemical Oxygen Demand, ‘ec’ to
electrical conductivity, ‘na’ to Nitrate content, and ‘co’ to Total Coliform; ‘wqi’ is the water quality
index. Although the above equation could be used to calculate water quality directly, our models can find
hidden relationships between the variables that are not captured by the set equation. For this reason, the
models are trained on the non-scaled values.
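For illustration, a weighted-sum index of this shape can be computed as follows. The weights below are placeholders chosen only so that they sum to 1; they are not the published values from Al-Akhir (2020).

```python
# Illustrative sub-index weights: placeholders only, NOT the published values.
WEIGHTS = {"ph": 0.17, "do": 0.28, "bdo": 0.23, "ec": 0.01, "na": 0.03, "co": 0.28}

def wqi(scaled: dict) -> float:
    """Weighted sum of scaled constituent sub-indices (each on a 0-100 scale)."""
    return sum(WEIGHTS[key] * scaled[key] for key in WEIGHTS)

# Hypothetical scaled readings for one monitoring-station sample.
sample = {"ph": 85.0, "do": 90.0, "bdo": 60.0, "ec": 70.0, "na": 95.0, "co": 50.0}
print(round(wqi(sample), 1))  # 71.0
```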

The dataset was found on Kaggle
(https://www.kaggle.com/anbarivan/indian-water-quality-analysis-and-prediction/comments). The
paper with information about the equations was also found at that link (Al-Akhir, 2020).

Calculating WQI

The authors calculated and plotted the overall Water Quality Index (WQI) score for each year.

Plots of Constituents

[Plots of the individual constituents appear in the original notebook.]

Preprocessing Steps

Using the equation described above, a water quality index column was added to the data, which was stored
in a Pandas DataFrame. All rows with null values were dropped for modeling. After this step, 1,495 rows,
or data points, remained for analysis.
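This cleaning step can be sketched with pandas as follows; the column names and values here are simplified, hypothetical stand-ins for the actual Kaggle headers.

```python
import pandas as pd

# Toy frame standing in for the station data; None marks missing readings.
df = pd.DataFrame({
    "DO": [6.7, None, 5.1, 6.0],
    "PH": [7.2, 7.5, None, 7.0],
    "CONDUCTIVITY": [203.0, 189.0, 250.0, 210.0],
    "BOD": [1.1, 2.4, 3.0, 1.5],
    "NITRATE": [0.3, 0.2, 0.5, 0.4],
    "TOTAL_COLIFORM": [120.0, 90.0, 300.0, None],
})

# Drop every row containing a null value, as done before modeling.
clean = df.dropna().reset_index(drop=True)
print(len(clean))  # 1 -- only rows with no missing readings survive
```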

Past Alternatives

Originally, the project planned to utilize the LeakDB dataset, created by the KIOS Research and Innovation
Center of Excellence based in Cyprus. The LeakDB dataset depicts artificially created but realistic data on
leakages in a variety of water distribution networks under varying conditions. However, we were unable to
access the dataset due to software constraints.

The LeakDB dataset can be found here: https://github.com/KIOS-Research/LeakDB (Vrachimis, 2018)

After shifting our focus to water quality, a dataset on New York water quality was also explored in
addition to the final India dataset. As the India dataset had a more diverse range of features, it was chosen
for the analysis. Nevertheless, the preprocessing of the New York dataset is available in our GitHub repository.

The New York dataset can be found here:
https://www.kaggle.com/new-york-city/nyc-drinking-water-quality-distribution-monitoring?select=Data_Dictionary_DWQO_Distribution_Chem-Micro_101817.xls (York, 2019)

Models Evaluated

The following models were developed to find a relationship between the features (pH, BOD, conductivity,
etc.) and water quality (WQI) based on the data. The predicted variable, or y-variable, is always a
continuous WQI value, and the features/inputs, or x-variables, are always the six environmental factors
outlined in previous sections.

Linear Regression

To apply linear regression to a dependent variable (in our case, water quality) 𝑦 and a set of independent
variables (in our case, pH, conductivity, etc.) 𝐱 = (𝑥₁, …, 𝑥ᵣ), where 𝑟 is the number of predictors, it is
assumed that there is a linear relationship between 𝑦 and 𝐱: 𝑦 = 𝛽₀ + 𝛽₁𝑥₁ + ⋯ + 𝛽ᵣ𝑥ᵣ + 𝜀. This equation is
the regression equation; 𝛽₀, 𝛽₁, …, 𝛽ᵣ are the regression coefficients, and 𝜀 is the random error.

Linear regression determines the estimators of the regression coefficients, the predicted weights, which
are written as 𝑏₀, 𝑏₁, …, 𝑏ᵣ. They define the regression function 𝑓(𝐱) = 𝑏₀ + 𝑏₁𝑥₁ + ⋯ + 𝑏ᵣ𝑥ᵣ, which should
capture the relationship between the inputs and the output as accurately as possible.

The predicted response 𝑓(𝐱ᵢ) for each observation should be as close as possible to the corresponding
actual response 𝑦ᵢ. The differences 𝑦ᵢ − 𝑓(𝐱ᵢ) for all observations 𝑖 = 1, …, 𝑛 are called residuals. The
goal of regression is to find the best predicted weights, i.e., those that yield the lowest residuals.

To measure the accuracy of the model, the coefficient of determination can be analyzed. The coefficient
of determination, written as 𝑅², represents the proportion of the variation in 𝑦 that can be explained by its
dependence on 𝐱 using the regression model. A higher 𝑅² indicates a better fit, meaning that the model can
better explain the variation of the output with changing inputs (Real Python, 2020).
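A minimal sketch of fitting and scoring such a model with scikit-learn, the library used in this project. The data here is synthetic, standing in for the six environmental features; the coefficients and noise level are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# 200 synthetic samples with 6 features standing in for DO, pH, conductivity, etc.
X = rng.normal(size=(200, 6))
true_weights = np.array([3.0, -2.0, 0.5, 1.5, -1.0, 0.2])
y = X @ true_weights + 10.0 + rng.normal(scale=0.1, size=200)  # y = b0 + b.x + noise

model = LinearRegression().fit(X, y)
r_squared = model.score(X, y)  # coefficient of determination R^2
```

On this nearly linear synthetic target the fit is close to perfect; on real water quality data, 𝑅² reflects how linear the true relationship actually is.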

Regression Tree (Decision Tree Regression)

Decision Trees are one of the most well-known methods for supervised learning. They can be used to
solve both Regression and Classification tasks. In this case, the regression algorithm must be used.

A Decision Tree is a tree-structured model with three types of nodes. The root node is the initial node; it
represents the entire sample and can be split into further nodes. Interior nodes correspond to features of the
dataset, and branches represent decision rules. Leaf nodes represent the outcomes. This algorithm is very
useful for solving decision-related problems.

Any input data point is passed through the tree by answering a series of true/false questions until it
reaches a leaf node. The final prediction is the average of the values of the dependent variable in that
particular leaf node. Across many such splits, the tree is able to predict an accurate value for a set of
inputs.

Unlike linear regression, a linear relationship is not assumed. The accuracy of the regression tree is
measured in the same way as for the linear regression algorithm above, with the coefficient of determination,
𝑅² (K, 2020).
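A corresponding sketch with scikit-learn's DecisionTreeRegressor. The feature values and target function here are illustrative assumptions, not the project's data; the nonlinear target shows what a tree can capture that a line cannot.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(size=(300, 6))             # 6 synthetic feature columns
y = 5.0 * X[:, 0] + np.sin(4.0 * X[:, 1])  # nonlinear target a tree can capture

tree = DecisionTreeRegressor(max_depth=8, random_state=0).fit(X, y)
r_squared = tree.score(X, y)      # R^2, the same metric as for linear regression
prediction = tree.predict(X[:1])  # the average of the matching leaf node
```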

Random Forest

Decision trees are powerful models that can be used for many regression and classification tasks, but one
of their main drawbacks is that they are prone to overfitting. It can be very easy for a decision tree to
memorize the training data and be unable to generalize and make predictions on testing data. However, by
decreasing the size of the tree too much, its prediction accuracy might decrease as well. It is difficult to
find the middle ground where the tree makes accurate predictions but also does not overfit.

Instead of using just one decision tree, it is possible to use multiple. A random forest combines a large
number of weak, shallow decision trees that operate as an ensemble (Yiu, 2019). Random forests use
bagging, meaning that each decision tree is trained on a different bootstrap sample of the data. The random
forest averages the outputs of all of these individual decision trees to make a prediction.
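A sketch of the same idea with scikit-learn's RandomForestRegressor, again on illustrative synthetic data. Each of the 100 trees is fit on a bootstrap sample, and the forest's prediction averages their outputs.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
X = rng.uniform(size=(300, 6))
y = 5.0 * X[:, 0] + np.sin(4.0 * X[:, 1])

# 100 bagged trees; each is fit on a bootstrap sample of the training data,
# and the forest's prediction is the average of the individual trees' outputs.
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
r_squared = forest.score(X, y)
```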

Bayesian Ridge Regression

The output of a Bayesian regression model is drawn from a probability distribution, in contrast to
traditional regression techniques, where the output is estimated as a single value for each input. The
output 𝑦 is assumed to be generated from a normal distribution. The goal of Bayesian regression is to
discern the posterior distribution of the model parameters, not merely point estimates of the parameters
themselves. The expression for the posterior is:

Posterior = (Likelihood × Prior) / Normalization

where the posterior is the probability of the model parameters given the observed data, the prior is the
probability of the parameters before the data is observed, and the likelihood is the probability of the
data given the parameters.

This is equivalent to Bayes’ Theorem:

P(A|B) = P(B|A) · P(A) / P(B)

where A and B are events, P(A) is the probability of occurrence of A, and P(A|B) is the probability of A
given that event B has already occurred. P(B), the probability of event B occurring, cannot be 0 since it
has already occurred.

The posterior distribution of the model parameters is proportional to the likelihood of the data multiplied
by the prior probability of the parameters. As the number of data points increases, the likelihood grows
and comes to dominate the prior. The regression process starts with an initial estimate (the prior value);
as more data points are analyzed, the model becomes more accurate. Bayesian Ridge Regression therefore
needs a large amount of training data to be accurate.

For Bayesian regression to obtain a fully probabilistic model, the output 𝑦 is assumed to be
Gaussian-distributed around 𝐗𝑤:

p(𝑦 | 𝐗, 𝑤, 𝛼) = 𝑁(𝑦 | 𝐗𝑤, 𝛼)

where alpha (𝛼) is the noise precision, a hyperparameter treated as a random variable to be estimated from
the data. Bayesian Ridge Regression additionally places a spherical Gaussian prior over the coefficients 𝑤:

p(𝑤 | 𝜆) = 𝑁(𝑤 | 0, 𝜆⁻¹𝐈)

where lambda (𝜆) is the precision of the weights. Both 𝛼 and 𝜆 are in turn given Gamma distribution priors
and are estimated jointly with 𝑤 during the fit (GeeksForGeeks, 2020).

The Bayesian approach can be combined with any regression technique, such as Linear Regression, Lasso
Regression, etc. This algorithm uses the ridge variant. Ridge Regression is a way to create a parsimonious
model (i.e., one using the fewest parameters) when a dataset has multicollinearity (i.e., correlations
between predictor variables). This may apply to water quality parameters, as the physical
measurements may be related to each other.
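A minimal sketch with scikit-learn's BayesianRidge on synthetic data that includes a deliberately collinear feature pair (the data and coefficients are illustrative assumptions). Note that the prediction comes back as a distribution: a mean and a standard deviation.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 6))
X[:, 5] = X[:, 0] + rng.normal(scale=0.01, size=200)  # multicollinear feature pair
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = BayesianRidge().fit(X, y)
# The prediction is a distribution: ask for its mean and standard deviation.
mean, std = model.predict(X[:5], return_std=True)
```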

Optimization Techniques

Based on the performance of the original models, the following optimization techniques were utilized to
further improve accuracy.

Regression Tree Optimization

As stated previously, decision trees are highly prone to overfitting. If fit on the training data using the
default scikit-learn hyperparameters, the decision tree will not generalize well on unseen data (in this
case, the testing data). Therefore, tuning these hyperparameters is essential to ensure that the decision tree
does not overfit.

The main problem with hyperparameter tuning is that it is unclear which combination of hyperparameter
values will increase the accuracy of predictions the most. Therefore, a grid search can be performed in
order to find the best possible combination of hyperparameter values. For the decision tree, two of the
most important scikit-learn hyperparameters were grid-searched: “max_depth” and “min_samples_split.”
The first controls the depth of the tree; if a tree grows too deep, it is very prone to memorizing the
training data and therefore overfitting. The second controls the minimum number of samples a node must
contain before it is allowed to split, which also ensures that the tree does not grow too large.

Using scikit-learn’s “GridSearchCV,” 50 possible “max_depth” values and 50 possible
“min_samples_split” values were set, along with a scoring metric of R-squared and 10-fold
cross-validation. While the unoptimized decision tree resulted in an R-squared score of 0.9538, the
optimized decision tree resulted in an R-squared score of 0.9595.
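The search can be sketched as follows on synthetic stand-in data. The grids here are much smaller than the 50-by-50 search described above, purely to keep the example fast.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(4)
X = rng.uniform(size=(200, 6))
y = 4.0 * X[:, 0] + X[:, 1] ** 2

# Much smaller grids than the 50 x 50 search described above, for speed.
param_grid = {
    "max_depth": [2, 4, 6, 8],
    "min_samples_split": [2, 5, 10, 20],
}
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    scoring="r2",  # same metric as in the paper
    cv=10,         # 10-fold cross-validation
)
search.fit(X, y)
best_tree = search.best_estimator_  # refit on all data with the best combination
```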

Random Forest Optimization

Random forests are not as prone to overfitting as decision trees, but this does not mean that they cannot
be optimized further, so a grid search was performed on the random forest model as well. 5 possible
“max_depth” values, 25 possible “min_samples_split” values, and 25 possible “n_estimators” values were
set. The “n_estimators” hyperparameter controls the number of individual decision trees in the forest.
4-fold cross-validation was chosen (due to runtime constraints), and the scoring metric was R-squared.
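A corresponding sketch for the random forest search on synthetic stand-in data, again with a deliberately tiny grid (the real search used 5 x 25 x 25 values).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(5)
X = rng.uniform(size=(150, 6))
y = 4.0 * X[:, 0] + X[:, 1] ** 2

# A deliberately tiny grid so the example runs quickly.
param_grid = {
    "max_depth": [4, 8],
    "min_samples_split": [2, 10],
    "n_estimators": [10, 25],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="r2",
    cv=4,  # 4 folds, as in the paper, to bound runtime
)
search.fit(X, y)
```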

Metrics of Success

R-squared Score

As mentioned before, the R-squared score was the metric of accuracy for all of the models.

Correlation (otherwise known as “R”) is a number between -1 and 1, where +1 implies that an
increase in the independent variable(s) results in an increase in the output, -1 implies that an increase in
x results in a decrease in y, and 0 means that there is no relationship between x and y. R² is the
proportion of variation explained by the relationship between two variables and varies from 0 to 1. An
R value above 0.8 or below -0.8 indicates a strong correlation (Maklin, 2019). In this project,
reaching a high R-squared means that the model has successfully linked the environmental inputs to water
quality.

Mean Absolute Error

Mean Absolute Error is a fairly common metric for quantifying error. It measures the performance of models
applied to continuous variables, such as the Water Quality Index. Error is defined as the prediction error
of the model (Actual Value - Predicted Value).

This value is determined for each row of data; then the absolute values of the differences are averaged. The
formula for Mean Absolute Error is:

MAE = (1/𝑛) Σⱼ |𝑦ⱼ − 𝑦̂ⱼ|

where MAE is Mean Absolute Error, 𝑛 is the number of data points, 𝑦ⱼ is the actual value, and 𝑦̂ⱼ is the
predicted value.

Mean Absolute Error does not consider the direction of the error (i.e., whether the model underestimates or
overestimates). It is also a linear score, meaning that every individual error is weighted equally in the
average (Garmsiri, 2018).
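A quick sketch of the computation on a few hypothetical WQI values:

```python
import numpy as np

actual = np.array([72.0, 65.0, 80.0, 58.0])     # hypothetical true WQI values
predicted = np.array([70.0, 66.0, 77.0, 60.0])  # hypothetical model outputs

# Mean of the absolute prediction errors; direction is ignored.
mae = np.mean(np.abs(actual - predicted))
print(mae)  # 2.0
```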

Mean Squared Error

Mean Squared Error is almost identical to Mean Absolute Error, except that the difference between actual
and predicted values is squared instead of taken as an absolute value. Squaring emphasizes the larger
differences, as they become amplified by the squaring. This can provide a more informative measure of
error when large errors matter most. The formula for Mean Squared Error is:

MSE = (1/𝑛) Σᵢ (𝑦ᵢ − 𝑦̂ᵢ)²

where MSE is Mean Squared Error, 𝑛 is the number of data points, 𝑦ᵢ is the actual value, and 𝑦̂ᵢ is the
predicted value (freeCodeCamp.org, 2018).
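The computation on a few hypothetical WQI values, showing how squaring amplifies the larger errors:

```python
import numpy as np

actual = np.array([72.0, 65.0, 80.0, 58.0])     # hypothetical true WQI values
predicted = np.array([70.0, 66.0, 77.0, 60.0])  # hypothetical model outputs

# Squaring weights the largest error (3) more heavily than the smallest (1).
mse = np.mean((actual - predicted) ** 2)
print(mse)  # 4.5
```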



Root Mean Squared Error

The Root Mean Squared Error is exactly the same as the Mean Squared Error, but square-rooted:

RMSE = √( (1/𝑛) Σᵢ (𝑦ᵢ − 𝑦̂ᵢ)² )

Taking the square root returns the error to the units of the target, making this a good indicator of the
standard deviation of the errors, which can show whether the model is consistent in its accuracy or varies
based on input.

An important note about this metric is that a small Root Mean Squared Error is not necessarily adequate
proof that the model is sufficient. If the error is too small, this could signify that the model is
suffering from overfitting, meaning that it will only perform with high accuracy on the dataset it was
given (Moody, 2019). In our case, an overfit model would be poorly suited to being adapted to datasets
from other regions.

Results

Interpreting Error and Accuracy

Model                                        R-squared   Mean Absolute   Mean Squared   Root Mean
                                             Score       Error           Error          Squared Error

Linear Regression                            0.620       5.68            52.3           7.23
Regression Tree (No Grid Search)             0.954       0.750           6.35           2.52
Regression Tree (With Grid Search)           0.959       0.709           5.57           2.36
Bayesian Ridge Regression                    0.62        5.68            52.3           7.23
Random Forest Regressor (No Grid Search)     0.972       0.799           3.80           1.95
Random Forest Regressor (With Grid Search)   0.970       0.820           4.11           2.03

As shown by the results in the table, the Random Forest Regressor performed the best on the data. With
the highest R-squared score, well above 0.95, it found the strongest relationship between the features and
water quality. It also has the lowest Mean Squared Error and Root Mean Squared Error, though the
grid-searched regression tree achieved a slightly lower Mean Absolute Error. Surprisingly, the unoptimized
random forest had an R-squared value 0.002 higher than the optimized random forest. This may mean that
the grid search did not try enough values for each hyperparameter, or did not try values in the optimal
range for each hyperparameter. However, long runtimes and a lack of heavy-duty computational resources
prevented us from exploring further. Linear Regression and Bayesian Ridge Regression performed at a lower
level, most likely because they are simpler than the Random Forest Regressor. Their almost identical
results could be because they share a very similar technique, the main difference being that Bayesian Ridge
Regression is suited to datasets with collinearity, which may be the case among the features of this
dataset. Indeed, the environmental factors may be related to each other through complex webs of biological,
chemical, and ecological phenomena.

Conclusion

This project confirms that machine learning algorithms can be utilized to predict water quality from
chemical and environmental factors measured in Indian water sources. The Random Forest Regressor
model should be able to make accurate predictions on comparable data from India. Also, as there was no
significant sign of overfitting, this algorithm could be applied to datasets from geographically
diverse locations.

This early warning of inadequate water quality could be useful in many areas. Sufficient water quality is
paramount for drinking water sources, so this model could be used to indicate whether the main water
source of a region is safe for drinking, which could prevent a great deal of illness and death related to
contaminated water. Additionally, this could be of use to those in the fish farming industry. If water
quality is low enough to affect the fish with disease, farmers would lose profit and risk selling injurious
products. If this model can predict low water quality beforehand, farmers would be able to take preemptive
action against disease and pollution. Based on this project, extensions can be made by experimenting with
other datasets and other machine learning models, and by combining the machine learning software with
direct IoT sensors, to improve the prediction of water quality.

Access to Code

The code for this project is located at the following Github Repository:

https://github.com/IsitaT03/AI4ALL-SIG-Group-AI-and-Sustainability-1

References

Al-Akhir Nayan, Ahamad Nokib Mozumder, Joyeta Saha, Khan Raqib Mahmud, Abul Kalam Al Azad.
(2020). Early Detection of Fish Diseases by Analyzing Water Quality Using Machine Learning
Algorithms. International Journal of Advanced Science and Technology, 29(05), 14346 - 14358.
Retrieved from http://sersc.org/journals/index.php/IJAST/article/view/33228

FreeCodeCamp.org. (2018, October 08). Machine learning: An introduction to mean squared error and
regression lines. Retrieved February 11, 2021, from
https://www.freecodecamp.org/news/machine-learning-mean-squared-error-regression-line-c7dde9a26b93/

Garmsiri, S. (2018, September 06). Art of choosing metrics in Supervised models Part 1. Retrieved
February 09, 2021, from
https://towardsdatascience.com/art-of-choosing-metrics-in-supervised-models-part-1-f960ae46902e

GeeksForGeeks. (2020, September 02). Implementation of Bayesian Regression. Retrieved February 02, 2021,
from https://www.geeksforgeeks.org/implementation-of-bayesian-regression/

K, G. (2020, July 18). Machine Learning Basics: Decision Tree Regression. Retrieved February 02, 2021,
from https://towardsdatascience.com/machine-learning-basics-decision-tree-regression-1d73ea003fda

Maklin, C. (2019, July 21). R squared interpretation: R squared linear regression. Retrieved February 09,
2021, from
https://towardsdatascience.com/statistics-for-machine-learning-r-squared-explained-425ddfebf667

Moody, J. (2019, September 06). What does RMSE really mean? Retrieved February 11, 2021, from
https://towardsdatascience.com/what-does-rmse-really-mean-806b65f2e48e

Murty, M. (2021). Water Pollution in India: An Economic Appraisal. Retrieved February 12, 2021, from
http://www.idfc.com/pdf/report/2011/Chp-19-Water-Pollution-in-India-An-Economic-Appraisal.pdf

Real Python. (2020, November 26). Linear Regression in Python. Retrieved February 01, 2021, from
https://realpython.com/linear-regression-in-python/#linear-regression

Vrachimis, S. G., Kyriakou, M. S., Eliades, D. G. and Polycarpou, M. M. (2018). LeakDB : A benchmark
dataset for leakage diagnosis in water distribution networks. In Proc. of WDSA / CCWI Joint Conference
(Vol. 1).

Yiu, Tony. (2019, June 12). Understanding Random Forest. Retrieved February 12, 2021, from
https://towardsdatascience.com/understanding-random-forest-58381e0602d2

York, C. (2019, December 02). NYC drinking water quality distribution monitoring. Retrieved February
08, 2021, from
https://www.kaggle.com/new-york-city/nyc-drinking-water-quality-distribution-monitoring?select=Data_
Dictionary_DWQO_Distribution_Chem-Micro_101817.xls
