Sayantan Final Print Project Report

PROJECT REPORT ON
AIR QUALITY INDEX PREDICTION USING
MACHINE LEARNING ALGORITHM IN
KOLKATA
In partial fulfilment for the award of the degree
Of
Master of Business Administration
By
Sayantan Ghosh
Under the supervision of
Dr. Biswajit Biswas
Department of Business Administration
University of Kalyani
Kalyani 741235, India
July 2022
ACKNOWLEDGEMENT
I would like to extend a big thank you to all those who gave me the opportunity to complete this work. First of all, I would like to thank Dr. Biswajit Biswas and Dr. Indrajit Bhattacharya (Kalyani Government Engineering College), my mentors at the University of Kalyani, for helping me get started, for providing new ideas, and for their ongoing guidance and support throughout the project. Their guiding comments, practical suggestions and helpful information were very valuable, and I thank them for the precious time they spent answering my questions. Finally, I would like to thank all the people who assisted me, directly or indirectly, in completing this work effectively. Whenever a conflict of interest arose, it was resolved successfully.
Student Signature Faculty Signature
Date: Date:
CERTIFICATE OF ORIGINALITY
I, Sayantan Ghosh, Roll 99/MBA No. 200022 of 2022, a full-time bona fide student of the Master of Business Administration (MBA) programme of the University of Kalyani, Kalyani, Nadia, hereby certify that this project work, carried out by me during the 4th semester, and this report, submitted in partial fulfilment of the MBA degree, are my original work under the guidance of Dr. Biswajit Biswas. This project is not based on or reproduced from any other person's work or any earlier work undertaken at any other time or for any other purpose, and it has not been submitted anywhere else at any time.
Student Signature Faculty Signature
Date: Date:
Abstract
Air quality describes how free of pollutants the air in a particular area is. Air pollution has become a serious matter in India and in the rest of the world. This work focuses on methods for predicting air quality and forecasting the levels of the pollutants that cause air pollution. A model of the Air Quality Index (AQI) is built using different techniques and machine learning algorithms. The author applies algorithms such as Random Forest (RF), Linear Regression (LR) and Support Vector Regression (SVR). The data is collected from the Central Pollution Control Board (CPCB) and from AQI data sources. The model is intended to detect the most contaminated areas, where the concentration of pollutants is high. This helps in identifying which pollutants are present in high concentrations and what their impact on the environment is, since the model is created to support a clean ecosystem and healthy living.
List of Figures
Sl. No.  Figure
1   Package Import
2   Data Import
3   Data Description
4   Box-Plot Analysis
5   Scatter Plot of Ozone
6   Histogram
7   Scatter Plot of SO2 and NO2
8   Heatmap Analysis
9   Pairplot Analysis
10  Scatter Plot of PM 2.5
11  Frequency Check of Ozone
12  Actual vs Predicted Data of PM10
13  Actual vs Predicted Data of SO2
Contents
Chapter 1: Introduction
  1.1 Introduction
  1.2 Objective of the Project
Chapter 2: Literature Review
  2.1 Literature Review
  2.2 Research Gap
Chapter 3: Research Methodology
  3.1 Box-plot Analysis
  3.2 Heatmap
  3.3 Pairplot
  3.4 Scatter plot
  3.5 Matplotlib Histogram
  3.6 Linear Regression
Chapter 4: Data Analysis and Interpretation
  Results and Findings
Chapter 5: Managerial Implications, Limitations, Future Scope & Conclusion
  5.1 Managerial Implications
  5.2 Limitations
  5.3 Future scope
  5.4 Conclusion
Reference
Chapter 1: Introduction
1.1 Introduction
Air pollution is caused by the presence of compounds in the atmosphere that are hazardous to humans, to other living beings and to the environment. Gases, particles and biological molecules are among the various types of contaminants present in the mixture of air pollutants. Air pollution has become a serious matter in many cities in India. Every human being should know the quality of the air they breathe, so the CPCB developed the Air Quality Index (AQI) for cities in India. The AQI indicates whether the air in a particular area is polluted or not. It is a numeric value with which the government pollution board measures the level of air pollutants in the atmosphere. A higher AQI value indicates a higher concentration of pollutants, which can adversely affect human health.
According to the Central Pollution Control Board (CPCB), twelve pollutant parameters are present in the air, and the author has taken the most important ones, which are very harmful to the atmosphere: PM2.5, PM10, SO2, NO2, CO and ozone. The selection of these pollutants depends on the availability of AQI data, monitoring frequency and measurement methods. According to the CPCB, the AQI indicates the extent to which a particular area is polluted, so it reflects the actual air quality of our ecosystem, which is closely linked to a range of health issues.
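As a rough illustration of how an index value can be obtained from a pollutant concentration, the sketch below interpolates a PM2.5 reading linearly between breakpoints and takes the largest sub-index as the overall AQI. The breakpoint table, the function names and the cap above the last band are assumptions made for this example; they should be checked against the current CPCB tables and are not taken from this report.

```python
# Illustrative breakpoints for PM2.5 (24-hour average, in micrograms per cubic metre),
# written as (conc_low, conc_high, index_low, index_high). These values are an
# assumption for this sketch; verify them against the current CPCB tables.
PM25_BREAKPOINTS = [
    (0, 30, 0, 50),
    (30, 60, 50, 100),
    (60, 90, 100, 200),
    (90, 120, 200, 300),
    (120, 250, 300, 400),
]


def sub_index(concentration, breakpoints):
    """Linearly interpolate a pollutant concentration onto the index scale."""
    if concentration < 0:
        raise ValueError("concentration cannot be negative")
    for c_lo, c_hi, i_lo, i_hi in breakpoints:
        if c_lo <= concentration <= c_hi:
            return i_lo + (i_hi - i_lo) * (concentration - c_lo) / (c_hi - c_lo)
    # Above the last band: cap at the top of the listed scale (a simplification).
    return float(breakpoints[-1][3])


def overall_aqi(sub_indices):
    """Take the worst (largest) pollutant sub-index as the overall AQI."""
    return max(sub_indices.values())


pm25_index = sub_index(45, PM25_BREAKPOINTS)
aqi = overall_aqi({"PM2.5": pm25_index, "NO2": 40.0})  # the NO2 value is made up
```

Each pollutant would need its own breakpoint table; only PM2.5 is shown here to keep the example short.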
In this work the author focuses on AQI prediction models and uses six months of data for the Kolkata region, with Ballygunge as the focus area for testing among the monitoring centres. The work is carried out in Python, an open-source language, using Jupyter Notebook through Anaconda. Jupyter Notebook is an open-source web application for creating and sharing documents that contain live code, visualizations, text and equations. Anaconda provides a graphical user interface, Anaconda Navigator, that makes it easy to install and configure packages and to run notebook code in an isolated environment, known as a Conda Python environment.
In this work, the author imported the following packages for data processing and prediction: NumPy, pandas, matplotlib.pyplot, Seaborn, scikit-learn (sklearn) and warnings.
NUMPY
It is one of the most important packages for numerical computing in Python. It provides a multi-dimensional array object that can be used in various mathematical operations such as matrix operations, linear algebra and Fourier transforms.
PANDAS
This package is used in machine learning and data science for data analysis and data cleaning. It is well suited to handling messy real-world data. It offers fast, flexible data structures designed to make working with both relational and labelled data easy and intuitive.
MATPLOTLIB.PYPLOT
It is a collection of functions that make Matplotlib work like MATLAB. A pyplot function can, for example, create a figure, create a plotting area in the figure, or decorate the plot with labels. The pyplot API is generally less flexible than the object-oriented API, but it is quick for generating visualizations. Matplotlib builds on the NumPy library, and pyplot provides a state-based, MATLAB-like interface to it.
SEABORN
It is a Python data visualization library based on Matplotlib. It provides a high-level interface for drawing informative and attractive statistical graphics. It creates statistical plots with default styles and colour palettes, which makes the graphs more attractive and easier to understand. Its dataset-oriented API makes it easy to switch between different visual representations of the same variables.
WARNINGS
It alerts the developer to situations that are not necessarily exceptions, for example when a program uses obsolete elements such as deprecated functions or classes. It is built into Python and serves as an alternative to raising an exception.
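A minimal sketch of how the warnings module behaves is given below. The helper old_reader and the file name are hypothetical and are used only to show a warning being issued and recorded without stopping the program.

```python
import warnings


def old_reader(path):
    """Hypothetical helper that still works but is flagged as obsolete."""
    warnings.warn("old_reader is deprecated", DeprecationWarning, stacklevel=2)
    return path.upper()


# Record warnings instead of printing them, so the script can inspect them.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = old_reader("aqi.csv")  # the call still returns normally

first_category = caught[0].category
```

In notebooks, a call such as warnings.filterwarnings("ignore") is often used instead to silence such messages; whether the author did so is not stated in the report.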
1.2 Objective of the project
• To design a prediction model that helps in forecasting the air quality at any pollution control station in Kolkata.
• To gain a working knowledge of the Air Quality Index, since it is important to understand how air pollutants adversely affect human health.
Chapter 2: Literature Review
2.1 Literature Review:

Huiping Peng (2013):- This paper used hourly pollutant concentration data from Canada, with accuracy, efficiency and updatability as the key indicators. The forecasting method was the Extreme Learning Machine (ELM), a more recent algorithm that differs from the other models used for forecasting.
Akshaya A.C (et al. 2019):- The authors carried out a complete data analysis of pollutants, and their model can forecast the AQI of any region with more than 90% accuracy. The study used China AQI data for checking and investigation, and applied the Gradient Boosting algorithm to the concentration of each air pollutant and its percentage impact on the air.
Radhika M Patil (et al. 2020):- The main aim of this paper is to develop an AQI model and to find out the impact of polluted air on human life. The authors first calculated the AQI by combining the concentrations of different air pollutants into a single numerical value, known as the AQI value. They then carried out an aggregated index calculation, which helps in assessing the air quality condition and its impact on the environment. The study is largely a social experiment on the AQI using ANN, linear regression and logistic regression, and on how air pollution affects our lives.
Yona Maimury (et al. 2020):- This study used eleven years of data from three regions in Taiwan. The authors applied two machine learning methods for AQI prediction that are new and give better results than other machine learning algorithms. They created a model that illustrates the analysis and prediction of an air quality system.
Meng Dun (et al. 2020):- In this work the authors created a hybrid model using multivariable regression and support vector regression to predict the pollutants present in the air. They claimed that their hybrid model gives better accuracy than the single models available for AQI prediction.
Khalid Nahar (et al. 2020):- In this study the authors created a model using Decision Tree, SVM, KNN, RF and Logistic Regression that gives a daily observation of air pollutants, with an accuracy of not less than 92 percent as per the Ministry of Environment. They identified the most contaminated sites and the pollutant concentrations present there, and studied how to make the air pollutant-free for a clean ecosystem.
K. Kumar and B.P. Pande (2022):- In this study the authors used a Gaussian Naive Bayes model, which achieved higher forecasting accuracy than the Support Vector Machine model. They also created a model with XGBoost, which showed the strongest linear relationship between actual and predicted data.
S. Bhattacharya and SK Shahnwaz:- This work used AQI data for Delhi. The authors used a Support Vector Regression (SVR) model to forecast the pollutant levels in the Delhi AQI. Along with SVR they used the RBF (Radial Basis Function) kernel, which gave better results in forecasting and analysing the data. With this analysis their model predicts various pollutant levels with an accuracy of 93.4 percent.
2.2 Research Gap

From the literature review, the author found that most authors built models that forecast from the data currently available on the AQI website, but that no model can be used for future prediction of the AQI, so that precautions against air pollutants can be taken in advance. Most of the models were analysed with only a small dataset, whereas hyperparameter optimization with Random Forest can handle a large dataset during analysis. The author also found no model with a web-based terminal in which AQI prediction can be done with the help of the Predictive Model Markup Language (PMML) using Python.
Chapter 3: Research Methodology
In this work the author performed a statistical analysis of the parameters ozone, NO2, SO2, PM2.5 and PM10, which gives an idea of how much each pollutant is affecting our ecosystem. The analysis was performed in Jupyter Notebook with the help of machine learning techniques. The details are described below.
3.1 Box-Plot Analysis

It is a graphical representation of the numerical information in the data. It summarizes a dataset by five values: the minimum, the first quartile (Q1), the median, the third quartile (Q3) and the maximum. In this work the analysis produces a graph whose axes show the distribution of values for each variable (air pollutant) in the data. The box spans the interquartile range (IQR), the line inside the box is the median, the lower edge of the box is the lower quartile Q1, and the upper edge is the upper quartile Q3.
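The quantities a box plot draws can be computed directly, as in the sketch below. The PM10 readings are made up for illustration, and the 1.5 × IQR fence is the common plotting convention rather than something stated in this report.

```python
import numpy as np

# Made-up PM10 readings, used only to illustrate the calculation.
pm10 = np.array([42.0, 55.0, 61.0, 48.0, 95.0, 70.0, 66.0, 58.0, 210.0])

q1, median, q3 = np.percentile(pm10, [25, 50, 75])
iqr = q3 - q1  # height of the box

# Whiskers usually reach the most extreme points within 1.5 * IQR of the box;
# readings beyond these fences are drawn as individual outlier points.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = pm10[(pm10 < lower_fence) | (pm10 > upper_fence)]
```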
3.2 Heatmap
It is a two-dimensional graphical representation of data in which the individual values contained in a matrix are represented as colours. It is drawn with the Seaborn package imported in this work. Here, pollutants with higher values are represented in darker shades and lower values in lighter shades.
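A heatmap of this kind is commonly drawn from a correlation matrix. The sketch below, with made-up readings, computes the matrix that the colours would encode; whether the author plotted correlations or raw values is an assumption here.

```python
import pandas as pd

# Made-up readings, used only to illustrate the calculation.
df = pd.DataFrame({
    "NO2": [20.0, 25.0, 30.0, 35.0, 40.0],
    "SO2": [5.0, 6.0, 7.5, 8.0, 10.0],
    "Ozone": [60.0, 52.0, 47.0, 41.0, 30.0],
})

corr = df.corr()  # Pearson correlation between every pair of columns

# With Seaborn installed, the matrix could then be drawn with, for example:
#   import seaborn as sns
#   sns.heatmap(corr, annot=True)
```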
3.3 Pair plots

The pair plot function creates a grid of axes in which each variable in the data is shared across the y-axes of one row and the x-axes of one column, so that the relationship between every pair of variables in the dataset can be seen.
3.4 Scatter plot

In this work it helps in observing the correlation between variables, using dots to show the relationship between them. It plots data points against a horizontal axis and a vertical axis to display how one variable is connected to another.
3.5 Matplotlib Histogram
It visualizes a frequency distribution by dividing a numeric array into small, equal-sized bins and counting the values that fall into each bin. The resulting distribution is helpful for comparing different classes. Building a frequency table from a complete dataset, in order to learn how often different values occur, is a core skill in data science, and it is the main purpose of the Matplotlib histogram function, which is called on an axes subplot.
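The binning step behind a histogram can be sketched with NumPy as follows. The ozone readings, the number of bins and the range are made up for illustration.

```python
import numpy as np

# Made-up ozone readings, used only to illustrate the binning.
ozone = np.array([12.0, 18.0, 25.0, 31.0, 33.0, 47.0, 52.0, 58.0, 64.0, 79.0])

# Four equal-width bins over 0 to 80: [0, 20), [20, 40), [40, 60), [60, 80].
counts, edges = np.histogram(ozone, bins=4, range=(0, 80))
```

Here counts holds the number of readings in each bin and edges holds the bin boundaries; a plotted histogram draws one bar per bin with these heights.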
3.6 Linear Regression

It is one of the most widely used regression techniques in machine learning. It performs a statistical analysis that models the relationship between an independent variable and a dependent variable, and it predicts a response on the assumption that the two variables are linearly related. The author takes the response value (y) and the independent variable (x) from the dataset to obtain a prediction, together with a graphical analysis, of what the condition of air quality may be in future.
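A minimal sketch of fitting such a model with scikit-learn is shown below. Using the day number as the independent variable and a PM10 concentration as the dependent variable is an assumption for this example, and the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: day number as x, pollutant concentration as y.
days = np.arange(10).reshape(-1, 1)   # scikit-learn expects a 2-D feature array
pm10 = 50.0 + 2.0 * days.ravel()      # exactly linear, so the fit is easy to check

model = LinearRegression().fit(days, pm10)
next_day = model.predict(np.array([[10]]))[0]  # forecast one step beyond the data
```

The fitted slope (model.coef_) and intercept (model.intercept_) describe the straight line; real pollutant series are noisy and seasonal, so a straight line is only a first approximation.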
Chapter 4: Data Analysis and Interpretation
The Python packages used in the model are imported first for data analysis and prediction. Fig. 4.1 shows the imports: NumPy for mathematical operations, pandas for data analysis, matplotlib.pyplot for plotting histograms and scatter plots, and Seaborn for data visualization.
Figure 4.1: Package Import

In Fig. 4.2 the dataset is imported and its shape is viewed to check whether the data has been imported properly for analysis.
Figure 4.2: Data Import
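The import step might look like the sketch below. Since the original file is not reproduced here, a small in-memory CSV stands in for it; the column names and values are assumptions.

```python
import io

import pandas as pd

# Stand-in for the exported station file; column names and values are assumptions.
csv_text = """Date,PM2.5,PM10,NO2,SO2,Ozone
2022-01-01,88.0,160.0,41.0,9.0,22.0
2022-01-02,92.5,171.0,44.0,10.5,25.0
2022-01-03,79.0,149.0,38.0,8.0,28.0
"""

df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])
n_rows, n_cols = df.shape  # quick check that every row and column arrived
```

With a real file, the io.StringIO object would simply be replaced by the path to the CSV.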

In Fig. 4.3 the data is described and a statistical summary of the data frame is displayed. If the dataset contains numerical values, the data.describe() command produces a table of descriptive statistics for each column.
Figure 4.3: Data Description
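For reference, the sketch below shows what describe() returns for a small made-up data frame: one column per numeric variable and one row per statistic.

```python
import pandas as pd

# Made-up readings, used only to show the shape of the summary.
df = pd.DataFrame({
    "PM10": [160.0, 171.0, 149.0, 180.0],
    "SO2": [9.0, 10.5, 8.0, 12.5],
})

summary = df.describe()  # count, mean, std, min, quartiles and max per column
pm10_mean = summary.loc["mean", "PM10"]
```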
The box-plot analysis in Fig. 4.4 shows how the data is distributed in the dataset; it is a type of chart that displays the distribution of numerical data in terms of its quartiles.
Figure 4.4: Box-plot analysis
Using the plotting support in pandas, the scatter plot of ozone is displayed in Fig. 4.5 from pairs of data points in which a relationship is to be identified. The graph is drawn with the independent variable on the horizontal axis and the dependent variable on the vertical axis.
Figure 4.5: Scattering Plot
The histogram of PM10 is shown in Fig. 4.6. It displays the distribution of the numerical data, with the range of values divided into a series of intervals.
Figure 4.6: Histogram
In Fig. 4.7 the scatter plot takes NO2 on the x-axis and SO2 on the y-axis, which allows the two variables to be compared and any common pattern in the scattered points to be identified.
Figure 4.7: Scattered Plot
The heatmap analysis is performed in Fig. 4.8, where a two-dimensional graphical representation is shown in the form of a matrix whose values are represented by colours, with the values mapped onto the rows (indices) and columns of the chart.
Figure 4.8: Heatmap Analysis
The pair plot function of the Seaborn package is applied in Fig. 4.9, which helps in understanding the relationship between each pair of variables.
Figure 4.9: Pair plots Analysis

The scatter plot of PM2.5 is shown in Fig. 4.10.
Figure 4.10: Scattering plot Of PM 2.5
The histogram in Fig. 4.11 shows the frequency distribution of the ozone readings.
Figure 4.11: Frequency check of Ozone

After training and testing, the model is used to predict the pollutants. The author took five months of data, of which three months were used as actual data and two months were used for prediction. Fig. 4.12 shows PM10 over time, where the blue line is the actual data and the orange line is the predicted data. Because the two months of readings were already available, the predicted data was checked manually against the original dataset. The prediction appears to be about 90% accurate, which suggests that the prediction model is performing satisfactorily.
Fig. 4.12: Actual vs Predicted data of PM10
In Fig. 4.13 the prediction is done for SO2 using the same process as for the previous prediction model, and it gave about 90% accuracy on the dataset. This suggests that the model is performing well.
Fig 4.13: Actual vs. Predicted data of SO2
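The hold-out check described for PM10 and SO2 can also be quantified rather than judged by eye. The sketch below is one possible way to do so, under stated assumptions: the series is synthetic, the 90-day split only approximates "three months", and reading "accuracy" as 100% minus the mean absolute percentage error is an interpretation, not the author's stated method.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

rng = np.random.default_rng(0)
days = np.arange(150).reshape(-1, 1)  # about five months of daily readings
pm10 = 120.0 + 0.3 * days.ravel() + rng.normal(0.0, 5.0, size=150)  # synthetic

split = 90  # fit on roughly the first three months, hold out the rest
model = LinearRegression().fit(days[:split], pm10[:split])
predicted = model.predict(days[split:])
actual = pm10[split:]

mae = mean_absolute_error(actual, predicted)            # average error, in data units
mape = np.mean(np.abs((actual - predicted) / actual))   # a fraction, not a percent
r2 = r2_score(actual, predicted)
accuracy_percent = 100.0 * (1.0 - mape)
```

Splitting by time, rather than shuffling, keeps the test period strictly after the training period, which matches how a forecast would be used in practice.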
Chapter 5: Managerial Implications, Limitations,
Future scope & Conclusion
5.1 Managerial Implications
1. In this work, the author builds a hybrid model using machine learning algorithms, which will help in predicting the AQI in contaminated areas and polluted cities.
2. With this model, the work of pollutant control can succeed, provided the prediction is done correctly using proper algorithms.
3. In short, the model will help in predicting the AQI, in controlling pollutants in the environment and in creating a clean ecosystem.
5.2 Limitations
1. As the AQI data was taken from a government site, the author worked with static data only; working with real-time data using cloud computing could give better results.
2. The AQI analysis was done for Kolkata only, which involves a small amount of data; for future work with a large amount of data, a Genetic Algorithm may be a better method.
3. The model was created with a small amount of data, so its complexity is very high, which will affect the prediction analysis.
5.3 Future scope

1. A model can be created that also predicts future data with better accuracy by using deep learning techniques.
2. A web-based terminal can be created for AQI prediction using cloud computing, where real-time data is used for forecasting.
3. A model needs to be created that works not only on predicting air pollutants but also on meteorological parameters for analysing the concentration level.
4. This model uses machine learning algorithms; if the prediction is done with Artificial Neural Networks, the results may be better than those of this model.
5.4 Conclusion
The purpose of this work is to understand the Air Quality Index (AQI), as it indicates how harmful the air in our ecosystem is to our health. The work is done with machine learning algorithms using Python, as this gives good results in prediction. Predicting pollutant levels is quite difficult because the data changes dynamically over time. However, the importance of predicting air pollutant levels has increased because of their impact on the environment. In this project the author used Linear Regression to predict the levels of pollutants such as NO2, SO2, PM2.5, PM10 and ozone; Air Quality Index (AQI) data for Kolkata is available on the CPCB website. As a next step, the author would like to forecast with, and compare the performance of, other machine learning methods such as Artificial Neural Networks (ANN) and genetic algorithms. The author would also like to work with real-time data using cloud computing and to use hyperparameter optimization for larger datasets.
Reference
1. Meng Dun et al. (2020), 'Short-Term Air Quality Prediction Based on Fractional Grey Linear Regression and Support Vector Machine', Volume 2020, Article ID 8914501. https://doi.org/10.1155/2020/8914501
2. Akshaya A.C et al. (2019), 'Indian Air Quality Prediction and Analysis using Machine Learning', Volume 14, Number 11, 2019 (Special Issue), Research India Publications. http://www.ripublication.com
3. S. Bhattacharya and SK Shahnwaz, 'Using Machine Learning to Predict Air Quality Index in New Delhi'. https://arxiv.org/ftp/arxiv/papers/2112/2112.05753.pdf
4. Khalid Nahar et al. (2020), 'Air Quality Index Using Machine Learning – A Jordan Case Study', An International Journal of Advanced Computer Technology, 9(9), September 2020 (Volume IX, Issue IX)
5. Radhika M Patil et al. (2020), 'Prediction an air quality index data using machine learning and deep learning'. http://norma.ncirl.ie/5208/1/ruchitadattatraypatil.pdf
6. https://www.india.gov.in/official-website-central-pollution-control-board
7. https://cpcb.nic.in/
8. https://www.iqair.com/in-en/india/west-bengal/kolkata
