
MAULANA AZAD

NATIONAL INSTITUTE OF TECHNOLOGY BHOPAL

Predicting Air Pollution Level in a Specific City

Submitted By: -

1) Shubham Bauskar 151112221
2) Prajal Jain 151112239
3) Prashant Pandey 151112242
4) Apoorv Natani 151112222
5) Indrashish Roy 151112209

Submitted To: -

Dr. Nilay Khare
Dr. Mansi Gyanchandani
Introduction: -

The regulation of air pollutant levels is rapidly becoming one of the most important tasks for
the governments of developing countries, especially China and India. Among the pollutant
indices, fine particulate matter (PM2.5) is a significant one because it is a major concern for
people's health when its level in the air is relatively high. PM2.5 refers to tiny particles in the
air that reduce visibility and cause the air to appear hazy when levels are elevated.
However, the relationships between the concentration of these particles and meteorological
and traffic factors are poorly understood. To shed some light on these connections, advanced
techniques have been introduced into air quality research. These studies applied selected
techniques, such as Decision Tree, K-Nearest Neighbour, and Naïve Bayesian classifiers,
to predict ambient air pollutant levels based mostly on weather data.
This project attempts to apply some machine learning techniques to predict PM2.5 levels
based on a dataset consisting of daily weather parameters in Delhi, India. Due to the
uncertainty of the exact PM2.5 level, we simplified the problem into a binary classification
one, that is, classifying the PM2.5 level into "High" (> 120 ug/m3) and "Low"
(<= 120 ug/m3), and also into a 6-class classification. The threshold is chosen based on the
Air Quality Level standard in India, which sets 115 ug/m3 as mild-level pollution.

Problem Statement: -
The aim of this project is to predict the PFM (air quality) levels of a specific city in advance,
based on previous meteorological and PFM data obtained through various resources.
Predicting the air quality level in advance will help the government take early measures to
control the city's air pollution, and will help people take precautions to avoid the various
diseases caused by increased air pollution.
Objectives of the proposed Work: -

This project attempts to apply some machine learning techniques, such as the Naïve Bayesian
Classifier, K-Nearest Neighbour, and Decision Tree Classifier, to predict PM2.5 levels based
on a dataset consisting of daily weather in the chosen city. Due to the uncertainty of
the exact PM2.5 level, we have simplified the problem into a binary classification
one, that is, classifying the PM2.5 level into "High" (> x_threshold ug/m3) and
"Low" (<= x_threshold ug/m3). The value is chosen based on the Air Quality Level
standard in the chosen city, which sets x_threshold ug/m3 as mild-level pollution.
Some of the classifiers are trained on 6 class labels {1,2,3,4,5,6}, in which class labels 1-4
are considered non-harmful pollution levels whereas labels 5-6 represent severe pollution
levels.
In order to identify and forecast key parameters affecting air quality and propose
appropriate preventive strategies and policies, it is essential to systematically collect data
characterizing air quality. The data includes two parts: a training data set and a test data set.

Proposed Methodology: -

1) K-Nearest Neighbour (Predicting the PFM value): -


Algorithm:-
1) Split the data into 60% training data and 40% testing data
2) Find the k nearest points from the test tuple based on the Euclidean distances
between the testing and training tuples
3) Normalize the distances of the k training tuples identified in the previous step into the
range [0-1]
4) Weight the PFM value of each of the k tuples according to its normalized distance from
the testing tuple and store the weighted values in an array A
5) Compute the sum of the weighted PFM values stored in array A
6) Return the PFM value of the testing tuple as sum/k
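The steps above can be sketched in Python. One ambiguity: step 4 weights each neighbour's PFM value by its normalized distance, which would give the nearest neighbour zero weight, so the sketch assumes the intended weight is (1 - normalized distance), letting closer neighbours contribute more. Function and variable names are illustrative.

```python
import numpy as np

def knn_predict(X_train, y_train, x_test, k=5):
    """Distance-weighted k-NN regression of the PFM value (steps 2-6)."""
    # Step 2: Euclidean distance from the test tuple to every training tuple
    dists = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    idx = np.argsort(dists)[:k]                 # indices of the k nearest tuples
    d = dists[idx]
    # Step 3: scale the k distances into the range [0, 1]
    d_norm = (d - d.min()) / (d.max() - d.min() + 1e-12)
    # Steps 4-5: weight each neighbour's PFM value (assumed weight 1 - d_norm)
    # and store the weighted values in the array A
    A = y_train[idx] * (1.0 - d_norm)
    # Step 6: return sum / k as the predicted PFM value
    return A.sum() / k
```

Note that dividing the weighted sum by k (rather than by the sum of the weights) biases the prediction low; normalizing by the weight sum is a common alternative.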

2) K-Nearest Neighbour Classifier: -


Algorithm:-
1) Split the data into 60% training data and 40% testing data
2) Find the k nearest points from the test tuple based on the Euclidean distances
between the testing and training tuples
3) Maintain a count array 'A' with one entry per class label
4) For each of the k nearest training tuples, increment the count of its class label
5) Find the class label cl corresponding to the maximum count, and assign the
class of the test tuple as cl.
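The majority-vote steps above can be sketched as follows (names are illustrative):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_test, k=5):
    """k-NN classification by majority vote among the k nearest tuples."""
    # Step 2: Euclidean distances between the test tuple and all training tuples
    dists = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    idx = np.argsort(dists)[:k]
    # Steps 3-4: count occurrences of each class label among the k neighbours
    counts = Counter(y_train[idx])
    # Step 5: return the class label cl with the maximum count
    return counts.most_common(1)[0][0]
```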

3) Naïve Bayesian Classifier (Binary Classification): -


Algorithm:-
1) Split the data into 60% training data and 40% testing data
2) Compute the prior probability of each class label:
Prior probability = No. of instances belonging to that class / Total no. of instances
3) Compute the likelihood (in this case there are 4 attributes and 2 class labels, so
there will be 8 Gaussian curves)

The Gaussian distribution can be computed using the formula: -

f(x) = (1 / (sigma * sqrt(2 * pi))) * exp(-((x - u)^2) / (2 * sigma^2))

Where u = mean of the attribute over all tuples belonging to the same class
and sigma = standard deviation of the attribute over all tuples belonging to the same class
4) Compute the posterior probability of each class label for the test tuple; since the Naïve
Bayesian Classifier assumes that all attributes contribute independently,
Posterior Probability = Likelihood * Prior Probability
5) Assign the test tuple to the class label with the highest posterior probability
among all class labels.
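The training and classification steps above can be sketched as follows, assuming continuous attributes modelled by one Gaussian per attribute per class; the small constant added to sigma (to avoid division by zero) is an implementation detail not in the original, and the names are illustrative.

```python
import numpy as np

def train_nb(X, y):
    """Fit per-class priors and per-attribute Gaussian parameters (steps 2-3)."""
    model = {}
    for c in np.unique(y):
        Xc = X[y == c]
        model[c] = {
            "prior": len(Xc) / len(X),      # step 2: class prior
            "mean": Xc.mean(axis=0),        # u per attribute, per class
            "std": Xc.std(axis=0) + 1e-9,   # sigma per attribute, per class
        }
    return model

def nb_classify(model, x):
    """Steps 4-5: posterior = likelihood * prior; pick the best class."""
    best, best_p = None, -1.0
    for c, p in model.items():
        # Gaussian likelihood of each attribute, multiplied together
        # under the naive independence assumption
        like = np.prod(
            np.exp(-((x - p["mean"]) ** 2) / (2 * p["std"] ** 2))
            / (p["std"] * np.sqrt(2 * np.pi))
        )
        post = like * p["prior"]
        if post > best_p:
            best, best_p = c, post
    return best
```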

4) Decision Tree Regressor: -


The ID3 algorithm can be used to construct a decision tree for regression by replacing
Information Gain with Standard Deviation Reduction.
Algorithm:-
1) Split the data into 60% training data and 40% testing data
2) The standard deviation of the target is calculated.
In the resulting tree, each internal node applies a test to the tuple being classified,
each branch holds one outcome of that test, each leaf node holds a predicted value,
and the root node holds the attribute that best splits the data.
3) The dataset is then split on the different attributes. The standard deviation of each
branch is calculated, and the weighted sum of these standard deviations is subtracted
from the standard deviation before the split. The result is the standard deviation reduction.

Using Information Gain: -

Gain(S, A) = Entropy(S) - sum over v of (|S_v| / |S|) * Entropy(S_v),
where Entropy(S) = - sum over j of p_j * log2(p_j)

Using Gini Index: -

Gini(S) = 1 - sum over j of p_j^2

Where p_j is the probability (fraction) of tuples of the data set belonging to class
label j.
The attribute with the largest standard deviation reduction is chosen for the decision
node
4) The dataset is divided based on the values of the selected attribute. This process runs
recursively on the non-leaf branches until all data is processed. When the number of
instances at a leaf node is more than one, we take their average as the final value for
the target.
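The standard-deviation-reduction computation from step 3 can be sketched as follows, assuming categorical split attributes; names are illustrative.

```python
import numpy as np

def sdr(y, split_groups):
    """Standard deviation reduction for one candidate split (step 3)."""
    before = np.std(y)
    # Weighted sum of per-branch standard deviations after the split
    after = sum(len(g) / len(y) * np.std(g) for g in split_groups)
    return before - after

def best_attribute(X_cat, y):
    """Pick the categorical attribute column with the largest SDR."""
    scores = []
    for j in range(X_cat.shape[1]):
        groups = [y[X_cat[:, j] == v] for v in np.unique(X_cat[:, j])]
        scores.append(sdr(y, groups))
    return int(np.argmax(scores)), max(scores)
```

The chosen attribute becomes the decision node, and the same computation is applied recursively to each branch.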

5) Linear Regressor: -
Algorithm:-
1) Split the data into 60% training data and 40% testing data
2) Compute the coefficients of the equation
Y = a + bX1 + cX2 + dX3 + eX4
Where Y = the dependent attribute (the value to be predicted)
and Xi = the independent attributes

3) Y now holds the predicted PFM value of the test tuple
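The coefficients in step 2 can be fit by ordinary least squares; a minimal NumPy sketch (names illustrative):

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit of Y = a + b*X1 + c*X2 + ... (step 2)."""
    # Prepend a column of ones so the intercept a is fit alongside b, c, ...
    A = np.hstack([np.ones((len(X), 1)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef                       # [a, b, c, ...]

def predict_linear(coef, x):
    """Step 3: evaluate Y for one test tuple x = [X1, X2, ...]."""
    return coef[0] + coef[1:] @ x
```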


Results: -

Scope of Improvement: -
 The results depend on the amount of data, so improving the quality and quantity of the
data improves the results
 Accuracy can be increased by trying the various other classifiers that are available
 Before applying a classifier, remove the outliers and noise present in
the data
 In some cases, normalized data gives better results than the original data
References: -

[1] Dan Wei. "Predicting air pollution level in a specific city."


[2] Athanasiadis, Ioannis N., et al. "Applying machine learning techniques on air quality data for real-
time decision support." First international NAISO symposium on information technologies in
environmental engineering (ITEE'2003), Gdansk, Poland. 2003.

[3] Ioannis N. Athanasiadis, Kostas D. Karatzas and Pericles A. Mitkas. "Classification techniques for
air quality forecasting." Fifth ECAI Workshop on Binding Environmental Sciences and Artificial
Intelligence, 17th European Conference on Artificial Intelligence, Riva del Garda, Italy, August 2006.
