Air Quality Index Analysis & Prediction

IMPACT OF AIR POLLUTION ON OUR LIVES
Contents:
1. Abstract
2. Objective
3. Problem Statement
4. Introduction
5. Data Analysis
5 a. Part I
5 b. Part II
6. Methodology - I
6 a. Impact of Air pollution - Pre COVID.
7. Coding and Results.
8. Methdology - II
8 a. Impact of Air pollution - Post COVID.
9. Summary
10.Conclusion
ABSTRACT
Air pollution is increasing day by day. Mostly the chemical pollutants like CO2, SO2, NH3,
PMare the causes of the air pollution. The sources of these pollutants are Industries, vehicles,
Burning of fossil fuels e.t.c. This Document provides an detail description and analysis of factors
and their ratios affecting and lead to harmfulness to the people and other living organisms. Air
Quality index is the most important factor that should be considered. By considering it we can
estimate the effect rate of air pollution like severe, poor ,good. The data is given by Central
government Pollution board. I estimated Air quality Index by applying Machine Learning
Classification techniques Like Random Forest, Support Vector Machine and then Clustering analysis
for grouping the effect of the pollutants based on Air Quality Index. After data Analyzation is done
using Tableau tool for predicting impact of pollution after COVID’19 and pollutants percentage are
also analyzed using this tool. After that based on the effect of pollution. I can say what are the
harmful effects that we are going to face as per pollution group.
OBJECTIVE
The main objective of this project is to provide harmful effects of Air Pollution and the sources
that are causing it. Our goal is predict the impact of Air Pollution after three years of COVID’19
and analyze the pollution before three years of COVID’19.
PROBLEM STATEMENT
Predict the Air Quality Index (AQI) of the current data and compare with existing data. Group
the effect rate of pollution into good (0 – 50), Satisfactory (51-100) , Moderate (101-200), Poor
(200 – 300) and Very Poor (300 and above). Mention the Impact of air pollution and predict the
air pollution for next upcoming years.
INTRODUCTION
Air pollution may be described as contamination of the atmosphere by gaseous, liquid, or solid
wastes or by-products that can endanger human health and welfare of plants and animals, attack
materials, reduce visibility or produce undesirable odors. Although some pollutants are released
by natural sources like volcanoes, coniferous forests, and hot springs, the effect of this pollution
is very small when compared to that caused by emissions from industrial sources, power and heat
generation, waste disposal, and the operation of internal combustion engines. Fuel combustion is
the largest contributor to air pollutant emissions, caused by man, with stationary and mobile
sources equally responsible. The air pollution problem is encountered outdoor as well as indoor.
To read more about the Outdoor Air Pollution and to read more about the Indoor Air Pollution
The indoor air pollution came to our attention during 80's while outdoor air pollution has been
around for some time. The major pollutants which contribute to indoor air pollution include
radon, volatile organic compounds, formaldehyde, biological contaminants, and combustion by-
products such as carbon monoxide, carbon dioxide, sulfur dioxide, hydrocarbons.
The major pollutants which contribute to outdoor air pollution are sulfur dioxide, carbon
monoxide, nitrogen oxides, ozone, total suspended particulate matter, lead, carbon dioxide, and
toxic pollutants.
There are several reasons to worry about air pollution. Some are:
Air pollution affects every one of us.

Air pollution can cause health problems and, may be, death.
Air pollution reduces crop yields and affects animal life.
Air pollution can contaminate soil and corrode materials.
DATA ANALYSIS
PART-I
Tool Used: Tableau
In this Part, we discuss about the chemical pollutants which cause air pollution is collected and
entered in an csv file ,using tableau tool they are analyzed.
This part of the data analysis explains the brief historical data of air pollution like chemical
factors, annual death rates, different kinds of air pollution.
Fig 5a.1.1 Data:
Fig 5a.1.2 Tableau-tool analysis:

Smoke air pollution
Fig 5a.2.1 Data:
Fig 5a.2.2 Tableau analysis:
Transport and Industry Effects:
Fig 5a.3.1 Data:

Fig 5a.3.2 Tableau Analysis:
Fig 5a.4.1 Annual death rates:

Fig 5a.4.2 Tableau Analysis death rates:
PART-II
Technology &Tool Used: Python (Machine Learning) & Jupyter Notebook.
In this Part, we discuss about the chemical pollutants that cause air pollution and AQI. Air
Quality index is the main solution to detect the type of pollutiondiseases that cause effect the
lives of people and living organisms.
The data is taken from the Central Pollution of India and entered in an csv file.
The number of instances are 24022.(city.csv)
Training Data:
Testing Data(i):
Samples are taken and then air quality Index is predicted.
Instances: 90
On this data, we want to predict the air quality index and then we group them into five disease
stages, as we discussed earlier.
Testing Data(ii):
Samples are taken and then air quality Index is predicted.
Instances: 21
Factor:
Air Quality index : The total of all chemical pollutants *1.5
Let’s go to the Methodology to understand better.

METHODOLOGY I
Tool Used: Tableau.
Dataset: city.csv
Impact of Air pollution - Pre COVID.
Fig 6.1- AQI vs Year
Fig 6.2- AQI vs Pollution Remark

Fig 6.3 – AQI vs Cities
Methodology – I conclusion:
We can conclude that 95% of the pollution is decreased by 2019-2020.

CODING AND RESULTS
Technology Used: Python (Machine Learning)
Tool Used: Jupyter Notebook.
Training data is trained, and then test data is given as input to predict the results.
We are analyzing in three kinds. They are
(i) Prediction of Air Quality Index

(ii) Clustering the Air Quality Index and COVID
(iii) Marking the affect of pollution and disease messaging, as per the central government
standards.
Prediction of Air Quality Index

Train data: City.csv
No. of instances: 24022
Train data: TEST file

No. of instances: 90
Language: Python
Technique: Regression (Random Forest Regressor & Support Vector Machine)
Explanation is available in code fragment.

In [3]: import pandas as pd
In [4]: #loading the train data set of airquality(90 instances)

data=pd.read_csv('C:\sravan\city_day.csv')
In [7]: #missing data is removed

traindata1=traindata.dropna()
In [37]: #first we predict the air quality index by splitting our data as 80%trai
n data and 20% testing
#Then we apply regression techniqueto predict the air quality index base
d on all chemical pollutants.
#there after we apply cluster analysis
# and finally we want to predict what are the harmful affects that you a
re going to face like good,very poor e.t.c
In [38]: #first drop unwanted columns.
In [8]: traindata1.head(3)
Out[8]:
CITY DATE PM2.5 PM10 NO NO2 Nox NH3 CO SO2 O3 Benzen
1969 Amaravati 11/25/2017 81.40 124.50 1.44 20.50 12.08 10.72 0.12 15.24 127.09 0.2
1970 Amaravati 11/26/2017 78.32 129.06 1.26 26.00 14.85 10.28 0.14 26.96 117.44 0.2
1971 Amaravati 11/27/2017 88.76 135.32 6.60 30.85 21.77 12.91 0.11 33.59 111.81 0.2
In [9]: traindata2=traindata1.drop(['CITY','DATE','pollution range'],axis='colum

ns')
In [10]: traindata2.head(2)
Out[10]:
PM2.5 PM10 NO NO2 Nox NH3 CO SO2 O3 Benzene Toluene Xylene air
1969 81.40 124.50 1.44 20.5 12.08 10.72 0.12 15.24 127.09 0.20 6.50 0.06
1970 78.32 129.06 1.26 26.0 14.85 10.28 0.14 26.96 117.44 0.22 7.95 0.08
In [11]: #here prediction value(class label is air_quality index) so,make it into

target variable
target=traindata2['air_quality_index']
print(len(traindata2))
print(len(target))
4646
4646
In [47]: #traindata=traindata.drop(['air_quality_index'],axis='columns
In [13]: #Then split our traindata into training(80%) and testing (20%)
In [12]: from sklearn.model_selection import train_test_split
In [13]: X_train,x_test,Y_train,y_test=train_test_split(traindata2,target,test_si
ze=0.3)
#making our data into test and trainsets
In [14]: len(X_train)
Out[14]: 3252
In [15]: from sklearn.ensemble import RandomForestRegressor
In [16]: r=RandomForestRegressor(n_estimators=50)
In [21]: #model
r.fit(X_train,Y_train)
Out[21]: RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,

max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=Non
e,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=50,
n_jobs=None, oob_score=False, random_state=None,
verbose=0, warm_start=False)
In [18]: r.score(x_test,y_test)
Out[18]: 0.999758744001202
In [90]: #score obtained is 99.9% predicted.....
In [22]: res=r.predict(X_train)
res
Out[22]: array([374.12, 156. , 266.72, ..., 46. , 140. , 247.98])
In [23]: print(traindata)
CITY DATE PM2.5 PM10 NO NO2 Nox NH3

CO \
1969 Amaravati 11/25/2017 81.40 124.50 1.44 20.50 12.08 10.72
0.12
1970 Amaravati 11/26/2017 78.32 129.06 1.26 26.00 14.85 10.28
0.14
1971 Amaravati 11/27/2017 88.76 135.32 6.60 30.85 21.77 12.91
0.11
1972 Amaravati 11/28/2017 64.18 104.09 2.56 28.07 17.01 11.42
0.09
1973 Amaravati 11/29/2017 72.47 114.84 5.23 23.20 16.59 12.25
0.16
... ... ... ... ... ... ... ... ...
...
24018 Patna 4/27/2020 19.03 50.03 77.24 14.17 57.37 11.30
0.43
24019 Patna 4/28/2020 12.37 39.29 66.20 11.68 58.88 11.30
0.39
24020 Patna 4/29/2020 15.21 41.96 79.67 13.50 69.42 10.13
0.42
24021 Patna 4/30/2020 30.93 60.26 69.32 14.46 61.62 10.08
0.52
24022 Patna 5/1/2020 29.26 76.89 75.87 11.84 65.66 12.02
0.52
SO2 O3 Benzene Toluene Xylene air_quality_index \

1969 15.24 127.09 0.20 6.50 0.06 184.0
1970 26.96 117.44 0.22 7.95 0.08 197.0
1971 33.59 111.81 0.29 7.63 0.12 198.0
1972 19.00 138.18 0.17 5.02 0.07 188.0
1973 10.55 109.74 0.21 4.71 0.08 173.0
... ... ... ... ... ... ...
24018 9.83 23.31 0.66 3.22 0.16 109.0
24019 8.63 31.79 0.55 3.05 0.14 98.0
24020 9.37 33.08 0.69 1.24 0.73 111.0
24021 11.96 41.62 1.67 1.82 2.62 118.0
24022 7.86 35.56 2.28 1.93 2.75 118.0
pollution range
1969 Moderate
1970 Moderate
1971 Moderate
1972 Moderate
1973 Moderate
... ...
24018 Moderate
24019 Satisfactory
24020 Moderate
24021 Moderate
24022 Moderate
[4646 rows x 16 columns]
In [24]: #now lets take other test data for predicting air quality index
testdata=pd.read_csv('C:\sravan\TEST.csv')
In [25]: testdata
Out[25]:
STATE CITY DATE PM2.5 PM10 NO NO2 Nox NH3 CO SO2
Andhra
0 Rajamahendravaram 27/2/2019 31 49 16 4 10 0 49 0
Pradesh
1 assam gauhati 5/1/2019 18 19 10 29 16 44 19 0
2 assam gauhati 5/2/2019 30 31 12 2 20 17 31 0
4 assam gauhati 23/5/2019 31 31 12 2 20 17 31 0
3 assam gauhati 5/10/2019 43 42 11 2 24 19 42 0

... ... ... ... ... ... ... ... ... ... ... ...
86 Andhrapradesh Visakhapatnam 21/1/2020 90 0 22 6 8 23 0 0
87 Delhi Delhi 25/1/2020 89 0 67 0 0 23 0 0
88 Delhi Delhi 26/1/2020 88 0 45 4 5 35 0 0
amaravathi 1/4/2019 302 181 144 2 39 0 181 0

89 Andhra
Pradesh
90 Maharashtra Mumbai 12/2/2017 330 0 41 0 6 86 0 0
91 rows × 18 columns
In [26]: testdata=testdata.drop(['STATE','CITY','DATE','REMARK','HEALTH-IMPACT'],
axis='columns')
testdata
Out[26]:
predicted
PM2.5 PM10 NO NO2 Nox NH3 CO SO2 air quality O3 Benzene Toulene Xylene
index
0 31 49 16 4 10 0 49 0 287.80 3 0 0 0
1 18 19 10 29 16 44 19 0 287.80 44 0 0 0
2 30 31 12 2 20 17 31 0 439.18 50 0 0 0
3 43 42 11 2 24 19 42 0 446.16 57 0 0 0
4 31 31 12 2 20 17 31 0 436.68 49 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
86 90 0 22 6 8 23 0 0 252.76 67 0 0 0
87 89 0 67 0 0 23 0 0 130.52 45 0 0 0
88 88 0 45 4 5 35 0 0 241.26 67 0 0 0
89 302 181 144 2 39 0 181 0 152.94 78 0 0 0
90 330 0 41 0 6 86 0 0 160.54 52 0 0 0
In [28]: target1=traindata['air_quality_index']
traindata3=traindata2.drop(['air_quality_index'],axis='columns')
traindata3
Out[28]:
PM2.5 PM10 NO NO2 Nox NH3 CO SO2 O3 Benzene Toluene Xylene
1969 81.40 124.50 1.44 20.50 12.08 10.72 0.12 15.24 127.09 0.20 6.50 0.06
1970 78.32 129.06 1.26 26.00 14.85 10.28 0.14 26.96 117.44 0.22 7.95 0.08
1971 88.76 135.32 6.60 30.85 21.77 12.91 0.11 33.59 111.81 0.29 7.63 0.12
1972 64.18 104.09 2.56 28.07 17.01 11.42 0.09 19.00 138.18 0.17 5.02 0.07
1973 72.47 114.84 5.23 23.20 16.59 12.25 0.16 10.55 109.74 0.21 4.71 0.08
... ... ... ... ... ... ... ... ... ... ... ... ...
24018 19.03 50.03 77.24 14.17 57.37 11.30 0.43 9.83 23.31 0.66 3.22 0.16
24019 12.37 39.29 66.20 11.68 58.88 11.30 0.39 8.63 31.79 0.55 3.05 0.14
24020 15.21 41.96 79.67 13.50 69.42 10.13 0.42 9.37 33.08 0.69 1.24 0.73
24021 30.93 60.26 69.32 14.46 61.62 10.08 0.52 11.96 41.62 1.67 1.82 2.62
24022 29.26 76.89 75.87 11.84 65.66 12.02 0.52 7.86 35.56 2.28 1.93 2.75
In [29]: testing=RandomForestRegressor(n_estimators=50)
In [30]: testing.fit(traindata3,target1)
Out[30]: RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,

max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=Non
e,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=50,
n_jobs=None, oob_score=False, random_state=None,
verbose=0, warm_start=False)
In [121]: res=testing.predict(testdata)
res
Out[121]: array([287.8 , 287.8 , 439.18, 446.16, 436.68, 374.54, 439.46, 257.6 ,

261.3 , 151.88, 154.24, 182.34, 167.78, 158.42, 216.1 , 235.48,
159.02, 84.38, 120.54, 139.36, 122.84, 259. , 163.4 , 271.24,
302.16, 284.88, 220.02, 214.92, 290. , 232.42, 107.88, 158.8 ,
151.68, 219.86, 262.74, 376.44, 303.76, 286.04, 116.36, 117.02,
151.28, 139.9 , 86.6 , 157. , 218.88, 344.84, 246.8 , 131.38,
185.02, 339.94, 384.86, 159.1 , 406.88, 264.78, 283.36, 162.58,
131.34, 224.74, 249.44, 130.32, 129.5 , 158.94, 166.8 , 281.24,
178.24, 140.36, 187.14, 153.94, 334. , 145.48, 505.54, 494.8 ,
170.4 , 88.22, 183.48, 265.9 , 146.84, 146.14, 170.68, 141.84,
168.1 , 162.5 , 170.2 , 186.08, 170.52, 162.24, 252.76, 130.52,
241.26, 152.94, 160.54])
In [31]: testing.score(traindata3,target1)
Out[31]: 0.9924580809809389
In [32]: res=pd.DataFrame(res)
In [33]: res
Out[33]:
0
0 374.12
1 156.00
2 266.72
3 174.98
4 79.00
... ...
3247 41.02
3248 49.00
3249 46.00
3250 140.00
3251 247.98
In [34]: #now keep this in test(result) data set

testdata["predicted air quality index"]=res
In [35]: testdata.to_csv (r'C:\sravan\predicted_airquality_final.csv', index = Fa

lse, header=True)
In [36]: testdata
Out[36]:
predicted
PM2.5 PM10 NO NO2 Nox NH3 CO SO2 air quality O3 Benzene Toulene Xylene
index
0 31 49 16 4 10 0 49 0 374.12 3 0 0 0
1 18 19 10 29 16 44 19 0 156.00 44 0 0 0
2 30 31 12 2 20 17 31 0 266.72 50 0 0 0
3 43 42 11 2 24 19 42 0 174.98 57 0 0 0
4 31 31 12 2 20 17 31 0 79.00 49 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
86 90 0 22 6 8 23 0 0 45.00 67 0 0 0
87 89 0 67 0 0 23 0 0 222.02 45 0 0 0
88 88 0 45 4 5 35 0 0 69.00 67 0 0 0
89 302 181 144 2 39 0 181 0 105.00 78 0 0 0
90 330 0 41 0 6 86 0 0 137.00 52 0 0 0
In [37]: traindata3
FINAL RESULT:
Clustering the Air Quality Index vs COVID
We are considering AQI vs COVID for cluster the data and then group into 5
clusters.
They are good, satisfactory, Moderate, poor, Very poor.
good=cluster(0),satisfactory=cluster(2),poor=cluster(3),moderate=cluster(1),very
poor=cluster(4)
Data:
We used K Means Clustering Algorithm to cluster the data and scatter plot to
visualize the data.
I n [ 3 ] : import pandas a s p d
I n [ 4 2 ] : #loading the train data set of airquality(90 instances)

data= p d. read_csv( 'C: \ sravan \internship \\airpollution_cluster_analysis.cs
v ')
In [43]: data
Out[43]:
P M 2.5- P M 1 0- N O2- N H 3- S O 2- OZ O N E -
ST AT E C IT Y D AT E CO
AV G AV G A VG AV G AG AV G
An dhr a
0 a mar a v athi 1/1/2 01 9 1 90 13 1 10 7 4 42 0 63
Pr ad es h
Andhr a
1 a mar a v athi 1/2/2 01 9 1 88 13 1 11 0 4 40 0 62
Pr ades h
An dhr a
2 a mar a v athi 1/3/2 01 9 2 80 17 4 15 5 2 37 0 52
Pr ad es h
Andhr a
3 a mar a v athi 1/4/2 0 19 3 0 2 18 1 1 44 2 39 0 7 8 tr af
Pr ades h
An dhr a
4 a mar a v athi 1/6/2 01 9 2 85 16 0 12 1 3 19 0 71
Pr ad es h
... ... ... ... ... ... ... ... ... ... ...
86 An dhr a pr ad es h Vi s ak h ap atn a m 2/4/ 20 20 12 3 0 5 6 6 0 0 5 6 tr af
87 D el hi D el hi 4/1/20 20 43 0 76 4 0 0 76 tr af
88 D el hi D el hi 23/1/2 02 0 11 1 0 46 7 0 0 78 tr af
89 D el hi D el hi 25/1/2 02 0 89 0 67 0 0 23 45 tr af
90 D el hi D el hi 26/1/2 02 0 88 0 45 4 5 35 67 tr af
91 ro ws × 15 colu mn s
I n [ 4 4 ] : inputs =d a t a. d r o p ( 'AIR_QUALITY_INDEX' , a x i s =' c o l u m n s')
I n [ 4 5 ] : target =data[ 'AIR_QUALITY_INDEX' ]
I n [ 4 6 ] : target
Out[46]: 0 190
1 188
2 280
3 302
4 285
...
86 123
87 43
88 111
89 89
90 88
Name: AIR_QUALITY_INDEX, Length: 91, dtype: int64
inputs
In [47]:
Out[47]:
P M 2.5- P M 1 0- N O2- N H 3- S O 2- OZ O N E -
ST AT E C IT Y D AT E CO
An dhr a
0 a mar a v athi 1/1/2 01 9 1 90 13 1 10 7 4 42 0 63
Pr ad es h
Andhr a
1 a mar a v athi 1/2/2 01 9 1 88 13 1 11 0 4 40 0 62
Pr ades h
An dhr a
2 a mar a v athi 1/3/2 01 9 2 80 17 4 15 5 2 37 0 52
Pr ad es h
Andhr a
3 a mar a v athi 1/4/2 0 19 3 0 2 18 1 1 44 2 39 0 7 8 tr af
Pr ades h
An dhr a
4 a mar a v athi 1/6/2 01 9 2 85 16 0 12 1 3 19 0 71
Pr ad es h
... ... ... ... ... ... ... ... ... ... ...
87 D el hi D el hi 4/1/20 20 43 0 76 4 0 0 76 tr af
88 D el hi D el hi 23/1/2 02 0 11 1 0 46 7 0 0 78 tr af
89 D el hi D el hi 25/1/2 02 0 89 0 67 0 0 23 45 tr af
90 D el hi D el hi 26/1/2 02 0 88 0 45 4 5 35 67 tr af
In [ ]:
I n [ 4 8 ] : from sklearn.preprocessing i m p o r t L a b e l E n co d e r
#converting binary to nominal using labelencoder
I n [ 4 9 ] : le_fever = LabelEncoder()
inputs[ ' covid' ] = le_fever . fit_transform(inputs[ 'COVID' ] )
In [50]: inputs
Out[50]:
ST AT E C IT Y D AT E P M 2.5- P M 1 0- N O2- N H 3- S O 2- C O OZ O N E -
An dhr a
0 a mar a v athi 1/1/2 01 9 1 90 13 1 10 7 4 42 0 63
Pr ad es h
Andhr a
1 a mar a v athi 1/2/2 01 9 1 88 13 1 11 0 4 40 0 62
Pr ades h
An dhr a
2 a mar a v athi 1/3/2 01 9 2 80 17 4 15 5 2 37 0 52
Pr ad es h
Andhr a
3 a mar a v athi 1/4/2 0 19 3 0 2 18 1 1 44 2 39 0 7 8 tr af
Pr ades h
An dhr a
4 a mar a v athi 1/6/2 01 9 2 85 16 0 12 1 3 19 0 71
Pr ad es h
... ... ... ... ... ... ... ... ... ... ...
87 D el hi D el hi 4/1/2 0 20 4 3 0 7 6 4 0 0 7 6 tr af
88 D el hi D el hi 23/1/2 02 0 1 11 0 46 7 0 0 78 tr af
89 D el hi D el hi 25/1/2 02 0 89 0 67 0 0 23 45 tr af
90 D el hi D el hi 26/1/2 02 0 88 0 45 4 5 35 67 tr af
In [51]: target
Out[51]: 0 190
1 188
2 280
3 302
4 285
...
86 123
87 43
88 111
89 89
90 88
Name: AIR_QUALITY_INDEX, Length: 91, dtype: int64
In [52]: #making results for clustering analysis

thenres = inputs . drop([ 'STATE' ,'CITY' , ' D A T E ', ' P L A C E ', 'REMARK' , 'HEALTH - I M P A
C T ',' C O V I D ', ] , a x i s =' c o l u m n s' )
thenres
Out[52]:
PM 2.5- A V G P M 1 0- A V G N O2- AV G N H 3- A V G S O 2- A G C O OZ ON E - AV G c o v id
0 19 0 13 1 1 07 4 42 0 63 0
1 18 8 13 1 1 10 4 40 0 62 0
2 28 0 174 15 5 2 37 0 52 0
3 30 2 181 14 4 2 39 0 78 0
4 28 5 160 12 1 3 19 0 71 0
... ... ... ... ... ... ... ... ...
86 12 3 0 56 6 0 0 56 1
87 43 0 76 4 0 0 76 1
88 11 1 0 46 7 0 0 78 1
89 89 0 67 0 0 23 45 1
90 88 0 45 4 5 35 67 1
91 ro ws × 8 colum n s
I n [ 5 3 ] : thentarget =t h e n r e s [' c o v i d ' ]
I n [ 5 4 ] : thentarget
Out[54]: 0 0
1 0
2 0
3 0
4 0
..
86 1
87 1
88 1
89 1
90 1
Name: covid, Length: 91, dtype: int32
I n [ 5 5 ] : from sklearn.svm import SVC

s v m=S V C ( )
#predicting data within traindata using support vector machine
I n [ 5 6 ] : s v m.fit(thenres,thentarget)
C : \U s e r s\ rajesh \ anaconda3 \lib \ si t e- packages \ sklearn \ s v m\ base.py:193: Fut

ureWarning: The default value of gamma will change from 'auto' to 'scal
e' in version 0.22 to account better for unscaled features. Set gamma ex
plicitly to 'auto' or 'scale' to avoid this warning.
"avoid this warning.", FutureWarning)
Out[56]: SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,

decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
kernel='rbf', max_iter= - 1, prob ability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False)
In [57]: s v m.score(thenres,thentarget)
Out[57]: 1.0
I n [ 6 1 ] : from sklearn.cluster import K M e a n s

#i am using K Means algorithm for clustering
I n [ 6 3 ] : from matplotlib import pyplot a s plt

thenres
Out[63]:
PM 2.5- A V G P M 1 0- A V G N O2- AV G N H 3- A V G S O 2- A G C O OZ ON E - AV G c o v id
0 19 0 13 1 1 07 4 42 0 63 0
1 18 8 13 1 1 10 4 40 0 62 0
2 28 0 174 15 5 2 37 0 52 0
3 30 2 181 14 4 2 39 0 78 0
4 28 5 160 12 1 3 19 0 71 0
... ... ... ... ... ... ... ... ...
86 12 3 0 56 6 0 0 56 1
87 43 0 76 4 0 0 76 1
88 11 1 0 46 7 0 0 78 1
89 89 0 67 0 0 23 45 1
90 88 0 45 4 5 35 67 1
In [64]: thenres[ 'air_quality_index'] =target

thenres
Out[64]:
SO 2- OZ ON E-
PM 2.5- PM 10- N O 2- N H 3- CO co v id air _q u alit y_in d ex
AV G AVG AV G A V G AG AV G
0 19 0 13 1 1 07 4 42 0 63 0 19 0
1 18 8 13 1 1 10 4 40 0 62 0 18 8
2 28 0 17 4 1 55 2 37 0 52 0 28 0
3 30 2 18 1 1 44 2 39 0 78 0 30 2
4 28 5 16 0 1 21 3 19 0 71 0 28 5
... ... ... ... ... ... ... ... ... ...
86 123 0 56 6 0 0 56 1 123
87 43 0 76 4 0 0 76 1 43
88 11 1 0 46 7 0 0 78 1 11 1
89 89 0 67 0 0 23 45 1 89
90 88 0 45 4 5 35 67 1 88
In [66]: p l t.scatter(thenres[ ' c o v i d '],thenres[ 'air_quality_index' ])

#visualizing scatterplot before and after corona
p l t.t i t l e ( 'AIR QUALITY VS COVID' )
p l t.x l a b e l (' C O V I D ' )
p l t.y l a b e l (' A I R Q U AL I T Y I N DE X ' )
Out[66]: Text(0, 0.5, 'AIR QUALITY INDEX')
In [67]: k m=K M e a n s ( n_ c l u s t e rs = 5)
km
#dividing into 5 clusters
Out[67]: KMeans(algorithm='auto', copy_x=True, init='k - means++', max_iter=300,

n_clusters=5, n_init=10, n_jobs=None, precompute_distances='aut
o',
random_state=None, tol=0.0001, verbose=0)
In [68]: clus =k m. fit_predict(thenres[[ 'covid' , 'air_quality_i n d e x ']])

clus
#displaying the cluster data group
Out[68]: array([1, 1, 3, 3, 3, 3, 3, 4, 4, 4, 0, 0, 0, 0, 4, 4, 0, 0, 0, 0, 0, 4,
0, 4, 4, 4, 4, 4, 4, 0, 0, 4, 0, 4, 4, 1, 1, 4, 0, 0, 0, 0, 0, 0 ,
4, 1, 0, 0, 4, 1, 3, 1, 3, 4, 4, 4, 4, 1, 1, 0, 0, 4, 1, 4, 0, 0 ,
0, 0, 3, 0, 2, 3, 1, 0, 3, 4, 0, 0, 0, 0, 0, 4, 0, 4, 0, 4, 4, 0 ,
4, 4, 4])
In [69]: thenres[ 'grouped_pollutuion' ]=clus

thenres
#displaying in the dataset
#good=cluster(0),satisfactory=cluster(2),poor=cluster(3),moderate=cluste
r(1),very poor=cluster(4)
Out[69]:
PM 2.5- PM 10- N O 2- N H 3- SO 2- OZ ON E-
CO co v id air _q u alit y_in d ex g r o u p ed _ p o llu t u io n
0 190 13 1 10 7 4 42 0 63 0 19 0 1
1 188 13 1 11 0 4 40 0 62 0 18 8 1
2 280 17 4 15 5 2 37 0 52 0 28 0 3
3 302 1 81 144 2 39 0 78 0 30 2 3
4 285 1 60 121 3 19 0 71 0 28 5 3
... ... ... ... ... ... ... ... ... ... ...
86 12 3 0 56 6 0 0 56 1 12 3 4
87 43 0 76 4 0 0 76 1 43 0
88 11 1 0 46 7 0 0 78 1 11 1 4
89 89 0 67 0 0 23 45 1 89 4
90 88 0 45 4 5 35 67 1 88 4
In [76]: d f 1=thenres[thenres.grouped_pollutuion = = 0 ]
d f 2=thenres[thenres.grouped_pollutuion = = 1 ]
p l t.scatter(df1 . covid,df1[ 'air_quality_index' ],color ="green" )

p l t.scatter(df2 . covid,df2[ 'air_quality_index' ],color ="blue" )
plt .scatter(df3 .covid,df3[ 'air_quality_index'],color = "yellow" )
p l t.scatter(df4 . covid,df4[ 'air_quality_index' ],color ="red" )
p l t.scatter(df5 . covid,df5[ 'air_quality_index' ],color ="black" )
p l t.x l a b e l (' c o v i d ' )
p l t.y l a b e l (' a i r quality')
p l t.l e g e n d (' 2 3 4 0 1 ' )
Out[76]: <matplotlib.legend.Legend at 0xe6d0f50>
In [ ] :
In [ ] :
In [32]:
In [ ] :
In [86]:
In [ ] :
In [92]:
In [ ] :
In [ ] :
In [ ] :
In [ ] :
In [98]:
In [ ] :
In [ ] :
RESULT:
Marking the affect of pollution and disease as per the central government
standards (Category prediction)
Central Government Standards
Technology Used: Python Random forest Classifier
Train data:
Test data:
Predicted Result:
In [51 ]: i m p o r t p andas a s p d
f r o m mat plotl ib im port p yplot a s p lt
I n [13 7]: #loading the train data set o f air qualit y(90 instan ces)
t r a i n d a t a = p d . read_ csv( 'C : \sra van \ i nterns hip \ \ airpol lutio n_eff ect_ca use_t
raindata .csv' )
traindat a
O ut [ 1 3 7 ] :
STATE CI TY DATE PM2.5- PM10- NO2- NH3- SO2- CO OZO NE-
AVG AVG AVG AVG AG AVG
Andhra
0 amaravathi 1/1/2019 190 131.0 107 4 42 0 63
Pradesh
1 An dhr a
amaravathi 1/2/2019 188 131.0 110 4 40 0 62
Pradesh
Andhra
2 amaravathi 1/3/2019 280 174.0 155 2 37 0 52
Pradesh
3 An dhr a
amaravathi 1/4/2019 302 181.0 144 2 39 0 78 traf
Pradesh
Andhra
4 amaravathi 1/6/2019 285 160.0 121 3 19 0 71
Pradesh
... ... ... ... ... ... ... ... ... ... ...
86 Andhrapradesh Visakhapatnam 2/4/2020 123 0.0 56 6 0 0 56 traf
87 Delhi Delhi 4/1/2020 43 0.0 76 4 0 0 76 traf
88 Delhi Delhi 23/1/2020 111 0.0 46 7 0 0 78 traf
89 Delhi Delhi 25/1/2020 89 0.0 67 0 0 23 45 traf
90 Delhi Delhi 26/1/2020 88 NaN 45 4 5 35 67 traf
91 ro w s × 1 5 colu m n s
In [138] : #loading the test data s et of airq uality (19 i nstanc es)
t e s t d a t a =p d. r ead_c sv( ' C : \ srav an \in ternsh ip \ \ a irpoll ution _effe ct_cau se_te
stdata.c sv')
testdata
O ut [ 1 3 8 ] :
STATE CI TY DATE PM2.5- PM10- NO2- NH3- SO2- CO- O ZONE - P
AVG AVG AVG AVG AVG AVG AVG
0 Telangana Hyderabad 4/1/2020 110 94 25 3 2 32 32
2 Telangana Hyderabad 4/3/2020 66 73 7 3 5 27 17 i
t
10 Andhra Amaravati 4/1/2020 64 69 6 2 32 18 34
pradesh i
t
11 Andhra Amaravati 4/2/2020 48 57 6 2 27 - 26
pra de s h i
t
12 Andhra Amaravati 4/3/2020 50 59 5 2 28 - 17
pradesh i
Andhra
13 Rajamahendravaram 4/4/2020 56 56 9 2 10 28 37
pradesh
Andhra
14 Rajamahendravaram 4/5/2020 43 48 8 2 9 27 33 i
pradesh
Andhra
15 Rajamahendravaram 4/6/2020 34 40 7 2 9 27 17
pradesh
Andhra
16 Tirupati 4/7/2020 35 38 7 1 8 26 27 i
pradesh
17 Andhra Tirupati 4/8/2020 37 33 7 1 7 22 63 i

pra de s h
18 Andhra visa khapatnam 4/9/2020 23 37 33 2 9 6 26

pradesh
19 Andhra visa khapatnam 4/10/2020 42 71 48 2 7 6 22

pra de s h
In [139] : #scatter plot showi ng the stat e and its a ir qu ality index
p l t. s c a t t e r ( t r a i n d a t a[ ' A I R _ Q UA L I T Y _ I N D E X ' ] , t r a i n d a t a [' R E M A R K ' ] )
p l t. titl e( 'PO LLUTI ON REMA RK' )
p l t. xlab el( 'A IR_QU ALITY_ INDEX ' )
p l t. ylab el( 'P OLLUT ION REM ARK' )
O ut [ 1 3 9 ] : Tex t(0, 0.5, ' POLLU TION REMARK ')
I n [14 0]: #goal is to p redic t base d on air p olluti on we will say w hich level of po
llution you w ill b e affe cted.
I n [14 1]: # w e a r e u s i n g c l a s s i f i c a t i o n t e c h n i q u e f o r th is
I n [14 2]: t r a i n _ d a t a s e t = trai ndata . drop( [ 'HEA LTH - IM PACT' ,' S O 2- A G ' , ' CO' ,' DATE' , ' C I T
Y ','STAT E' ,'P LACE' , 'COVI D' ,'P M2.5 - A V G ' ,' PM10 - A V G ', ' NO2 - A VG' ,' NH3 -AV G' , ' O
Z O N E-AVG ' ],ax is ='c olumns ' )
# t e s t _ d a t a s e t = t e s t d a t a . d r o p ( [ ' H E A L T H - IMP ACT', 'SO2 - A G','C O','D ATE',' CIT
Y ' , ' S T A T E ' , ' P L A C E ' , ' C O V I D ' , ' P M 2 . 5 - AVG',' PM10 - AVG',' NO2 - A VG',' NH3 -AV G','O
Z O N E-AVG '],ax is='c olumns ')
I n [143]: train_da taset
O ut [ 1 4 3 ] :
AIR_QUALI TY_I NDEX REM ARK
0 190 moderate
1 188 moderate
2 280 poor
3 302 very poor
4 285 poor
... ... ...
86 123 moderate
87 43 good
88 111 moderate
89 89 satisfactory
90 88 satisfactory
91 ro w s × 2 col um n s
In [ ]:
I n [14 4]: f r o m skl earn. prepr ocessi ng im port LabelE ncode r

#convert ing b inary to no minal usin g labe lenco der
I n [14 5]: l e _ v a r= L abelE ncode r()

t r a i n _ d a t a s e t [ 'pol lution _effe ct_ca tegory ' ]= le _var . f it_tr ansfo rm(tra in_da
t a s e t ['R EMARK ' ] )
I n [14 6]: train_da taset

#it is c atego rized that 1=mod erate ,2=poo r,0=g ood,3= satis facto ry and 4 =
very poo r.
t r a i n _ d a t a s e t 1 = tra in_dat aset . drop( [ 'REMA RK' ], axis = ' colum ns' )
I n [147]: train_da taset 1
O ut [ 147] :
AIR_QUALI TY_I NDEX polluti on_effect_categor y
0 190 1
1 188 1
2 280 2
3 302 4
4 285 2
... ... ...
86 123 1
87 43 0
88 111 1
89 89 3
90 88 3
I n [136]:
-------- ----- ----- ------ ----- ----- ------ ----- ------ ----- ----- ------ -----
---
KeyError T r a c e b a c k ( m o s t r e c e n t c a l l la
st)
< i p y t h o n -inpu t - 1 3 6 -e8af5 3e925 a5> i n <mod ule>
- - - -> 1 train _data set3 = t rain_ datas et1 . dr op ( [' pollut ion_e ffect _categ ory' ]
,a x i s='c olumn s' )
~\ anacon da3 \ l ib \si te -pac kages \pand as \cor e \ fra me.py i n dr op (se lf, la bels,
axis, in dex, colum ns, le vel, inpla ce, er rors)
3995 l e v e l =le vel ,
3996 i n p l a c e = inpla ce ,
-> 3 9 9 7 e r r o r s =e rrors ,
3998 )
3999
~\ anacon da3 \ l ib \si te -pac kages \pand as \cor e \ gen eric.p y i n d r o p( self, label
s , a x i s , i n d e x , c o l u m n s , l e v e l , i n p l a c e , error s)
3934 f o r a x i s , lab els i n a x e s . item s ( ) :
3935 i f label s is not N o n e :
-> 393 6 o b j = ob j . _dr op_ax is ( lab els , a x i s , l e v e l = leve l ,
e r r o r s =e rrors )
3937
3938 i f inp lace :
~\ anacon da3 \ l ib \si te -pac kages \pand as \cor e \ gen eric.p y i n _drop _axis ( self,
l a b e l s , a x i s , l e v e l , error s)
3968 n e w _a x i s = ax is . dr op ( lab els , l e v e l = l e v e l , erro rs
=e r r o r s )
3969 else:
-> 3 9 7 0 n e w _a x i s = ax is . dr op ( lab els , e rrors = error s )
3971 r e s u l t = s e l f . rein dex (** { axis _name : new_ axis } )
3972
~\ anacon da3 \ l ib \si te -pac kages \pand as \cor e \ ind exes \ b ase.p y i n d r o p (s elf,
labels, error s)
5016 i f m a s k .a ny ( ) :
5017 i f error s ! = "i gnore" :
-> 5 0 1 8 r a i se Ke yErro r (f"{ labels [mask ]} no t foun d in axi
s ")
5019 i n d e x e r = in dexer [ ~ m a s k ]
5020 r e t u r n s e l f . dele te ( in dexer )
K e y E r r o r : "[' pollu tion_e ffect _cate gory'] not found in ax is"
I n [12 5]: #here i am us ing d ecisio n tre e cla sifier for classi fying the pollut ion r
emark.
f r o m skl earn impor t t r e e
I n [12 6]: f r o m skl earn. ensem ble im port Rando mFores tClas sifier
I n [12 7]: r a m= Rand omFor estCl assifi er(n_ estim ators = 100 )
I n [12 8]: r a m. fit( train _data set1,t arget _trai n_data set1)
O ut [ 1 2 8 ] : Ran domFo restCl assif ier(b ootstr ap=Tr ue, cl ass_w eight =None, crit erion= ' g i n
i',
m a x _ de p t h = N o n e , m a x _ f e a t u r e s = ' a u t o ' , m a x _ l e a f _ n o d
es=None,
m i n _ im p u r i t y _ d e c r e a s e = 0 . 0 , m i n _ i m p u r i t y _ s p l i t = N o n
e,
m i n _ sa m p l e s _ l e a f = 1 , m i n _ s a m p l e s _ s p l i t = 2 ,
m i n _ we i g h t _ f r a c t i o n _ l e a f = 0 . 0 , n _ e s t i m a t o r s = 1 0 0 ,
n _ j o bs = N o n e , o o b _ s c o r e = F a l s e , rand om_sta te=No ne,
v e r b os e = 0 , w a r m _ s t a r t = F a l s e )
I n [14 8]: t e s t i n g = testd ata . d rop([ ' HEALT H -IMP ACT' ,' SO2 - A VG' ,'C O - AVG ' ,'DA TE' ,'C ITY' ,
' S T A T E ' , 'PLAC E' ,'P M2.5 -A VG' , ' PM10 - A V G ' ,' NO2 - A VG' ,'N H3 -AV G' , 'O ZONE - A VG' ] ,
a x i s='co lumns ' )
I n [15 2]: t a r g e t _ t r a i n _ d a t a s e t = tra in_da taset [ 'poll ution _effec t_cat egory ' ]
target_t rain_ datas et
Out[152] : 0 1
1 1
2 2
3 4
4 2
..
86 1
87 0
88 1
89 3
90 3
Name: po lluti on_ef fect_c atego ry, L ength: 91, dtype: int3 2
In [155] : t r a i n _ d a t a s e t 1 = tra in_dat aset . drop( [ 'poll ution _effec t_cat egory ' , 'REM ARK'
] , a x i s= ' colum ns' )
train_da taset 1
Out[155] :
A IR _ QU A L IT Y_ IN D E X
0 190
1 188
2 280
3 302
4 285
... ...
86 123
87 43
88 111
89 89
90 88
In [97 ]:
In [98 ]: testing
Out[98]:
0 110
1 117
2 73
3 65
4 68
5 61
6 55
7 43
8 58
9 40
10 69
11 57
12 59
13 56
14 48
15 40
16 38
17 63
18 37
19 71
In [ ]:
I n [ 1 0 3 ] : tes ting
Out[103] :
0 110
1 117
2 73
3 65
4 68
5 61
6 55
7 43
8 58
9 40
10 69
11 57
12 59
13 56
14 48
METHDOLOGY II
We seen results about air pollution by considering different attributes like AQI and COVID
before and now COVID.
Now in this Methodology we want to predict the air pollution an deaths of people (after
COVID).
Tool Used: Tableau.
So we use tableau to predict the next year pollution an death rate, by considering each attribute in
city.csv file. So let’s recap the data set.
This dataset contain data from the year 2015 to May 2020(till present)
Let’s move on…..
Fig 8.1.1- AQI vs Year

Description:
AQI – 2015: 386,337
AQI – 2016: 489,903
AQI – 2017: 564,131
AQI – 2018: 1,005,646
AQI – 2019: 1,050,165
AQI – 2020: 3, 59, 407
Fig: 8.1.2: Predicting to 2021, 2022, 2023 and 2024
AQI – 2021: 2, 77, 570
AQI – 2022: 2, 67, 210
AQI – 2034: 2, 11, 211
AQI – 2024: 2, 34, 345

Fig 8.2.1 : Each chemical pollutants reaction on the environment and its prediction rate up to
2024
Fig 8.2.2 : Each chemical pollutants reaction on the environment and its prediction rate upto
2024
Summary of the data:

SUM (Benzene)
Sum: 51,465
Average: 10,293
Minimum: 4,956
Maximum: 19,768
Median: 9,281
Standard deviation: 6,118
First quartile: 5,154
Third quartile: 12,306
Skewness: 0.70
Excess Kurtosis: -0.86
SUM (NH3)
Sum: 358,869
Average: 71,774
Minimum: 44,766
Maximum: 107,020
Median: 62,112
Skewness: 0.33
SUM (NO)
Sum: 362,816
Average: 72,563
Minimum: 38,347
Maximum: 111,688
Median: 58,267
Skewness: 0.29
SUM (Toluene)
Sum: 142,619
Average: 28,524
Minimum: 12,710
Maximum: 52,022
Median: 16,467
Skewness: 0.43
SUM(Xylene)
Sum: 68,693
Average: 6,869
Minimum: 720
Maximum: 10,626
Median: 8,219
Skewness: -0.80
SUM (Air Quality Index)
Sum: 8,421,167
Average: 842,116.70
Minimum: 386,337
Maximum: 1,050,165
Median: 984,997.00
First quartile: 669,347.50
Third quartile: 984,997.00
Skewness: -0.94
Fig 8.3.1: Predicting Remark on Industry and traffic air pollution
Mostly we got satisfactory results. i.e pollution range : (above 50 but less than100)
Fig: 8.4 :Predicting industry and air pollution 2020-2024
We found mostly we get satisfactory results for the next four years.
Fig 8.5 cities vs remark
Similarly we obtained majority as satisfactory for the given cities for the next four years.
Fig 8.6.1 : Industry Pollution

Fig 8.7 : COVID vs Air pollution
Fig 8.8 Industry smoke prediction
Year =2020
Lower Prediction Interval for Suspended Particulate Matter (SPM)=-100.197425345
Upper Prediction Interval for Suspended Particulate Matter =161.558892074

(SPM) Suspended Particulate Matter (SPM)= 30.680733365
Year =2034
Lower Prediction Interval for Suspended Particulate Matter (SPM)= -186.356139481
Upper Prediction Interval for Suspended Particulate Matter =247.717606211
(SPM) Suspended Particulate Matter (SPM)= 30.680733365.

PREDICTION CONCLUSION
FINALLY, FOR THE NEXT FOUR YEARS BY CONSIDERING ALL THE FACTORS,
WE GOT PREDICTION AS “SATISFACTORY” (50-100 IS THE POLLUTION RANGE).
EFFECT: Minor breathing discomfort to sensitive people.

SUMMARY
1. Air Pollution Major sources are Traffic and Industry, which include PM2.5 and PM10
major chemicals.
2. Based on the Air Quality Index (AQI) Pollution is estimated and causes effects in living
organisms. Central Government standards are followed for formulating AQI.
3. Tableau analysis tool is used to analyze this data.
4. Air quality is predicted based on chemical pollutants and model is fitted on Training data
using Random Forest Regressor and trained on 2020 dataset.
5. After predicting the AQI, based on COVID estimation, they are clustered into 5
categories like good, satisfactory, poor, moderate and very poor.
6. The finally Classification technique is applied on my dataset to predict the type of disease
, the classification techniques are Support vector machine and random forest
Classifier.
7. For Future Prediction of Air Pollution, Tableau is used for forecasting the data till 2024
like each chemical occurance and overall AQI.
8. Industry pollution is also forecasted up to 2050.
9. Finally ,We can analyze and predict that for the upcoming years the air pollution in will
be “SATISFACTORY” , such that pollution can range mainly due to Industry and Traffic
or both by 50 -100
10. So the effect would be “ Minor breathing discomfort to sensitive people “.
11. Finally on an average, there are no major problems facing with air pollution, based on the
results we got.
CONCLUSION
The data is taken from Central Government of India. The best ensembling regression
Techniques like Random Forest, Bagging are used. Data is correctly analyzed using
tableau tool. The prediction results are approximately correct. There is no Code and
analysis Plagiarism.

Air Quality Index Analysis &amp; Prediction

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Air Quality Index Analysis &amp; Prediction

Uploaded by

Copyright:

Available Formats

IMPACT OF AIR POLLUTION ON OUR LIVES

Air pollution affects every one of us.

Fig 5a.1.1 Data:

Fig 5a.1.2 Tableau-tool analysis:

Fig 5a.2.1 Data:

Fig 5a.2.2 Tableau analysis:

Transport and Industry Effects:

Fig 5a.3.1 Data:

Fig 5a.4.1 Annual death rates:

The number of instances are 24022.(city.csv)

Samples are taken and then air quality Index is predicted.

Samples are taken and then air quality Index is predicted.

Air Quality index : The total of all chemical pollutants *1.5

Let’s go to the Methodology to understand better.

Impact of Air pollution - Pre COVID.

Fig 6.1- AQI vs Year

Fig 6.2- AQI vs Pollution Remark

We can conclude that 95% of the pollution is decreased by 2019-2020.

Tool Used: Jupyter Notebook.

We are analyzing in three kinds. They are

(i) Prediction of Air Quality Index

Prediction of Air Quality Index

Train data: TEST file

Explanation is available in code fragment.

In [4]: #loading the train data set of airquality(90 instances)

In [7]: #missing data is removed

In [38]: #first drop unwanted columns.

In [9]: traindata2=traindata1.drop(['CITY','DATE','pollution range'],axis='colum

In [11]: #here prediction value(class label is air_quality index) so,make it into

In [12]: from sklearn.model_selection import train_test_split

In [15]: from sklearn.ensemble import RandomForestRegressor

Out[21]: RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,

In [90]: #score obtained is 99.9% predicted.....

Out[22]: array([374.12, 156. , 266.72, ..., 46. , 140. , 247.98])

CITY DATE PM2.5 PM10 NO NO2 Nox NH3

SO2 O3 Benzene Toluene Xylene air_quality_index \

[4646 rows x 16 columns]

STATE CITY DATE PM2.5 PM10 NO NO2 Nox NH3 CO SO2

1 assam gauhati 5/1/2019 18 19 10 29 16 44 19 0

2 assam gauhati 5/2/2019 30 31 12 2 20 17 31 0

4 assam gauhati 23/5/2019 31 31 12 2 20 17 31 0

3 assam gauhati 5/10/2019 43 42 11 2 24 19 42 0

86 Andhrapradesh Visakhapatnam 21/1/2020 90 0 22 6 8 23 0 0

87 Delhi Delhi 25/1/2020 89 0 67 0 0 23 0 0

88 Delhi Delhi 26/1/2020 88 0 45 4 5 35 0 0

amaravathi 1/4/2019 302 181 144 2 39 0 181 0

90 Maharashtra Mumbai 12/2/2017 330 0 41 0 6 86 0 0

89 302 181 144 2 39 0 181 0 152.94 78 0 0 0

4646 rows × 12 columns

Out[30]: RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,

Out[121]: array([287.8 , 287.8 , 439.18, 446.16, 436.68, 374.54, 439.46, 257.6 ,

3252 rows × 1 columns

In [34]: #now keep this in test(result) data set

In [35]: testdata.to_csv (r'C:\sravan\predicted_airquality_final.csv', index = Fa

89 302 181 144 2 39 0 181 0 105.00 78 0 0 0

Clustering the Air Quality Index vs COVID

I n [ 4 2 ] : #loading the train data set of airquality(90 instances)

86 An dhr a pr ad es h Vi s ak h ap atn a m 2/4/ 20 20 12 3 0 5 6 6 0 0 5 6 tr af

I n [ 4 4 ] : inputs =d a t a. d r o p ( 'AIR_QUALITY_INDEX' , a x i s =' c o l u m n s')

I n [ 4 5 ] : target =data[ 'AIR_QUALITY_INDEX' ]

86 An dhr a pr ad es h Vi s ak h ap atn a m 2/4/ 20 20 12 3 0 5 6 6 0 0 5 6 tr af

86 An dhr a pr ad es h Vi s ak h ap atn a m 2/4/ 20 20 12 3 0 5 6 6 0 0 5 6 tr af

In [52]: #making results for clustering analysis

... ... ... ... ... ... ... ... ...

Air Quality Index Analysis & Prediction

Air Quality Index Analysis & Prediction