Contents:
1. Abstract
2. Objective
3. Problem Statement
4. Introduction
5. Data Analysis
5 a. Part I
5 b. Part II
6. Methodology - I
6 a. Impact of Air pollution - Pre COVID.
7. Coding and Results.
8. Methodology - II
8 a. Impact of Air pollution - Post COVID.
9. Summary
10. Conclusion
ABSTRACT
Air pollution is increasing day by day. Chemical pollutants such as CO2, SO2, NH3 and
particulate matter (PM) are its main causes. The sources of these pollutants include industry,
vehicles and the burning of fossil fuels. This document gives a detailed description and analysis
of these factors and of how their levels harm people and other living organisms. The Air
Quality Index (AQI) is the most important factor to consider: from it we can rate the effect of
air pollution as, for example, good, poor or severe. The data is provided by the central
government pollution board. I estimated the AQI with machine learning techniques such as
Random Forest and Support Vector Machine, and then applied cluster analysis to group the
effect of the pollutants by AQI. The data was then analyzed in Tableau to predict the impact of
pollution after COVID-19 and to examine the share of each pollutant. Finally, based on the
pollution group, I describe the harmful effects that we are likely to face.
OBJECTIVE
The main objective of this project is to describe the harmful effects of air pollution and the
sources that cause it. Our goal is to predict the impact of air pollution over the three years
after COVID-19 and to analyze the pollution over the three years before it.
PROBLEM STATEMENT
Predict the Air Quality Index (AQI) for the current data and compare it with the existing data.
Group the effect of pollution into Good (0-50), Satisfactory (51-100), Moderate (101-200), Poor
(201-300) and Very Poor (above 300). Describe the impact of air pollution and predict the
air pollution for the upcoming years.
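The grouping described above can be sketched as a small lookup function. This is only an illustration: the function name `aqi_category` is hypothetical, and treating 200 and 300 as the upper ends of Moderate and Poor is an assumption about how the band boundaries are meant to be read.

```python
def aqi_category(aqi):
    """Map an AQI value to one of the five bands used in this report."""
    if aqi < 0:
        raise ValueError("AQI cannot be negative")
    # Upper bounds are treated as inclusive (assumption about the boundaries).
    bands = [(50, "Good"), (100, "Satisfactory"), (200, "Moderate"), (300, "Poor")]
    for upper, label in bands:
        if aqi <= upper:
            return label
    return "Very Poor"


# AQI values that appear in the tables later in this report
for value in (43, 89, 190, 280, 302):
    print(value, aqi_category(value))
```

The remarks produced for 43, 89, 190 and 280 agree with the REMARK column shown in the notebook output later in the report.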
INTRODUCTION
Air pollution may be described as contamination of the atmosphere by gaseous, liquid, or solid
wastes or by-products that can endanger human health and welfare of plants and animals, attack
materials, reduce visibility or produce undesirable odors. Although some pollutants are released
by natural sources like volcanoes, coniferous forests, and hot springs, the effect of this pollution
is very small when compared to that caused by emissions from industrial sources, power and heat
generation, waste disposal, and the operation of internal combustion engines. Fuel combustion is
the largest contributor to air pollutant emissions, caused by man, with stationary and mobile
sources equally responsible. The air pollution problem is encountered outdoors as well as indoors.
Indoor air pollution came to attention during the 1980s, while outdoor air pollution has been
around for some time. The major pollutants which contribute to indoor air pollution include
radon, volatile organic compounds, formaldehyde, biological contaminants, and combustion by-
products such as carbon monoxide, carbon dioxide, sulfur dioxide, hydrocarbons.
The major pollutants which contribute to outdoor air pollution are sulfur dioxide, carbon
monoxide, nitrogen oxides, ozone, total suspended particulate matter, lead, carbon dioxide, and
toxic pollutants.
There are several reasons to worry about air pollution. Some are:
PART-I
Tool Used: Tableau
In this part, data on the chemical pollutants that cause air pollution is collected, entered in a
CSV file and analyzed with the Tableau tool.
This part of the data analysis gives a brief historical view of air pollution: chemical
factors, annual death rates and the different kinds of air pollution.
In this part, we discuss the chemical pollutants that cause air pollution and the AQI. The Air
Quality Index is the main means of identifying the type of pollution and the diseases that affect
the lives of people and other living organisms.
The data is taken from the central pollution board of India and entered in a CSV file.
Training Data:
Testing Data(i):
Instances: 90
On this data, we want to predict the air quality index and then group the results into the five
stages discussed earlier.
Testing Data(ii):
Instances: 21
Factor:
Dataset: city.csv
Methodology – I conclusion:
Training data is trained, and then test data is given as input to predict the results.
Language: Python
Technique: Regression (Random Forest Regressor & Support Vector Machine)
In [37]: #first we predict the air quality index by splitting our data into 70% training data and 30% testing data
#then we apply a regression technique to predict the air quality index based on all chemical pollutants
#thereafter we apply cluster analysis
#and finally we predict which level of harmful effect to expect, e.g. good, very poor, etc.
In [8]: traindata1.head(3)
Out[8]:
CITY DATE PM2.5 PM10 NO NO2 Nox NH3 CO SO2 O3 Benzen
1969 Amaravati 11/25/2017 81.40 124.50 1.44 20.50 12.08 10.72 0.12 15.24 127.09 0.2
1970 Amaravati 11/26/2017 78.32 129.06 1.26 26.00 14.85 10.28 0.14 26.96 117.44 0.2
1971 Amaravati 11/27/2017 88.76 135.32 6.60 30.85 21.77 12.91 0.11 33.59 111.81 0.2
In [10]: traindata2.head(2)
Out[10]:
PM2.5 PM10 NO NO2 Nox NH3 CO SO2 O3 Benzene Toluene Xylene air
1969 81.40 124.50 1.44 20.5 12.08 10.72 0.12 15.24 127.09 0.20 6.50 0.06
1970 78.32 129.06 1.26 26.0 14.85 10.28 0.14 26.96 117.44 0.22 7.95 0.08
4646
4646
In [47]: #traindata=traindata.drop(['air_quality_index'],axis='columns')
In [13]: #Then split our traindata into training (70%) and testing (30%)
In [13]: X_train,x_test,Y_train,y_test=train_test_split(traindata2,target,test_size=0.3)
#making our data into test and train sets
In [14]: len(X_train)
Out[14]: 3252
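The training-set size of 3252 can be reproduced from the 4646 rows and `test_size=0.3`. The sketch below assumes scikit-learn's behaviour of rounding the test share up and giving the remainder to the training set; the helper name `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test share is rounded up and the rest goes to training,
    which is how scikit-learn's train_test_split is understood to behave
    when only test_size is given.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


n_train, n_test = split_sizes(4646, 0.3)
print(n_train, n_test)
```

With 4646 rows this gives 3252 training rows, i.e. a 70/30 split rather than 80/20.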
In [16]: r=RandomForestRegressor(n_estimators=50)
In [21]: #model
r.fit(X_train,Y_train)
In [18]: r.score(x_test,y_test)
Out[18]: 0.999758744001202
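For a regressor, `score` reports the coefficient of determination (R²). The sketch below computes it by hand; the predictions are hypothetical and only the true AQI values come from the tables in this report. Note that `traindata2` still contains the `air_quality_index` column at this point (it is only dropped later, in cell In [28]), so if `target` is that same column the score above may be inflated by target leakage.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [190, 188, 280, 302, 285]   # AQI values shown in this report
y_pred = [185, 190, 275, 300, 290]   # hypothetical predictions, for illustration only
print(r2_score(y_true, y_pred))
```

A value close to 1 means the predictions explain almost all of the variance in the target; a perfect prediction gives exactly 1.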
In [22]: res=r.predict(X_train)
res
In [23]: print(traindata)
pollution range
1969 Moderate
1970 Moderate
1971 Moderate
1972 Moderate
1973 Moderate
... ...
24018 Moderate
24019 Satisfactory
24020 Moderate
24021 Moderate
24022 Moderate
In [24]: #now let's take other test data for predicting the air quality index
testdata=pd.read_csv(r'C:\sravan\TEST.csv')
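Windows paths written as ordinary string literals are fragile, because a backslash starts an escape sequence. `\s` and `\T` happen not to be recognized escapes, so the path above still resolves (recent Python versions warn about it), but a folder name starting with `t` or `n` would silently change the string. The paths in this sketch are hypothetical.

```python
from pathlib import PureWindowsPath

plain = 'C:\temp\new.csv'    # hypothetical path: \t and \n become tab and newline
raw = r'C:\temp\new.csv'     # raw string keeps every backslash literally

print(len(plain), len(raw))  # the plain literal is shorter: two escapes collapsed
print(PureWindowsPath(raw).name)
print(PureWindowsPath(r'C:\sravan\TEST.csv').parts)
```

Using a raw string (or forward slashes) avoids the problem regardless of the folder names.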
In [25]: testdata
Out[25]:
0 Andhra Pradesh Rajamahendravaram 27/2/2019 31 49 16 4 10 0 49 0
91 rows × 18 columns
In [26]: testdata=testdata.drop(['STATE','CITY','DATE','REMARK','HEALTH-IMPACT'],axis='columns')
testdata
Out[26]:
PM2.5 PM10 NO NO2 Nox NH3 CO SO2 predicted_air_quality_index O3 Benzene Toulene Xylene
0 31 49 16 4 10 0 49 0 287.80 3 0 0 0
1 18 19 10 29 16 44 19 0 287.80 44 0 0 0
2 30 31 12 2 20 17 31 0 439.18 50 0 0 0
3 43 42 11 2 24 19 42 0 446.16 57 0 0 0
4 31 31 12 2 20 17 31 0 436.68 49 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
86 90 0 22 6 8 23 0 0 252.76 67 0 0 0
87 89 0 67 0 0 23 0 0 130.52 45 0 0 0
88 88 0 45 4 5 35 0 0 241.26 67 0 0 0
90 330 0 41 0 6 86 0 0 160.54 52 0 0 0
91 rows × 13 columns
In [28]: target1=traindata['air_quality_index']
traindata3=traindata2.drop(['air_quality_index'],axis='columns')
traindata3
Out[28]:
PM2.5 PM10 NO NO2 Nox NH3 CO SO2 O3 Benzene Toluene Xylene
1969 81.40 124.50 1.44 20.50 12.08 10.72 0.12 15.24 127.09 0.20 6.50 0.06
1970 78.32 129.06 1.26 26.00 14.85 10.28 0.14 26.96 117.44 0.22 7.95 0.08
1971 88.76 135.32 6.60 30.85 21.77 12.91 0.11 33.59 111.81 0.29 7.63 0.12
1972 64.18 104.09 2.56 28.07 17.01 11.42 0.09 19.00 138.18 0.17 5.02 0.07
1973 72.47 114.84 5.23 23.20 16.59 12.25 0.16 10.55 109.74 0.21 4.71 0.08
... ... ... ... ... ... ... ... ... ... ... ... ...
24018 19.03 50.03 77.24 14.17 57.37 11.30 0.43 9.83 23.31 0.66 3.22 0.16
24019 12.37 39.29 66.20 11.68 58.88 11.30 0.39 8.63 31.79 0.55 3.05 0.14
24020 15.21 41.96 79.67 13.50 69.42 10.13 0.42 9.37 33.08 0.69 1.24 0.73
24021 30.93 60.26 69.32 14.46 61.62 10.08 0.52 11.96 41.62 1.67 1.82 2.62
24022 29.26 76.89 75.87 11.84 65.66 12.02 0.52 7.86 35.56 2.28 1.93 2.75
In [29]: testing=RandomForestRegressor(n_estimators=50)
In [30]: testing.fit(traindata3,target1)
In [121]: res=testing.predict(testdata)
res
In [31]: testing.score(traindata3,target1)
Out[31]: 0.9924580809809389
In [32]: res=pd.DataFrame(res)
In [33]: res
Out[33]:
0
0 374.12
1 156.00
2 266.72
3 174.98
4 79.00
... ...
3247 41.02
3248 49.00
3249 46.00
3250 140.00
3251 247.98
In [36]: testdata
Out[36]:
PM2.5 PM10 NO NO2 Nox NH3 CO SO2 predicted_air_quality_index O3 Benzene Toulene Xylene
0 31 49 16 4 10 0 49 0 374.12 3 0 0 0
1 18 19 10 29 16 44 19 0 156.00 44 0 0 0
2 30 31 12 2 20 17 31 0 266.72 50 0 0 0
3 43 42 11 2 24 19 42 0 174.98 57 0 0 0
4 31 31 12 2 20 17 31 0 79.00 49 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
86 90 0 22 6 8 23 0 0 45.00 67 0 0 0
87 89 0 67 0 0 23 0 0 222.02 45 0 0 0
88 88 0 45 4 5 35 0 0 69.00 67 0 0 0
90 330 0 41 0 6 86 0 0 137.00 52 0 0 0
91 rows × 13 columns
In [37]: traindata3
FINAL RESULT:
We cluster the data on AQI versus COVID and group it into 5 clusters:
good, satisfactory, moderate, poor and very poor.
good=cluster(0), moderate=cluster(1), satisfactory=cluster(2), poor=cluster(3),
very poor=cluster(4)
Data:
We used the K-Means clustering algorithm to cluster the data and a scatter plot to
visualize it.
In [3]: import pandas as pd
In [43]: data
Out[43]:
    STATE           CITY        DATE       PM2.5-AVG  PM10-AVG  NO2-AVG  NH3-AVG  SO2-AG  CO  OZONE-AVG
0   Andhra Pradesh  amaravathi  1/1/2019   190        131       107      4        42      0   63
1   Andhra Pradesh  amaravathi  1/2/2019   188        131       110      4        40      0   62
2   Andhra Pradesh  amaravathi  1/3/2019   280        174       155      2        37      0   52
3   Andhra Pradesh  amaravathi  1/4/2019   302        181       144      2        39      0   78         traf
4   Andhra Pradesh  amaravathi  1/6/2019   285        160       121      3        19      0   71
... ...             ...         ...        ...        ...       ...      ...      ...     ... ...
87  Delhi           Delhi       4/1/2020   43         0         76       4        0       0   76         traf
88  Delhi           Delhi       23/1/2020  111        0         46       7        0       0   78         traf
89  Delhi           Delhi       25/1/2020  89         0         67       0        0       23  45         traf
90  Delhi           Delhi       26/1/2020  88         0         45       4        5       35  67         traf
91 rows × 15 columns
In [46]: target
Out[46]: 0 190
1 188
2 280
3 302
4 285
...
86 123
87 43
88 111
89 89
90 88
Name: AIR_QUALITY_INDEX, Length: 91, dtype: int64
In [47]: inputs
Out[47]:
    STATE           CITY        DATE       PM2.5-AVG  PM10-AVG  NO2-AVG  NH3-AVG  SO2-AG  CO  OZONE-AVG
0   Andhra Pradesh  amaravathi  1/1/2019   190        131       107      4        42      0   63
1   Andhra Pradesh  amaravathi  1/2/2019   188        131       110      4        40      0   62
2   Andhra Pradesh  amaravathi  1/3/2019   280        174       155      2        37      0   52
3   Andhra Pradesh  amaravathi  1/4/2019   302        181       144      2        39      0   78         traf
4   Andhra Pradesh  amaravathi  1/6/2019   285        160       121      3        19      0   71
... ...             ...         ...        ...        ...       ...      ...      ...     ... ...
87  Delhi           Delhi       4/1/2020   43         0         76       4        0       0   76         traf
88  Delhi           Delhi       23/1/2020  111        0         46       7        0       0   78         traf
89  Delhi           Delhi       25/1/2020  89         0         67       0        0       23  45         traf
90  Delhi           Delhi       26/1/2020  88         0         45       4        5       35  67         traf
91 rows × 14 columns
In [ ]:
In [48]: from sklearn.preprocessing import LabelEncoder
#converting the categorical COVID column to numeric labels using LabelEncoder
In [49]: le_fever = LabelEncoder()
inputs['covid'] = le_fever.fit_transform(inputs['COVID'])
In [50]: inputs
Out[50]:
    STATE           CITY        DATE       PM2.5-AVG  PM10-AVG  NO2-AVG  NH3-AVG  SO2-AG  CO  OZONE-AVG
0   Andhra Pradesh  amaravathi  1/1/2019   190        131       107      4        42      0   63
1   Andhra Pradesh  amaravathi  1/2/2019   188        131       110      4        40      0   62
2   Andhra Pradesh  amaravathi  1/3/2019   280        174       155      2        37      0   52
3   Andhra Pradesh  amaravathi  1/4/2019   302        181       144      2        39      0   78         traf
4   Andhra Pradesh  amaravathi  1/6/2019   285        160       121      3        19      0   71
... ...             ...         ...        ...        ...       ...      ...      ...     ... ...
87  Delhi           Delhi       4/1/2020   43         0         76       4        0       0   76         traf
88  Delhi           Delhi       23/1/2020  111        0         46       7        0       0   78         traf
89  Delhi           Delhi       25/1/2020  89         0         67       0        0       23  45         traf
90  Delhi           Delhi       26/1/2020  88         0         45       4        5       35  67         traf
91 rows × 15 columns
In [51]: target
Out[51]: 0 190
1 188
2 280
3 302
4 285
...
86 123
87 43
88 111
89 89
90 88
Name: AIR_QUALITY_INDEX, Length: 91, dtype: int64
Out[52]:
    PM2.5-AVG  PM10-AVG  NO2-AVG  NH3-AVG  SO2-AG  CO  OZONE-AVG  covid
0   190        131       107      4        42      0   63         0
1   188        131       110      4        40      0   62         0
2   280        174       155      2        37      0   52         0
3   302        181       144      2        39      0   78         0
4   285        160       121      3        19      0   71         0
86  123        0         56       6        0       0   56         1
87  43         0         76       4        0       0   76         1
88  111        0         46       7        0       0   78         1
89  89         0         67       0        0       23  45         1
90  88         0         45       4        5       35  67         1
91 rows × 8 columns
In [54]: thentarget
Out[54]: 0 0
1 0
2 0
3 0
4 0
..
86 1
87 1
88 1
89 1
90 1
Name: covid, Length: 91, dtype: int32
In [56]: svm.fit(thenres,thentarget)
In [57]: svm.score(thenres,thentarget)
Out[57]: 1.0
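Assuming `svm` is a scikit-learn classifier (its definition is not shown in the extracted notebook), `score` returns the mean accuracy: the fraction of rows whose predicted label equals the true label. Because the model is scored on the same rows it was fitted on, a value of 1.0 describes the fit to the training data only. A minimal sketch of the accuracy computation, with hypothetical predictions:

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)


labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]      # covid flags of the ten rows displayed above
predicted = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]   # hypothetical predictions
print(accuracy(labels, predicted))
```

Scoring on a held-out subset would give a more realistic estimate of how the classifier generalizes.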
Out[63]:
    PM2.5-AVG  PM10-AVG  NO2-AVG  NH3-AVG  SO2-AG  CO  OZONE-AVG  covid
0   190        131       107      4        42      0   63         0
1   188        131       110      4        40      0   62         0
2   280        174       155      2        37      0   52         0
3   302        181       144      2        39      0   78         0
4   285        160       121      3        19      0   71         0
86  123        0         56       6        0       0   56         1
87  43         0         76       4        0       0   76         1
88  111        0         46       7        0       0   78         1
89  89         0         67       0        0       23  45         1
90  88         0         45       4        5       35  67         1
91 rows × 8 columns
Out[64]:
    PM2.5-AVG  PM10-AVG  NO2-AVG  NH3-AVG  SO2-AG  CO  OZONE-AVG  covid  air_quality_index
0   190        131       107      4        42      0   63         0      190
1   188        131       110      4        40      0   62         0      188
2   280        174       155      2        37      0   52         0      280
3   302        181       144      2        39      0   78         0      302
4   285        160       121      3        19      0   71         0      285
... ...        ...       ...      ...      ...     ... ...        ...    ...
86  123        0         56       6        0       0   56         1      123
87  43         0         76       4        0       0   76         1      43
88  111        0         46       7        0       0   78         1      111
89  89         0         67       0        0       23  45         1      89
90  88         0         45       4        5       35  67         1      88
91 rows × 9 columns
In [67]: km=KMeans(n_clusters=5)
km
#dividing into 5 clusters
Out[68]: array([1, 1, 3, 3, 3, 3, 3, 4, 4, 4, 0, 0, 0, 0, 4, 4, 0, 0, 0, 0, 0, 4,
0, 4, 4, 4, 4, 4, 4, 0, 0, 4, 0, 4, 4, 1, 1, 4, 0, 0, 0, 0, 0, 0,
4, 1, 0, 0, 4, 1, 3, 1, 3, 4, 4, 4, 4, 1, 1, 0, 0, 4, 1, 4, 0, 0,
0, 0, 3, 0, 2, 3, 1, 0, 3, 4, 0, 0, 0, 0, 0, 4, 0, 4, 0, 4, 4, 0,
4, 4, 4])
Out[69]:
    PM2.5-AVG  PM10-AVG  NO2-AVG  NH3-AVG  SO2-AG  CO  OZONE-AVG  covid  air_quality_index  grouped_pollutuion
0   190        131       107      4        42      0   63         0      190                1
1   188        131       110      4        40      0   62         0      188                1
2   280        174       155      2        37      0   52         0      280                3
3   302        181       144      2        39      0   78         0      302                3
4   285        160       121      3        19      0   71         0      285                3
... ...        ...       ...      ...      ...     ... ...        ...    ...                ...
86  123        0         56       6        0       0   56         1      123                4
87  43         0         76       4        0       0   76         1      43                 0
88  111        0         46       7        0       0   78         1      111                4
89  89         0         67       0        0       23  45         1      89                 4
90  88         0         45       4        5       35  67         1      88                 4
91 rows × 10 columns
In [76]: df1=thenres[thenres.grouped_pollutuion==0]
df2=thenres[thenres.grouped_pollutuion==1]
df3=thenres[thenres.grouped_pollutuion==2]
df4=thenres[thenres.grouped_pollutuion==3]
df5=thenres[thenres.grouped_pollutuion==4]
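K-Means cluster ids are arbitrary: id 0 is not guaranteed to be the cleanest group, and the ids can change between runs. One way to attach ordered names such as good or very poor is to rank the clusters by their mean AQI. The sketch below does this for the ten rows displayed above; the helper name `rank_clusters_by_mean` is hypothetical, and the ranking on the full 91 rows may differ.

```python
from collections import defaultdict


def rank_clusters_by_mean(values, cluster_ids):
    """Return {cluster_id: rank}; rank 0 is the cluster with the lowest mean value."""
    groups = defaultdict(list)
    for value, cid in zip(values, cluster_ids):
        groups[cid].append(value)
    means = {cid: sum(vs) / len(vs) for cid, vs in groups.items()}
    ordered = sorted(means, key=means.get)
    return {cid: rank for rank, cid in enumerate(ordered)}


# air_quality_index and grouped_pollutuion for rows 0-4 and 86-90 shown above
aqi = [190, 188, 280, 302, 285, 123, 43, 111, 89, 88]
clusters = [1, 1, 3, 3, 3, 4, 0, 4, 4, 4]
ranks = rank_clusters_by_mean(aqi, clusters)
print(ranks)
```

The resulting ranks can then be mapped to the category names in increasing order of pollution.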
RESULT:
Marking the effect of pollution and disease as per the central government
standards (category prediction)
Train data:
Test data:
Predicted Result:
In [51]: import pandas as pd
from matplotlib import pyplot as plt
In [137]: #loading the train data set of air quality (90 instances)
traindata=pd.read_csv(r'C:\sravan\internship\airpollution_effect_cause_traindata.csv')
traindata
Out[137]:
    STATE           CITY        DATE      PM2.5-AVG  PM10-AVG  NO2-AVG  NH3-AVG  SO2-AG  CO  OZONE-AVG
0   Andhra Pradesh  amaravathi  1/1/2019  190        131.0     107      4        42      0   63
1   Andhra Pradesh  amaravathi  1/2/2019  188        131.0     110      4        40      0   62
2   Andhra Pradesh  amaravathi  1/3/2019  280        174.0     155      2        37      0   52
3   Andhra Pradesh  amaravathi  1/4/2019  302        181.0     144      2        39      0   78         traf
4   Andhra Pradesh  amaravathi  1/6/2019  285        160.0     121      3        19      0   71
... ...             ...         ...       ...        ...       ...      ...      ...     ... ...
91 rows × 15 columns
In [138]: #loading the test data set of air quality (19 instances)
testdata=pd.read_csv(r'C:\sravan\internship\airpollution_effect_cause_testdata.csv')
testdata
Out[138]:
    STATE           CITY               DATE      PM2.5-AVG  PM10-AVG  NO2-AVG  NH3-AVG  SO2-AVG  CO-AVG  OZONE-AVG
10  Andhra pradesh  Amaravati          4/1/2020  64         69        6        2        32       18      34
11  Andhra pradesh  Amaravati          4/2/2020  48         57        6        2        27       -       26
12  Andhra pradesh  Amaravati          4/3/2020  50         59        5        2        28       -       17
13  Andhra pradesh  Rajamahendravaram  4/4/2020  56         56        9        2        10       28      37
14  Andhra pradesh  Rajamahendravaram  4/5/2020  43         48        8        2        9        27      33
15  Andhra pradesh  Rajamahendravaram  4/6/2020  34         40        7        2        9        27      17
16  Andhra pradesh  Tirupati           4/7/2020  35         38        7        1        8        26      27
In [139]: #scatter plot showing the air quality index against its pollution remark
plt.scatter(traindata['AIR_QUALITY_INDEX'],traindata['REMARK'])
plt.title('POLLUTION REMARK')
plt.xlabel('AIR_QUALITY_INDEX')
plt.ylabel('POLLUTION REMARK')
In [140]: #goal: based on the air pollution, predict which level of pollution you will be affected by.
In [141]: #we are using a classification technique for this
In [142]: train_dataset=traindata.drop(['HEALTH-IMPACT','SO2-AG','CO','DATE','CITY','STATE','PLACE','COVID','PM2.5-AVG','PM10-AVG','NO2-AVG','NH3-AVG','OZONE-AVG'],axis='columns')
#test_dataset=testdata.drop(['HEALTH-IMPACT','SO2-AG','CO','DATE','CITY','STATE','PLACE','COVID','PM2.5-AVG','PM10-AVG','NO2-AVG','NH3-AVG','OZONE-AVG'],axis='columns')
Out[143]:
    AIR_QUALITY_INDEX  REMARK
0   190                moderate
1   188                moderate
2   280                poor
4   285                poor
86  123                moderate
87  43                 good
88  111                moderate
89  89                 satisfactory
90  88                 satisfactory
91 rows × 2 columns
In [ ]:
Out[147]:
    AIR_QUALITY_INDEX  pollution_effect_category
0   190                1
1   188                1
2   280                2
3   302                4
4   285                2
86  123                1
87  43                 0
88  111                1
89  89                 3
90  88                 3
91 rows × 2 columns
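The category codes above are consistent with an alphabetical encoding of the remark labels, which is what scikit-learn's `LabelEncoder` produces. The cell that created `pollution_effect_category` is not shown, so this is an inference; the helper `fit_label_codes` is hypothetical and the spelling "very poor" is an assumption. One consequence is that the codes are not ordered by severity: satisfactory (3) sorts after poor (2).

```python
def fit_label_codes(labels):
    """Assign integer codes to labels in sorted order, as LabelEncoder does."""
    return {label: code for code, label in enumerate(sorted(set(labels)))}


# remark labels; "very poor" is assumed for the row with AQI 302
remarks = ["moderate", "moderate", "poor", "very poor", "poor", "good", "satisfactory"]
codes = fit_label_codes(remarks)
encoded = [codes[r] for r in remarks]
print(codes)
print(encoded)
```

If an ordered scale is needed, an explicit mapping from remark to rank is safer than relying on alphabetical codes.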
In [136]:
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-136-e8af53e925a5> in <module>
----> 1 train_dataset3 = train_dataset1.drop(['pollution_effect_category'],axis='columns')

~\anaconda3\lib\site-packages\pandas\core\frame.py in drop(self, labels, axis, index, columns, level, inplace, errors)
   3995             level=level,
   3996             inplace=inplace,
-> 3997             errors=errors,
   3998         )
   3999

~\anaconda3\lib\site-packages\pandas\core\generic.py in drop(self, labels, axis, index, columns, level, inplace, errors)
   3934         for axis, labels in axes.items():
   3935             if labels is not None:
-> 3936                 obj = obj._drop_axis(labels, axis, level=level, errors=errors)
   3937
   3938         if inplace:

~\anaconda3\lib\site-packages\pandas\core\generic.py in _drop_axis(self, labels, axis, level, errors)
   3968                 new_axis = axis.drop(labels, level=level, errors=errors)
   3969             else:
-> 3970                 new_axis = axis.drop(labels, errors=errors)
   3971             result = self.reindex(**{axis_name: new_axis})
   3972

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in drop(self, labels, errors)
   5016         if mask.any():
   5017             if errors != "ignore":
-> 5018                 raise KeyError(f"{labels[mask]} not found in axis")
   5019             indexer = indexer[~mask]
   5020         return self.delete(indexer)
In [125]: #here I am using a decision tree classifier for classifying the pollution remark.
from sklearn import tree
In [126]: from sklearn.ensemble import RandomForestClassifier
In [127]: ram=RandomForestClassifier(n_estimators=100)
In [128]: ram.fit(train_dataset1,target_train_dataset1)
Out[128]: RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
In [148]: testing=testdata.drop(['HEALTH-IMPACT','SO2-AVG','CO-AVG','DATE','CITY','STATE','PLACE','PM2.5-AVG','PM10-AVG','NO2-AVG','NH3-AVG','OZONE-AVG'],axis='columns')
In [152]: target_train_dataset=train_dataset['pollution_effect_category']
target_train_dataset
Out[152]: 0 1
1 1
2 2
3 4
4 2
..
86 1
87 0
88 1
89 3
90 3
Name: pollution_effect_category, Length: 91, dtype: int32
In [155]: train_dataset1=train_dataset.drop(['pollution_effect_category','REMARK'],axis='columns')
train_dataset1
Out[155]:
    AIR_QUALITY_INDEX
0   190
1   188
2   280
3   302
4   285
... ...
86  123
87  43
88  111
89  89
90  88
91 rows × 1 columns
In [97]:
In [98]: testing
Out[98]:
AIR_QUALITY_INDEX
0 110
1 117
2 73
3 65
4 68
5 61
6 55
7 43
8 58
9 40
10 69
11 57
12 59
13 56
14 48
15 40
16 38
17 63
18 37
19 71
In [ ]:
In [103]: testing
Out[103]:
AIR_QUALITY_INDEX
0 110
1 117
2 73
3 65
4 68
5 61
6 55
7 43
8 58
9 40
10 69
11 57
12 59
13 56
14 48
METHODOLOGY II
So far we have seen results on air pollution based on attributes such as AQI and COVID,
before and during COVID.
In this methodology we want to predict air pollution and deaths of people (after
COVID).
We use Tableau to predict the pollution and death rate for the coming years by considering each
attribute in the city.csv file. Let us first recap the dataset.
This dataset contains data from the year 2015 to May 2020 (the time of writing).
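Tableau's built-in forecasting is based on exponential smoothing. Which model Tableau selected for this data is not stated, so the sketch below shows only the simplest variant (no trend, no seasonality) to illustrate the idea. The yearly AQI values and the smoothing factor `alpha` are hypothetical and are not taken from city.csv.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the one-step-ahead forecast after smoothing the whole series."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]
    for value in series[1:]:
        # new level = weighted mix of the latest observation and the old level
        level = alpha * value + (1 - alpha) * level
    return level


yearly_aqi = [120.0, 110.0, 105.0, 98.0, 90.0]   # hypothetical yearly averages
forecast = simple_exponential_smoothing(yearly_aqi, alpha=0.5)
print(forecast)
```

A larger `alpha` makes the forecast follow recent years more closely; a smaller one weights the long-run history more.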
Fig 8.2.2: Effect of each chemical pollutant on the environment and its predicted level up to
2024
Sum: 51,465
Average: 10,293
Minimum: 4,956
Maximum: 19,768
Median: 9,281
Skewness: 0.70
SUM (NH3)
Sum: 358,869
Average: 71,774
Minimum: 44,766
Maximum: 107,020
Median: 62,112
Skewness: 0.33
SUM (NO)
Sum: 362,816
Average: 72,563
Minimum: 38,347
Maximum: 111,688
Median: 58,267
Skewness: 0.29
SUM (Toluene)
Sum: 142,619
Average: 28,524
Minimum: 12,710
Maximum: 52,022
Median: 16,467
Skewness: 0.43
SUM (Xylene)
Sum: 68,693
Average: 6,869
Minimum: 720
Maximum: 10,626
Median: 8,219
Standard deviation: 3,375
Skewness: -0.80
Sum: 8,421,167
Average: 842,116.70
Minimum: 386,337
Maximum: 1,050,165
Median: 984,997.00
Skewness: -0.94
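The summary figures above (sum, average, minimum, maximum, median, skewness) can be recomputed from the underlying yearly totals. The sketch below uses the moment-based skewness coefficient; Tableau's exact skewness formula may differ (for example by a sample-size correction), so small discrepancies are possible. The input values are hypothetical.

```python
import statistics


def summarize(values):
    """Summary statistics in the style of the Tableau summary card."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    skewness = m3 / m2 ** 1.5 if m2 else 0.0
    return {
        "sum": sum(values),
        "average": mean,
        "minimum": min(values),
        "maximum": max(values),
        "median": statistics.median(values),
        "skewness": skewness,
    }


stats = summarize([1, 2, 3, 4, 10])   # hypothetical values with a long right tail
print(stats)
```

A positive skewness, as in most of the pollutant summaries above, means a few large values pull the average above the median.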
Mostly we got satisfactory results, i.e. a pollution range above 50 but below 100.
Fig 8.4: Predicting industry and air pollution, 2020-2024
We found that the results are mostly satisfactory for the next four years.
Similarly, the majority of the given cities come out as satisfactory for the next four years.
Year =2020
Year =2034
Finally, considering all the factors, the prediction for the next four years is
“Satisfactory” (pollution range 50-100).
The data is taken from the Central Government of India. Ensemble regression
techniques such as Random Forest and Bagging are used. The data is analyzed using the
Tableau tool. The prediction results are approximately correct. There is no plagiarism in
the code or the analysis.