
JAYPEE INSTITUTE OF INFORMATION TECHNOLOGY, Noida

Bachelor of Technology, 9th Semester


2020-2021
Department of CSE & IT
RMIPR Project Report

Submitted By: Kumar Parth (17803017), B-13 (DD)
SPECULATING DAILY MAXIMUM
CARBON MONOXIDE (CO) LEVEL


Objective

Considering the increasing pollution levels in the city and their harmful effects on children's health, this
study aims to predict carbon monoxide (CO) levels from the available sensor readings. CO levels between
2 ppm and 9 ppm are considered tolerable.

Forecasting Description

The goal is to forecast the daily maximum carbon monoxide (CO) level for the next week (5th April 2005 to
11th April 2005) using data on various air pollutants, including CO, from 10th March 2004 to 4th April 2005.

Data Description

The dataset contains 9358 instances of hourly averaged responses from an array of 5 metal oxide chemical
sensors embedded in an Air Quality Chemical Multisensor Device. Data were recorded from 10th March
2004 to 4th April 2005 (roughly one year). Ground-truth hourly averaged concentrations for CO, Non-Methane
Hydrocarbons (NMHC), Benzene, Total Nitrogen Oxides (NOx) and Nitrogen Dioxide (NO2) were provided by
a co-located certified reference analyzer.

Source: UCI Machine Learning Repository, Air Quality data set
(http://archive.ics.uci.edu/ml/datasets/Air+Quality#)

Attribute Information

0 Date (DD/MM/YYYY)
1 Time (HH.MM.SS)
2 True hourly averaged concentration CO in mg/m^3 (reference analyzer)
3 PT08.S1 (tin oxide) hourly averaged sensor response (nominally CO targeted)
4 True hourly averaged overall Non-Methane Hydrocarbons (NMHC) concentration in micro g/m^3 (reference analyzer)
5 True hourly averaged Benzene concentration in micro g/m^3 (reference analyzer)
6 PT08.S2 (titania) hourly averaged sensor response (nominally NMHC targeted)
7 True hourly averaged NOx concentration in ppb (reference analyzer)
8 PT08.S3 (tungsten oxide) hourly averaged sensor response (nominally NOx targeted)

9 True hourly averaged NO2 concentration in micro g/m^3 (reference analyzer)
10 PT08.S4 (tungsten oxide) hourly averaged sensor response (nominally NO2 targeted)
11 PT08.S5 (indium oxide) hourly averaged sensor response (nominally O3 targeted)
12 Temperature in °C
13 Relative Humidity (%)
14 AH Absolute Humidity

Key Characteristics

Missing values are encoded in the data as “-200”. The data show monthly seasonality and also vary with the
day of the week, which could be due to the varying number of automobiles (emitting air pollutants) on
weekdays and weekends.

Variables in the Data Set

1. Y variable -> CO(GT)
2. Possible X variables -> PT08.S1(CO), NMHC(GT), C6H6(GT), PT08.S2(NMHC), NOx(GT),
PT08.S3(NOx), NO2(GT), PT08.S4(NO2), PT08.S5(O3), T, RH and AH
3. X variable NMHC(GT) had more than 90% missing values, so it was excluded from the set of possible X variables
4. All other variables had less than 10% missing values
5. Missing values were replaced by the previous hour's value; runs of consecutive missing values were
replaced by the values from the same hours of the previous week
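
The replacement rule for missing values can be sketched as follows. This is a minimal illustration, not the code used for the report: the function name `impute`, the list-based input, and the fallback to the previous hour when no reading from one week earlier exists are all assumptions.

```python
MISSING = -200            # sentinel the data set uses for a missing reading
HOURS_PER_WEEK = 24 * 7


def impute(series):
    """Fill -200 readings in an hourly series (a list ordered by time).

    An isolated gap takes the previous hour's value. A gap that is part of a
    run of consecutive missing readings takes the value from the same hour
    one week earlier. If no such value exists (assumed fallback), the
    previous hour's value is used instead.
    """
    filled = list(series)
    n = len(series)
    for i, value in enumerate(series):
        if value != MISSING:
            continue
        prev_missing = i > 0 and series[i - 1] == MISSING
        next_missing = i + 1 < n and series[i + 1] == MISSING
        in_run = prev_missing or next_missing
        week_ago = i - HOURS_PER_WEEK
        if in_run and week_ago >= 0 and filled[week_ago] != MISSING:
            filled[i] = filled[week_ago]
        elif i > 0 and filled[i - 1] != MISSING:
            filled[i] = filled[i - 1]
        # Otherwise the reading stays missing: there is nothing to copy from.
    return filled
```

Each sensor column would be passed through such a function separately, after sorting the rows by date and time.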

Plot of CO vs Time

X-axis: day of the year (e.g., day 1 is 5th April 2004)
Y-axis: concentration of CO in ppm

This suggests a seasonality of CO with respect to the day of the year. To compensate for it, we introduce a
dummy variable:
X4 = 1 if the day of the year is between 200 and 300
   = 0 otherwise

Plot of CO vs Week time

X-axis: day of the week, 1 to 7 (Monday to Sunday)
Y-axis: concentration of CO in ppm
Different colors represent different months of the year.

This suggests a seasonality of CO with respect to the day of the week. To compensate for it, we introduce a
dummy variable:
X5 = 1 if the day is Monday, Tuesday, Saturday or Sunday
   = 0 otherwise
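
The two dummy variables can be written as small functions. This is only a sketch: the function names are hypothetical, the day index is assumed to follow the numbering used on the X-axis of the first plot, and treating the bounds 200 and 300 as inclusive is an assumption.

```python
from datetime import date


def seasonal_dummy(day_index):
    """X4: 1 if the day index lies between 200 and 300 (inclusive), else 0."""
    return 1 if 200 <= day_index <= 300 else 0


def weekday_dummy(d):
    """X5: 1 for Monday, Tuesday, Saturday and Sunday, else 0."""
    # date.weekday() numbers the days Monday = 0 ... Sunday = 6.
    return 1 if d.weekday() in (0, 1, 5, 6) else 0
```

For example, `weekday_dummy(date(2005, 4, 4))` is evaluated for a Monday, so X5 is 1 for that day.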

Input Variables:

Linear correlation coefficients computed among the analyzed species using on-field recorded data:

Pair         r
NMHC-C6H6    0.98
CO-NOx       0.78
CO-NO2       0.67
C6H6-NOx     0.72
C6H6-NO2     0.60
NOx-NO2      0.76
CO-C6H6      0.90

Regarding the benzene-NMHC coefficient, note that it was computed using only the first 8 days of
measurements, after which the NMHC-targeted analyzer went out of service.
After checking the available variables, we decided that the following variables can affect the CO
levels:

Regressors:
• Daily maximum C6H6 (lag 7)
• Daily maximum T (lag 7)
• Daily maximum AH (lag 7)
• Monthly dummy variables
• Weekly dummy variable
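
A possible way to build the lag-7 daily-maximum regressors from hourly readings is sketched below. The helper names `daily_max` and `lagged` and the `(date, value)` input layout are assumptions made for illustration only.

```python
from collections import defaultdict
from datetime import date, timedelta


def daily_max(hourly):
    """Reduce (date, value) hourly readings to {date: daily maximum}."""
    by_day = defaultdict(list)
    for day, value in hourly:
        by_day[day].append(value)
    return {day: max(values) for day, values in by_day.items()}


def lagged(daily, lag_days=7):
    """Map each date to the daily value observed `lag_days` days earlier."""
    shift = timedelta(days=lag_days)
    return {day + shift: value for day, value in daily.items()}
```

With this layout, `lagged(daily_max(readings))[date(2005, 4, 5)]` would be the daily maximum recorded on 29th March 2005, which is the kind of value a forecast for 5th April can use.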

FAQ:

Q1: Why use lag 7?

Ans: So that the CO concentration can be forecast a week in advance: regressor values observed 7 days
earlier are already known at forecast time.

Q2: Why use T and AH as regressors?

Ans: Temperature (T) and Absolute Humidity (AH) are among the key factors affecting CO concentration in
the atmosphere, according to the literature review and the correlation coefficients.

Multiple Regression Analysis

Y = Xβ + ε (Model)
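
As an illustration of how β in this model can be estimated by ordinary least squares, here is a minimal NumPy sketch. It runs on synthetic stand-in data with made-up coefficients, not on the report's data, and the report does not state which software was used for the fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 365 daily rows and 5 regressors (NOT the report's data).
n, k = 365, 5
regressors = rng.normal(size=(n, k))
true_beta = np.array([2.0, 0.15, -0.05, 0.0, 0.3, 0.2])  # made-up values

X = np.column_stack([np.ones(n), regressors])  # first column is the intercept
y = X @ true_beta + rng.normal(scale=0.1, size=n)

# Least-squares estimate of beta; lstsq avoids forming (X'X)^-1 explicitly.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat
```

At the least-squares solution the residuals are orthogonal to every column of X, which is a quick sanity check on any fit of this form.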

Full analysis:

1. Coefficient table

              Estimate          SE                t-Stat            p-Value
(Intercept)   2.208278182       0.2283410655      9.670963816       8.19E-20
x1            0.1454091155      0.006748314924    21.54747032       1.11E-66
x2            -0.05443110318    0.01089796337     -4.994612418      9.22E-07
x3            -0.01977522209    0.2179249985      -0.09074324759    0.9277472139
x4            0.3087663162      0.1681926407      1.835789693       0.0672156882
x5            0.1594614238      0.1342518955      1.18777782        0.2357061988

2. ANOVA table

              SumSq          DF     MeanSq         F               pValue
Total         1416.818082    364    3.892357369    NaN             NaN
Model         936.0920362    5      187.2184072    139.8122876     5.41E-82
Residual      480.726046     359    1.339069766    NaN             NaN
Lack of fit   458.2127127    352    1.301740661    0.4047461339    0.9826200603
Pure error    22.51333333    7      3.216190476    NaN             NaN

SSres = 480.726045999331, SSreg = 936.092036192449, SSTotal = 1416.81808219178

MSres = 1.34, MSreg = 187.22

R2 = 0.660700232413988, R2_adjusted = 0.655974608910004
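
The summary figures follow from the sums of squares and degrees of freedom in the ANOVA table. The short check below uses the standard definitions (mean square = sum of squares / DF, F = MSreg / MSres, R2 = SSreg / SSTotal); the variable names are chosen here for illustration.

```python
# Values copied from the ANOVA table and the sums of squares above.
ss_total = 1416.81808219178
ss_reg = 936.092036192449
ss_res = 480.726045999331
df_total, df_reg, df_res = 364, 5, 359

ms_reg = ss_reg / df_reg          # mean square of the model
ms_res = ss_res / df_res          # mean square of the residuals
f_stat = ms_reg / ms_res          # overall F statistic

r2 = ss_reg / ss_total
# The adjusted R2 penalises the fit for the number of regressors.
r2_adjusted = 1 - (ss_res / df_res) / (ss_total / df_total)

print(f"MSreg = {ms_reg:.2f}, MSres = {ms_res:.2f}, F = {f_stat:.2f}")
print(f"R2 = {r2:.4f}, R2_adjusted = {r2_adjusted:.4f}")
```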

Residual Analysis:

Normal probability plot of the residuals: This is a graph designed so that the cumulative normal distribution
plots as a straight line. Let t[1] < t[2] < . . . < t[n] be the externally studentized residuals ranked in
increasing order. If we plot t[i] against the cumulative probability P_i = (i - 1/2)/n, i = 1, 2, . . . , n, on the
normal probability plot, the resulting points should lie approximately along a straight line when the errors
are normally distributed.
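
The coordinates of such a plot can be computed with the standard library alone. The sketch below assumes the residuals have already been studentized; the function name `normal_plot_points` is hypothetical.

```python
from statistics import NormalDist


def normal_plot_points(residuals):
    """Return (sorted residual, P_i, standard normal quantile) triples.

    P_i = (i - 1/2) / n for the i-th smallest residual, i = 1, ..., n.
    """
    ordered = sorted(residuals)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    points = []
    for i, t in enumerate(ordered, start=1):
        p = (i - 0.5) / n
        points.append((t, p, std_normal.inv_cdf(p)))
    return points
```

Plotting the sorted residuals against the normal quantiles gives the same picture as plotting them against P_i on normal probability paper: roughly normal residuals fall near a straight line.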

Plot of residuals against the fitted values ŷ_i: A plot of the residuals (preferably the externally
studentized residuals, t_i) versus the corresponding fitted values ŷ_i is useful for detecting several common
types of model inadequacies.

Conclusions:

• Y = 2.2 + 0.15 (Max C6H6) - 0.05 (Max T) - 0.02 (Max AH) + 0.31 (Monthly dummy) + 0.16 (Weekly dummy)
• R2_adjusted = 0.656, so the model explains about 66% of the variability in the data
• The normal probability plot of the residuals behaves properly
• The plot of residuals against the fitted values ŷ_i behaves properly too
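
For reference, the fitted equation can be applied as a small function. This is a sketch: the coefficient values are copied from the coefficient table, the mapping of x1 to x5 onto the named regressors follows the equation in the conclusions, and the function and dictionary names are hypothetical.

```python
# Coefficients copied from the coefficient table (full precision).
COEF = {
    "intercept": 2.208278182,
    "max_c6h6": 0.1454091155,       # x1: daily maximum C6H6 (lag 7)
    "max_t": -0.05443110318,        # x2: daily maximum T (lag 7)
    "max_ah": -0.01977522209,       # x3: daily maximum AH (lag 7)
    "monthly_dummy": 0.3087663162,  # x4
    "weekly_dummy": 0.1594614238,   # x5
}


def predict_max_co(max_c6h6, max_t, max_ah, monthly_dummy, weekly_dummy):
    """Predicted daily maximum CO from the lag-7 regressors and the dummies."""
    return (
        COEF["intercept"]
        + COEF["max_c6h6"] * max_c6h6
        + COEF["max_t"] * max_t
        + COEF["max_ah"] * max_ah
        + COEF["monthly_dummy"] * monthly_dummy
        + COEF["weekly_dummy"] * weekly_dummy
    )
```

A call such as `predict_max_co(10.0, 20.0, 1.0, 0, 1)` uses purely illustrative inputs, not values taken from the data set.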

References:

S. De Vito, E. Massera, M. Piga, L. Martinotto, G. Di Francia, "On field calibration of an electronic nose
for benzene estimation in an urban pollution monitoring scenario"

UCI Machine Learning Repository, Air Quality data set: https://archive.ics.uci.edu/ml/datasets/Air+Quality
