
Submitted by

Muthukumar
Neelamegam
K11834456

Submitted at
Institute of
Telecooperation

Supervisor
Assoc. Univ.-Prof. DI Dr. Karin Anna Hummel

PREDICTING AIR QUALITY
USING WEATHER FORECASTING
AND MACHINE LEARNING

March, 2022

Master’s Thesis
to obtain the academic degree of

Diplom-Ingenieur
in the Master’s Program

Computer Science

JOHANNES KEPLER
UNIVERSITY LINZ
Altenbergerstraße 69
4040 Linz, Österreich
www.jku.at
DVR 0093696
Abstract

Climate change is one of the most significant challenges for humankind. Climate is influenced by natural phenomena such as solar activity and volcanism, as well as by greenhouse gases and by air pollution caused by industry and vehicles. Climate in turn affects air quality by altering ventilation (e.g., wind speed and circulation), dry deposition, natural emissions, and background concentrations. This work focuses on air quality, air pollutants, meteorological effects, and the prediction of air quality.

Air is essential to human life, and its quality affects every living being on earth. Raising awareness among citizens is key to gaining broad support for efforts to improve air quality. Smartphone apps are therefore a valuable means to inform and warn citizens: for example, a mobile app can alert the user about the current air quality level and quickly disseminate suggestions on how to react and how to improve air quality. Technology continues to evolve to help humans live a sustainable and healthy life. Artificial intelligence is an enabling paradigm that is already used in various domains such as general information systems, military systems, medicine, and manufacturing and production. In environmental research, machine learning may be used to better understand air quality and to predict near-future air quality.

This thesis applies machine learning techniques in a smartphone app that reports the current air quality and predicts the near-future air quality. The prediction is based on weather forecasts, and the concept is evaluated using the classifiers RepTree, KNN, and Random Forest. The WEKA tool is used to train the models and to evaluate them with actual weather and air quality data. The final implementation is exemplified for two locations, Linz (Austria) and New Delhi (India).

The predictions are compared with the actual air quality index over 16 days (Linz, Austria) and 13 days (New Delhi, India). Using the Random Forest algorithm, the overall prediction accuracy is 81% for Austria and 76% for India.

Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.1 Aim of this work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.2 Contribution of this thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3 Structure of this thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2 Fundamentals of air quality and meteorological data . . . . . . . . . . . . . . 13


2.1 Air pollutants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Meteorological data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

3 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.1 Air quality index models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2 Comparative analysis of classification models . . . . . . . . . . . . . . . . 19
3.3 Air quality prediction for industrial areas and smart cities . . . . . . . . 20
3.4 Existing Android apps for monitoring and predicting air quality . . . . . 21

4 Mobile air quality information system . . . . . . . . . . . . . . . . . . . . . . . 24


4.1 Smartphones and their capabilities . . . . . . . . . . . . . . . . . . . . . . 24
4.2 Air Life - Android application . . . . . . . . . . . . . . . . . . . . . . . . . 24

5 Implementation of air quality prediction . . . . . . . . . . . . . . . . . . . . . . 30


5.1 System architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.2 Use of the WEKA library for machine learning . . . . . . . . . . . . . . . 32
5.3 Data collection and analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.4 Training, testing, and validation . . . . . . . . . . . . . . . . . . . . . . . . 39
5.5 Machine learning techniques and tools . . . . . . . . . . . . . . . . . . . . 40

6 Experiments and results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44


6.1 Performance measures of the algorithm . . . . . . . . . . . . . . . . . . . 44
6.2 Experiments and results of the training . . . . . . . . . . . . . . . . . . . 46
6.3 Response time to build a model . . . . . . . . . . . . . . . . . . . . . . . . 53
6.4 Real-time results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
A.1. Raw Data Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
A.2. Air Life Data Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

List of Figures

4.1 Air Quality Index levels of health concern [1]. . . . . . . . . . . . . . . . 25


4.2 Dashboard of Air Life. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.3 Prediction of Air Life. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.4 Air Life app functions and user customization. . . . . . . . . . . . . . . . 27
4.5 Good air quality. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.6 Hazardous air quality. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

5.1 Air Life - System architecture. . . . . . . . . . . . . . . . . . . . . . . . . . 30


5.2 Air quality prediction process using weather forecasts. . . . . . . . . . . 31
5.3 The stages involved in developing the machine learning model. . . . . . 32
5.4 List of countries and their instances in the training data set. . . . . . . . 35
5.5 Number of AQI instances with balanced classes. . . . . . . . . . . . . . . 36
5.6 WEKA - Select attribute filter. . . . . . . . . . . . . . . . . . . . . . . . . . 38

6.1 Comparison of MAE and RMSE using the 10-fold cross-validation method. 47
6.2 Comparison of RAE and RRSE using the 10-fold cross-validation method. 47
6.3 Comparison of MAE and RMSE using the percentage split method. . . . 48
6.4 Comparison of RAE and RRSE using the percentage split method. . . . . 49
6.5 Classifier error using the random forest algorithm. . . . . . . . . . . . . . 50
6.6 The actual and the predicted AQI values are shown for each instance. . . 51
6.7 The actual and the predicted AQI values are shown for each instance. . . 52
6.8 Air quality classes and value range. . . . . . . . . . . . . . . . . . . . . . 55
6.9 Time series of actual vs. predicted air quality for "Linz, Austria". . . . . . 55
6.10 Overall results of a 16-day prediction for "Linz, Austria". . . . . . . . . . 56
6.11 Time series of actual vs. predicted air quality for "New-Delhi, India". . . 57
6.12 Overall results of a 13-day prediction for "New-Delhi, India". . . . . . . . 58
6.13 Confusion matrix (Linz, Austria). . . . . . . . . . . . . . . . . . . . . . . . 59
6.14 Overview confusion matrix for 16-day prediction (Linz, Austria). . . . . 59
6.15 Confusion matrix (New-Delhi, India). . . . . . . . . . . . . . . . . . . . . 60
6.16 Overview confusion matrix for 13-day prediction (New-Delhi, India). . . 60

A.1 A screenshot of the raw data source - "Linz, Austria". . . . . . . . . . . . 64


A.2 A screenshot of the Air Life app’s cached data source. . . . . . . . . . . . 65

List of Tables

2.1 List of air pollutants, their unit of measurements and thresholds. . . . . 14


2.2 List of weather data and its unit of measurements. . . . . . . . . . . . . . 15

4.1 The minimum requirements of the Android device. . . . . . . . . . . . . 29

5.1 Attributes and their datatype. . . . . . . . . . . . . . . . . . . . . . . . . . 33


5.2 List of API parameters to get air quality information. . . . . . . . . . . . 36
5.3 List of API parameters to get forecast weather information. . . . . . . . . 37
5.4 List of algorithms chosen for generating the models. . . . . . . . . . . . . 41

6.1 Performance measures with 10-fold cross validation training, entire data
set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
6.2 Performance measures with split percentage training. . . . . . . . . . . . 48
6.3 Training on the Austrian data set and its performance measures. . . . . . 49
6.4 Training on Indian data set and its performance measures. . . . . . . . . 51
6.5 The time taken to build a model with 15680 instances. . . . . . . . . . . . 53
6.6 The time taken to build a model by using the Austrian data set of 1826
instances. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.7 The time taken to build a model by using the Indian data set of 1734
instances. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.8 Actual vs. Predicted air quality classes of "Linz, Austria". . . . . . . . . 56
6.9 Statistics of actual and predicted (Air Life) AQI. . . . . . . . . . . . . . . 56
6.10 Actual vs. Predicted air quality classes of "New-Delhi, India". . . . . . . 57
6.11 Statistics of predicted (Air Life) and actual AQI. . . . . . . . . . . . . . . 58
6.12 Classification summary of air quality classes (Linz, Austria). . . . . . . . 59
6.13 Classification summary of air quality classes (New-Delhi, India). . . . . 60

Listings
5.1 ARFF file structure with a sample data instance. . . . . . . . . . . . . . . 34

6.1 Sample input instances of the AQI prediction (Linz, Austria). . . . . . . . 54

Acronyms

CO carbon monoxide. 10, 13, 14, 18, 21

CO2 carbon dioxide. 22

NH3 ammonia. 20, 21

NO2 nitrogen dioxide. 10, 13–15, 18–21

O3 ozone. 10, 13, 14, 18, 20, 21

PM10 particulate matter. 10, 13, 14, 18, 20, 21

PM2.5 particulate matter. 10, 13, 14, 18, 20–22

SO2 sulfur dioxide. 10, 13–15, 18, 20, 21

aap ambient (outdoor) air pollution. 13

ANN artificial neural networks. 21

API Application Programming Interface. 32, 36

AQI Air Quality Index. 4, 5, 16, 17, 31, 50–52, 55–58

AQMD air quality monitoring device. 22

ARFF Attribute relation file format. 34

CART classification and regression trees. 19

CPCB central pollution control board. 19, 20

ECMWF European Centre for Medium-Range Weather Forecasts. 17

EEA European Environment Agency. 10

ETEX European tracer experiment. 18

EURAD European air pollution dispersion model. 18

FMI Finnish Meteorological Institute. 17

GFS global forecast system. 18

GPS Global Positioning System. 24, 31

hap household air pollution. 13

HIRLAM high resolution limited area model. 17

HTTP Hypertext Transfer Protocol. 36

IAEA International Atomic Energy Agency. 18

KNN k-nearest neighbors. 2, 41, 46, 48, 50, 53

KSPCB Karnataka state pollution control board. 20

LSTM long short-term memory. 22

MAE Mean absolute error. 4, 45–52

MLP multilayer perceptron. 21, 22

NCEP national centers for environmental prediction. 18

Pb lead. 10, 13, 14, 21

ppb parts per billion. 14

ppm parts per million. 14

RAE Relative absolute error. 4, 45–49, 51

REST Representational State Transfer. 31, 36

RMSE Root mean squared error. 4, 41, 45–52

RRSE Root relative squared error. 4, 45–51

SILAM System for integrated modelling of atmospheric composition. 17, 23

SMOTE Synthetic minority oversampling technique. 35

STP standard temperature and pressure. 18

SVM support vector machines. 21

US EPA United States Environmental Protection Agency. 16

WEKA Waikato Environment for Knowledge Analysis. 2, 4, 19, 31–33, 37–43, 48, 50, 62

WHO World Health Organization. 10, 14

WMO World Meteorological Organisation. 18

XGBR extreme gradient boosting regression. 22

Chapter 1

Introduction

Air quality is heavily dependent on the weather and climate. The change in climate af-
fects air quality by perturbing the ventilation rates (wind speed, mixing depth, convec-
tion, frontal passages), precipitation scavenging, chemical production, dry deposition,
natural emissions, and background concentrations [28]. Air quality is measured in terms of air pollutants and their concentration levels. Air pollution is caused by a complex blend of particles, vapors, and gases emitted from natural and anthropogenic sources, together with photochemical transformation processes [31].

Air pollution causes serious health issues for all human beings. According to the 2018 report of the European Environment Agency (EEA) and Carvalho's viewpoint [13], more than 500,000 people died in Europe in 2015 due to air pollution, and one-sixth of all deaths worldwide are related to air pollution. The World Health Organization (WHO) has also reported that air pollution is a significant environmental risk to health and is estimated to have caused approximately 4.2 million premature deaths worldwide in 2016 [8]. The major air pollutants are carbon monoxide (CO), nitrogen dioxide (NO2), ozone (O3), lead (Pb), particulate matter with a diameter of about 10 micrometers or smaller (PM10), particulate matter with a diameter of about 2.5 micrometers or smaller (PM2.5), and sulfur dioxide (SO2). However, according to the EEA report, particulate matter, NO2, and ozone are the primary causes of air pollution deaths.

1.1 Aim of this work
This work aims to predict air quality using weather forecasts and machine learning.
As part of this work, an Android mobile application has been developed to provide the current air quality and the predicted near-future air quality status. Such a mobile information system is needed to immediately warn people and raise awareness about poor air quality.

The following research questions will be answered in this thesis:

1. How to predict air quality using weather forecasts and machine learning?

2. How to implement air quality forecasting in real time using an Android smartphone application to raise awareness among the people?

To build an air quality information system that provides the needed functionality, the
following steps are taken in this thesis:

• An appropriate data source for meteorological and air pollution information is selected and evaluated.

• A mobile solution is developed to provide ubiquitous and location-based air pollution information.

• A predictive approach to air quality information provisioning is introduced that allows forecasting air quality in the near future.

1.2 Contribution of this thesis


The work makes the following contributions:

1. A data set is collected for training the machine learning models.

2. Machine learning methods are developed and evaluated for air quality predic-
tion.

3. An Android application is developed for mobile air quality information.

4. An overall evaluation of the concept is provided.

1.3 Structure of this thesis
This thesis is structured as follows: In Chapter 2, the fundamentals of air quality and
weather data are described. In Chapter 3, related works are summarized. Details of
mobile app development and screenshots are shown in Chapter 4. Chapter 5 gives
an overall system structure and its implementation. Experiments and results are pre-
sented in Chapter 6. Finally, in Chapter 7, this thesis is concluded.

Chapter 2

Fundamentals of air quality and meteorological data

This chapter gives an overview of air pollutants and meteorological information, including units of measurement and thresholds. First, we step back to look at the causes of these air pollutants and their impact.

2.1 Air pollutants


Air pollutants are transported by atmospheric wind. The lifetime of air pollution particles ranges from a few days to a few weeks, depending on the location and the weather. Air pollution particles are small, typically nanometers to micrometers in size, and cannot be seen with the naked eye. For example, air pollution from eastern China moves within five to seven days to the western part of North America, and air pollution from the eastern United States moves to Europe in less than five days [15].

The dominant sources of air pollution are greenhouse gases from industrial emissions, transportation, the burning of fossil fuels, and waste, along with the depletion of the ozone layer. Additionally, other air pollutants impact human health, including pollens, molds, and smoke from wildfires [31]. The most harmful contaminants identified by the EPA [23] are carbon monoxide (CO), nitrogen dioxide (NO2), ozone (O3), lead (Pb), sulfur dioxide (SO2), and particulate matter PM10 and PM2.5 [36].

In general, air pollution is classified into two classes: household air pollution (hap) and ambient (outdoor) air pollution (aap). People still use coal and firewood for cooking in rural households in regions such as Asia, Africa, and South America. When air pollution particles are inhaled, they enter the lungs and the bloodstream directly, affecting human health in various ways and causing respiratory and heart problems [15]. Diseases caused by air pollution include asthma, heart attacks, chronic bronchitis, emphysema, strokes, and cancer. The resulting illness also depends on a person's health condition, the type of pollutants, and the level of exposure.

Table 2.1 lists the pollutants and their units of measurement, including the thresholds for health issues [7]. These concentrations can also be expressed as mg/m3, parts per million (ppm), or parts per billion (ppb) by volume through a conversion factor [16]. The WHO air quality guidelines define these thresholds; exceeding them may cause severe health issues [46].
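As a worked illustration of such a conversion (assuming the standard molar volume of 24.45 L/mol at 25 °C and 1 atm, values not taken from this thesis): a NO2 concentration of 40 µg/m3 corresponds to roughly 40 × 24.45 / 46 ≈ 21 ppb ≈ 0.02 ppm, where 46 g/mol is the molar mass of NO2.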

Air pollutants               Unit of measurement   Threshold
carbon monoxide (CO)         µg/m3                 10 µg/m3, 8-hour mean
nitrogen dioxide (NO2)       µg/m3                 40 µg/m3, 1-year mean
ozone (O3)                   µg/m3                 120 µg/m3, 8-hour mean
lead (Pb)                    µg/m3                 0.5 µg/m3, 1-year mean
particulate matter (PM10)    µg/m3                 40 µg/m3, 1-year mean
particulate matter (PM2.5)   µg/m3                 20 µg/m3, 1-year mean
sulfur dioxide (SO2)         µg/m3                 125 µg/m3, 24-hour mean

Table 2.1: List of air pollutants, their unit of measurements and thresholds.

2.2 Meteorological data


Meteorology is the study of the atmosphere and its phenomena. More precisely, it provides information about the weather and the environment. The main elements are cloudiness, humidity, precipitation, pressure, sunshine, temperature, and wind. These seven weather elements play an essential role in determining air quality patterns in time and space. This thesis, however, relies on four selected weather elements: humidity, pressure, temperature, and wind.

Wind speed can increase or decrease air pollutant concentrations. For instance, a storm exhibits high wind flow, and the concentration level can change quickly [37]. When the wind flow is low, dispersion is weak and the concentration level can be high [19, 22]. The humidity level also affects the concentration level: high humidity is generally related to high concentrations of particular air pollutants such as CO, PM, and SO2, whereas low humidity is associated with low concentrations of specific air pollutants such as NO2 and SO2 [22]. Table 2.2 shows the units of measurement for the weather data.
measurements for the weather data.

Weather data   Unit of measurement
Temperature    Celsius (°C) / Fahrenheit (°F)
Humidity       g/m3
Pressure       Millibar
Wind           km/h

Table 2.2: List of weather data and its unit of measurements.

Chapter 3

Related work

In recent decades, air pollution has become a pressing issue, and several investigations have been conducted to predict air quality. Many different approaches and models have been introduced: some have been implemented for real-time use, others are designed for ongoing improvement. For this thesis, several models and research works have been reviewed and compared. The basic idea of this thesis is centered on the air quality index (AQI) forecast [5], but with a different approach and predictive model. The following provides an overview and describes relevant approaches of related work.

3.1 Air quality index models


Air quality forecasting, also known as atmospheric dispersion modeling, is used to estimate the concentration levels of air pollutants. It uses mathematical simulation to determine how air pollutants disperse in the atmosphere [11].

3.1.1 Instant-cast and Now-cast


Currently, the "world air quality index" project uses the Instant-Cast approach. Instant-Cast reports the current pollution instead of the previous hour's pollution. This approach converts the raw pollutant values to the air quality index (ranging from 0 to 500) [6]. Later, the project started to use a system called Now-cast, which the United States Environmental Protection Agency (US EPA) uses to convert raw pollutant values into an air quality index averaged over 12 or 24 hours. The approach takes only the PM2.5 data as input for forecasting. The following Now-cast formula is used to calculate the AQI [6]:

AQI = \frac{(PM_{Obs} - PM_{Min}) \times (AQI_{Max} - AQI_{Min})}{PM_{Max} - PM_{Min}} + AQI_{Min} .     (3.1)

PM_{Obs} is the observed 24-hour average concentration in µg/m3. PM_{Max} is the maximum and PM_{Min} is the minimum concentration level observed in the 24-hour period. AQI_{Max} is the maximum and AQI_{Min} is the minimum of the air quality index scale. These formulas can be adapted to different pollutants and continents.
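As a worked illustration with assumed values (they are not taken from the thesis data): suppose the observed average is PM_{Obs} = 35 µg/m3 and the surrounding breakpoints are PM_{Min} = 12.1, PM_{Max} = 35.4, AQI_{Min} = 51, and AQI_{Max} = 100. Then AQI = (35 − 12.1) × (100 − 51) / (35.4 − 12.1) + 51 ≈ 48.2 + 51 ≈ 99, i.e., an index near the upper end of that interval of the 0–500 scale.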

3.1.2 SILAM - System for Integrated Modelling of Atmospheric Composition
SILAM is a global-to-meso-scale dispersion model developed by the Finnish Meteorological Institute (FMI). This model provides atmospheric composition, air quality, and emergency decision support applications, as well as a solution for the inverse dispersion problem [12]. The inverse dispersion problem of atmospheric pollution arises when measurements show the presence of contaminants whose origin is unknown or uncertain; the SILAM model determines the responsible emission source [43]. SILAM can utilize data from various sources, for instance meteorological data from the European Centre for Medium-Range Weather Forecasts (ECMWF) or the high resolution limited area model (HIRLAM) [44] numerical weather prediction models.

Currently, the SILAM model is based on Lagrangian and Monte Carlo dispersion models. The Lagrangian model is used for regulatory purposes and includes a meteorological pre-processor to obtain the required input quantities, an output post-processor, and a set of internal interfaces to handle the observed data [43]. A Monte Carlo model predicts pollutant concentrations even in low-wind conditions; it relies on repeated random sampling to obtain numerical results and to estimate the possible outcomes of an uncertain event [20].

The world air quality index project has used two types of SILAM models for air quality forecasting, one worldwide and one for Asia only. For the air quality forecast, the model takes O3, PM2.5, and PM10 from previous historical data as input and produces the air quality index for the next 24 to 48 hours. The measurement units are converted from ppm to mg based on the US standard temperature and pressure (STP) [5].

Sofiev et al. [42] have evaluated the SILAM model performance against the European tracer experiment (ETEX). ETEX was set up to understand and represent dispersion processes on spatial and temporal scales and to handle large amounts of data; it is mainly used to produce more reliable data for validating atmospheric transport models. It is sponsored by the European Commission, the World Meteorological Organisation (WMO), and the International Atomic Energy Agency (IAEA).

3.1.3 EURAD Model - European Air Pollution Dispersion Model


The EURAD model is a real-time air pollutant forecasting system and one of the more complex model systems. It has mainly focused on providing atmospheric pollutant forecasts for Europe since June 2001 [29]. The system consists of the mesoscale meteorological model MM5, an emission model, and a chemistry transport model. It forecasts atmospheric pollutants, meteorology (MM5), and aerosol particles. MM5 is the 5th-generation mesoscale model used for creating weather forecasts and climate projections. The EURAD model has been developed since 2001 and improved over time with many case studies and analyses.

EURAD processes a massive set of data every day for the forecast. The forecast input is obtained from the global forecast system (GFS), a national centers for environmental prediction (NCEP) weather forecast model that generates data for atmospheric and land-soil variables, including temperature, wind, precipitation, soil moisture, and atmospheric ozone concentration [25]. The output of the EURAD forecasts is the concentration of pollutants for the next 48 hours. The primary forecast air pollutants are CO, NO2, O3, PM10, and SO2. The results of the forecasts can be found at http://www.uni-koeln.de/math-nat-fak/geomet/eurad/index_e.html.

The following sections discuss further recent works reviewed to understand existing approaches to air quality prediction and monitoring.

3.2 Comparative analysis of classification models
Singh et al. [41] created a machine learning model to predict air quality. They use the Java-based WEKA machine learning tool and supervised machine learning techniques such as C5.0 and Rpart. The work presented by Singh et al. is close to the approach taken in this thesis.

C4.5 is a well-known decision tree-based algorithm. The advanced and improved ver-
sion is the C5.0 algorithm. It generates several decision trees and combines them to
improve the prediction. It defines a rule to split the data at a node into classes to min-
imize entropy. It supports sampling and cross-validation. Rpart stands for recursive
partitioning, and the algorithm implements classification and regression to generate
the decision tree. It uses a computational metric to determine the best rule that splits
the data, at that node, into purer classes. It considers all of the predictor variables and
selects the best predictor to distinguish the classes. The algorithm offers the Entropy
or Gini Index method to construct the tree and define the rules.

Both Entropy and the Gini Index are used to calculate the information gain. A node containing multiple classes is impure, whereas a node containing only one class is pure. The Gini Index measures how often an instance would be misclassified in the data set. The Gini Index takes values in the interval [0, 0.5], whereas the interval of the Entropy is [0, 1]. Entropy is more expensive to compute as it uses logarithms, so calculating the Gini Index is faster [24].

Entropy = \sum_{k=1}^{K} -P(C_k) \log_2 P(C_k) .     (3.2)

Equation 3.2 states the Entropy equation, where P(Ck ) is the probability of class Ck in
a node [39].

Gini = 1 - \sum_{k=1}^{K} P(C_k)^2 .     (3.3)

The above formula is used to calculate the Gini Index value, where P(Ck ) is the prob-
ability of class Ck in a node [39].
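As a small worked example with assumed class probabilities: for a node containing two classes with P(C_1) = 0.8 and P(C_2) = 0.2, Entropy = −0.8 log_2 0.8 − 0.2 log_2 0.2 ≈ 0.72 and Gini = 1 − (0.8^2 + 0.2^2) = 0.32, whereas a pure node with a single class yields 0 for both measures.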

The data set of 2019 from the Delhi-ITO region’s central pollution control board (CPCB)
has been used to train the classifiers [17]. The data set contains the five pollutants: NO2, NH3, PM10, PM2.5, and SO2. The input data are pollutants, and the output data is an
air quality index class, see Figure 4.1. Based on the C5.0 algorithm, PM2.5 has been
chosen as the best attribute for the decision tree split [41]. The accuracy obtained from
the C5.0 algorithm is 99.45%, with a kappa value of 99.27% and with an error rate of
0.55%.

The kappa value is a metric that compares an observed accuracy with an expected accuracy [45]. The kappa value is interpreted as follows:

• slight agreement if 0.01-0.20

• fair agreement if 0.21-0.40

• moderate agreement if 0.41-0.60

• substantial agreement if 0.61-0.80

• (almost) perfect agreement if 0.81-1.

The second approach, using the Rpart algorithm, which uses the Gini index method,
gives an accuracy of 96.71% with a kappa value of 95.61% and an error rate of 3.29%.
They have concluded that the C5.0 gives the best accuracy compared to other algo-
rithms.

3.3 Air quality prediction for industrial areas and smart cities
Aruna Kumari et al. [10] describe monitoring and predicting air quality in urban and industrial areas. The work covers the following topics: interpolation of data, estimation, and analysis of air quality. The main data source for their experiments is the Karnataka state pollution control board (KSPCB), India. The air pollutants primarily emitted by industrial production are NO2, O3, PM2.5, PM10, and SO2, so air quality is predicted based on these pollutants.

Mahalingam et al. [35] have experimented by creating a machine learning model to predict air quality in urban areas. The primary data source for their experiments is the central pollution control board (CPCB) and the Ministry of Environment, Forest and Climate Change, India. For prediction, they rely on two machine learning algorithms: artificial neural networks (ANN) and support vector machines (SVM). An artificial neural network (ANN) is a multi-layered network that typically has three layers: input, hidden, and output [27]. The multilayer perceptron (MLP) is a neural-based algorithm; the input values are defined in the input layer nodes, and each node is assigned a weight. It functions like interconnected neurons and splits complex data sets into several parts [35]. MLP provides two options to train the model, either supervised or unsupervised [27]. Support vector machines (SVM) are a supervised machine learning technique and one of the classification algorithms [27]. SVM supports linear and non-linear kernel methods for classification and has become a suitable tool for pattern recognition and classification of data [35].

Mahalingam et al. collected a data set of eight air pollutants (CO, NO2 , NH3 , O3 , Pb,
PM2.5 , PM10 , SO2 ) and used these data as input. With this method, the air quality
index can be predicted. Eight layers of ANN are used in this implementation based on
Python and its libraries. Finally, they have concluded that the accuracy is 91.62% in the
case of ANN and 97.3% in the case of SVM.

3.4 Existing Android apps for monitoring and predicting air quality
Several mobile applications are being developed to monitor and predict air quality.
There exist different approaches and methodologies. In the following, a selection of
smartphone applications is discussed.

Dutta et al. [21] describe an application with a sensor-based air quality monitoring system called "AirSense". The device is named air quality monitoring device (AQMD); it is built around a small microcontroller board, the Arduino Nano, and has an MQ135 air quality sensor to detect highly toxic gases and a Bluetooth module (HS-05).

The sensor unit sends the data <AQMD_ID, AQI_Reading> to the smartphone via Bluetooth whenever there is a change in the measurement. The app then shows a heat map with the air quality index based on the captured location. In addition, this information is stored in a cloud server as the data tuple <AQMD_ID, AQI_reading, Phone_ID, Location, Time_Stamp, Status>, which is later shared with "AirSense" users who have no sensor unit. However, carrying such a sensor unit is inconvenient and challenging to maintain in the long run, and "AirSense" may be unable to calculate an accurate air quality from a single sensor's data. Similar to "AirSense", in this thesis we retrieve air quality information from the nearest station. We evaluate whether this is sufficiently reliable and believe that it provides results comparable to active measurements provided by professional sensor equipment.

Cheng et al. [14] developed a cloud-based air quality monitoring and forecasting system called "AirCloud". This system consists of various components, such as GPS, a Bluetooth module, and air quality sensors. In the frontend client, the air quality is visualized in terms of current sensor data and particulate matter PM2.5. The backend is an analytical engine that creates a model based on sensor data and an ANN algorithm. AirCloud also provides APIs for third-party application developers, and three real-time mobile applications have been developed based on the AirCloud API service to provide air quality information. Cheng et al. manufactured 500 air quality monitoring devices (AQMD) and deployed them across the city of Beijing. An accuracy of 53.6% was achieved in predicting the air quality using the PM2.5 pollutant and the ANN algorithm.

Praveen et al.[40] developed a framework for indoor air quality estimation and fore-
casting. The main idea of this work is to place low-cost air quality sensors inside the
classrooms at universities. The sensor provides the data of pollutants CO2 and PM2.5 .
Using the sensed parameters, Praveen et al. have applied multilayer perceptron (MLP)
and extreme gradient boosting regression (XGBR) algorithms for real-time air quality
estimation in other classrooms without sensors. Finally, they have created a machine
learning model with long short-term memory (LSTM), and have achieved an accuracy
of nearly 95% to 96% for forecasting. In addition, the IndoAirSense Android applica-
tion was developed for data acquisition and validation.

Our findings from the related studies are as follows. All the applications mentioned above rely on Bluetooth and GPS. Considering the recent changes in the Android Bluetooth and location APIs, devices running Android 10 and above are not allowed to access the location from the background, and frequent scanning requires a Bluetooth Low Energy (BLE) option [9]. Due to these privacy changes and power optimizations, we assume that some of the apps from the related studies might have issues or inaccuracies while transferring data between the mobile and sensor units. The Air Life Android application, in contrast, is implemented against the current API and accesses the location only in the foreground while the app is in use.

Among the related studies, the work of Singh et al. [41] is closest to this thesis, and following that idea, we have chosen a decision tree-based algorithm for creating a model. Like the SILAM model, we predict the air quality index for a 24-hour cycle, and the app targets the latest Android API.

Chapter 4

Mobile air quality information system

In this chapter, the development of the mobile app is described from the user's perspective. Chapter 5 details the implementation of the core of the system, the machine learning and prediction approach.

4.1 Smartphones and their capabilities


Smartphone devices and applications enable users to access a wide range of services on the go, including the internet, maps, travel guides, navigation systems, GPS, cameras, 5G technologies, and more.

Among the possible platforms, Android was selected for prototype development. Companies such as Google, Apple, and Samsung frequently develop and release new smartphones with high-speed processing and additional features. The motivation for choosing the Android platform is that Android devices are equipped with high-end processors, large storage capacity, multiple sensors, touch screens with a user interface, and various network connection capabilities [32]. Also, Android devices are affordable for all kinds of users.

4.2 Air Life - Android application


The developed app, Air Life, is an Android prototype application that provides air quality prediction in real time. The trained machine learning model is stored in the app's assets directory and is accessed in the background during prediction. The application provides an intuitive user interface to display the current air quality information and the forecast. The essential functions of this app are as follows.

The "Dashboard" provides precise information about the present air quality based on
the current location. It includes weather information, air pollutants, and options to
navigate to different activities. There are several risk levels shown in the dashboard.

The risk levels are categorized into the following six classes: Good, Moderate, Unhealthy for Sensitive Groups, Unhealthy, Very Unhealthy, and Hazardous (see Figure 4.1).

Figure 4.1: Air Quality Index levels of health concern [1].

The "Dashboard" is shown in Figure 4.2. Next to the risk levels, there is also infor-
mation about the weather, such as temperature, pressure, humidity, and wind. At the
bottom of the dashboard, there is a list of air pollutant information shown as well. All
the information that is displayed on the dashboard is based on the current location.

Several other activities are available to enhance the functionality of the application, such as changing the language, switching the theme, storing favorite locations to check their air quality information later, and a chart showing the air quality of the past 7 days.

Prediction is the component where the main functionality takes place. The application prepares the input data, including the forecasted weather information and general attributes (see Listing 5.1 for the input attributes), and applies the machine learning model to them. The application displays the predicted air quality information for the upcoming three days based on the location and the applied machine learning algorithm. The accuracy of the location is determined by the mobile hardware and GPS. By
using GPS, Wi-Fi, or cellular networks. A sample prediction is shown in Figure 4.3.
Each card view in the prediction screen displays the day with the corresponding date,
the air quality index value with an indication, the location, and the forecasted weather
information.

Figure 4.2: Dashboard of Air Life. Figure 4.3: Prediction of Air Life.

The following details the app's functionality. Tapping the screen in the region of the icons invokes the corresponding actions (see Figure 4.4). Starting from the left side, the refresh function fetches the most recent data and reloads the entire screen. The switch-language function switches the app's language between German and English. The statistics function opens a new screen and visualizes the past 7 days of air quality information. The more-information function displays textual information about the severity of air pollution.

Figure 4.4: Air Life app functions and user customization.

On the right side, the following functions are depicted: The switch-theme function allows the user to switch the app theme between dark and light. The add-to-favorites function saves the current location to the favorites list so that the user can check the air quality information of that specific location later, and the list-favorites function opens a new screen that lists all previously saved favorite locations and fetches their current air quality information.

Finally, the prediction function will open a new screen to perform the machine learning
and air quality prediction (see Figure 4.3).

Figures 4.5 and 4.6 depict examples of good and hazardous air quality. The scale and
air quality index values are highlighted with corresponding colors. This way the user
gets a quick overview of the air quality situation.

Figure 4.5: Good air quality. Figure 4.6: Hazardous air quality.

A few hardware and software requirements must be met to run the Air Life app. The smartphone needs to run Android with a sufficiently strong processor and RAM, and the Location and Internet permissions need to be granted. Table 4.1 summarizes the concrete minimal requirements of the app.

Specification           Minimum requirement
Android SDK             API 19
Processor               Quad-core
RAM                     2 GB
Storage                 2 GB
Operating system        Android KitKat
Permissions required    Location, Internet

Table 4.1: The minimum requirements of the Android device.

Chapter 5

Implementation of air quality prediction

This chapter provides an overview of the entire system implementation with a focus
on the functional core elements, i.e. machine learning and prediction. The system ar-
chitecture, data collection, pre-processing, and machine learning techniques to predict
air quality are described.

5.1 System architecture


The system architecture is developed based on a traditional client-server pattern, where
the client is the smartphone, and the weather and air quality data are queried from re-
spective servers. The whole application logic of the air quality prediction runs on a
smartphone app. As shown in Figure 5.1, the system consists of three major compo-
nents and a few other supporting components.

Figure 5.1: Air Life - System architecture.

The "AQI station" component is the nearest air quality station, with which the app communicates via a REST API to get the latest air quality information for the current location. "Weather info" is the component used to retrieve current and forecasted weather information. The "Smart phone" is the core component where the central air quality prediction takes place using the WEKA machine learning library. Note that the database (DB) cache resides on the smartphone and is primarily used to cache the air quality and weather information for prediction and statistics. A screenshot of the cached data is displayed in Appendix A.2.

The primary function of the application is to provide the current air quality information with its pollutants and the prediction for the next three days. The major reason for implementing the logic on the smartphone is to provide GPS-based location information. Furthermore, a smartphone provides sufficient computational power to perform machine learning and prediction. On Android, the model can easily be stored in the app's assets directory.

Figure 5.1 shows the system architecture of Air Life. The app implements the mobile
information system (described in Chapter 4) and makes use of the WEKA ML tool
and its models. An HTTP request is initiated from the smartphone to retrieve the air quality and weather information based on the user's location. The main communication channel is the Internet, and REST API methods are used to retrieve the information accordingly. The air quality prediction process is shown in Figure 5.2. The input for the machine learning algorithm is the set of general attributes including the forecasted weather information (see Table 5.1).

Figure 5.2: Air quality prediction process using weather forecasts.

The output is the air quality prediction for the next three days, displayed in the Android application. A three-day period was chosen because prediction accuracy degrades and becomes unreliable beyond three days.

In general, machine learning needs a data set to train, test, validate, and develop a reliable model. Developing a model using machine learning involves various stages, as shown in Figure 5.3. It begins with data set collection, as detailed in Section 5.3. The second stage is to split the complete data set for training and testing; this is explained in Section 5.4. The next stage trains the ML algorithm with the training data set. The final stage is to create a model based on the evaluation and performance of the best-fitting algorithm. The model created via the WEKA tool is then uploaded to the assets directory of the mobile app. In the future, it is envisioned that the model can be created on a server and downloaded from there, which has the advantage that a model can be leveraged by multiple users.

Figure 5.3: The stages involved in developing the machine learning model.

5.2 Use of the WEKA library for machine learning


For this work, WEKA has been chosen. WEKA is an open-source machine learning tool written purely in Java and developed at the University of Waikato, New Zealand [3]. It includes several data mining and machine learning techniques, including regression, classification, association, and clustering. Using the WEKA user interface, console, or Java API, we can access algorithms for data analysis, training, modeling, and testing.

Techniques: WEKA provides many techniques to apply in data mining and machine
learning. In addition, it offers various functions like explorer, experimenter, visual-
ization, and knowledge flow. Each function provides unique features and operations.
This work primarily relies on the Explorer. The WEKA Explorer offers various operations such as data set pre-processing, classification, clustering, and visualization. The stages of the prediction process (as shown in Figure 5.3) are implemented with WEKA techniques; for instance, data pre-processing, filtering, sorting, removing duplicates, applying different algorithms, and cross-validation are used in this work, as explained in Section 5.3.

Language: WEKA is written in Java. This makes it easy to run WEKA on different
platforms and computers (e.g., PC and Android smartphone). It is compatible with
Java 7 [30] and later versions. The newest version of WEKA is 3.9 [3], and there is also
an additional package manager to integrate Python. It is worth mentioning that this work relies on the Java language for both the Android app and the WEKA tools.
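As an illustration of how the stages of Figure 5.3 map to the WEKA Java API, the following minimal sketch loads an ARFF training file, runs a 10-fold cross-validation with a Random Forest, and serializes the resulting model. The file names are placeholders, and the snippet is a sketch rather than the exact code used in this work.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainAirLifeModel {
    public static void main(String[] args) throws Exception {
        // Load the training data (ARFF structure as in Listing 5.1); the file name is a placeholder.
        Instances data = new DataSource("airlife.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);        // numeric class attribute "aqi"

        RandomForest rf = new RandomForest();

        // 10-fold cross-validation, as used in the experiments of Chapter 6.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(rf, data, 10, new Random(1));
        System.out.println("MAE:  " + eval.meanAbsoluteError());
        System.out.println("RMSE: " + eval.rootMeanSquaredError());

        // Train on the full data set and store the model for the Android app's assets directory.
        rf.buildClassifier(data);
        SerializationHelper.write("airlife.model", rf);
    }
}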

5.3 Data collection and analysis


This section elaborates on the data collection, analysis phases, and the chosen tech-
niques for predicting air quality using machine learning.

Data set and attributes

The two most influential factors in machine learning are the data sets and the optimum
attributes that help to generate a reliable model. For this thesis, three categories of at-
tributes have been chosen: general, meteorological, and air quality attributes. The first
column of Table 5.1 contains general attributes such as the date, (date) time, country,
city, latitude, and longitude. In addition, the date is split into month (1 to 12), year, and weekday (Monday to Sunday).

General attributes        Meteorological attributes     Air quality attributes
Country    Nominal        Temperature  Numeric          Air quality index           Numeric
City       Nominal        Humidity     Numeric          Air quality classification  Nominal
Latitude   Numeric        Pressure     Numeric
Longitude  Numeric        Wind         Numeric
Date time  Date
Month      Numeric
Year       Numeric
Day        Nominal
Date       Numeric

Table 5.1: Attributes and their datatype.

The second column consists of meteorological attributes such as temperature, humid-
ity, pressure, and wind. The last column consists of air quality attributes such as the air
quality index and the air quality classification. The risk levels are categorized into the following six classes: good, moderate, unhealthy for sensitive groups, unhealthy, very unhealthy, and hazardous (see Figure 4.1). The training data set is made up of all of
these attributes together to generate a reliable model. For this thesis work, over 15,680
data instances have been collected for ten different countries from 2018-01 to 2020-07.

The data set is in the format called ARFF, shown in Listing 5.1. Experiments were
conducted over the entire data set and a subset of it.
@RELATION 'AirLife'
@ATTRIBUTE latitude numeric
@ATTRIBUTE longitude numeric
@ATTRIBUTE humidity numeric
@ATTRIBUTE temperature numeric
@ATTRIBUTE wind numeric
@ATTRIBUTE city {Budapest, Chengdu, Dhaka, Dubai, Kansk, Linz, london-bexley, Munich, New-Delhi, Paris}
@ATTRIBUTE country {Austria, Bangladesh, China, France, Germany, Hungary, India, Russia, UAE, UK}
@ATTRIBUTE pressure numeric
@ATTRIBUTE month numeric
@ATTRIBUTE recordeddate date "yyyy-MM-dd"
@ATTRIBUTE year numeric
@ATTRIBUTE day {MONDAY, TUESDAY, WEDNESDAY, THURSDAY, FRIDAY, SATURDAY, SUNDAY}
@ATTRIBUTE date numeric
@ATTRIBUTE aqitype {good, moderate, unhealthysensitive, unhealthy, veryunhealthy, hazardous}
@ATTRIBUTE aqi numeric
@DATA
48.29822,14.27967,77,5,8,Linz,Austria,1018,3,2021-03-19,2021,FRIDAY,19,good,25
48.29822,14.27967,64,1,14,Linz,Austria,1025,3,2021-03-20,2021,SATURDAY,20,good,12

Listing 5.1: ARFF file structure with a sample data instance.

Specifically, Austrian and Indian data sets have been selected for experiments and pre-
diction. Data has been gathered from two sources for different attributes. The air
quality information has been collected from the world air quality index [1], and the

forecasted weather information has been collected from world weather data informa-
tion [4]. A screenshot of the raw data source is displayed in Appendix A.1.

Figure 5.4 shows the countries and their numbers of instances in the training data set. From 2018-01 to 2020-01, 1826 instances for Austria and 1734 instances for India have been collected. It should be noted that this data set was obtained prior to applying filters and removing biased instances. Further, the number of instances varies depending on the availability of data for the respective country.

Figure 5.4: List of countries and their instances in the training data set.

A Class balancer is applied to re-weight the instances in the data set so that each class
has the same total weight. A synthetic minority oversampling technique (SMOTE)
filter is applied to generate random data from the existing data set using oversampling
techniques. Both filters are explained in further sections.

Figure 5.5 shows the total number of instances as well as a comparison of actual and
re-weighted classes. For example, consider "Good" classes, which have 6154 instances;
after applying a class balancer and SMOTE filters, the result is 2613 instances of "Good"
classes.

Figure 5.5: Number of AQI instances with balanced classes.

Both the air quality index data and the World Weather Online data have been gathered through REST API methods. The smartphone accesses the data by passing the current location (latitude, longitude) to the REST API service and receiving the corresponding responses. The following describes the API methods used to get the data:

1. An HTTP GET method is invoked for getting the air quality information. The
parameters are listed in Table 5.2.

2. Based on the location, the forecasted weather information is obtained using HTTP
GET. The parameters are listed in Table 5.3.

BASE_URL https://api.waqi.info/feed/
geo latitude and longitude
token API access token

Table 5.2: List of API parameters to get air quality information.

BASE_URL https://api.worldweatheronline.com/premium/v1/weather.ashx?
key Premium API key
q Query to pass inputs like latitude, longitude, city name, country, and code
format JSON and XML
num_of_days Number of days to retrieve forecast weather
tp Time interval (tp=24 / tp=12)

Table 5.3: List of API parameters to get forecast weather information.
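To illustrate how the parameters in Tables 5.2 and 5.3 map to HTTP GET requests, the following sketch builds the two request URLs and performs a simple GET. The token and key values are placeholders, and the exact "geo:" path composition of the air quality feed follows the public API documentation rather than this thesis, so it should be treated as an assumption.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class AirLifeApiClient {
    // Placeholders; real values are obtained when registering for the services.
    private static final String WAQI_TOKEN = "YOUR_WAQI_TOKEN";
    private static final String WWO_KEY = "YOUR_WWO_KEY";

    // Air quality of the nearest station for a location (parameters of Table 5.2).
    static String airQualityUrl(double lat, double lon) {
        return "https://api.waqi.info/feed/geo:" + lat + ";" + lon + "/?token=" + WAQI_TOKEN;
    }

    // Three-day weather forecast in JSON with a 24-hour interval (parameters of Table 5.3).
    static String forecastUrl(double lat, double lon) {
        return "https://api.worldweatheronline.com/premium/v1/weather.ashx"
                + "?key=" + WWO_KEY + "&q=" + lat + "," + lon
                + "&format=json&num_of_days=3&tp=24";
    }

    // Minimal HTTP GET that returns the response body as a string; error handling is omitted.
    static String get(String urlString) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(urlString).openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            StringBuilder body = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) body.append(line);
            return body.toString();
        }
    }
}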

Influencing attributes and the WEKA filters

Attribute selection inevitably leads to a complex analysis of which attributes correlate with each other. Fortunately, WEKA has filters and options to se-
lect the best-correlated attributes for training and to create a reliable machine learning
model. Essentially, those filters are categorized into two types: supervised and unsu-
pervised, filtering both instances and attributes from the given data set. Several filters
are also used for data pre-processing like interchanging the types, cleaning up, sort-
ing, and removing duplicates [47, Chapter 11.3]. For this work, several filters have been used: interquartile, class balancer, remove duplicates, and SMOTE.

Interquartile: This is a filter for detecting outliers and extreme values based on in-
terquartile ranges. The filter skips the class attribute. It is one of the unsupervised
attribute filters provided by WEKA [47]. This filter has been applied to the entire data
set to filter the outliers and extreme values.

Class balancer: The class balancer reweights the data instances so that all classes have
the same total weight. This filter has an extra option called discretization, which sets
the number of discretization intervals to use when the class is numeric [47]. This filter
has been applied to the data set to balance the biased classes.

Remove values/duplicates/range: This filter is used to remove values and duplicate instances from the data set. In addition, this filter has several properties to select a
range of attributes and instances to remove [47]. After analyzing the outliers and ex-
treme values, this filter was applied to remove those resultant values and duplicates.

SMOTE: This is one of the supervised instance filters available in WEKA [47]. This
filter works by selecting examples in the feature space that are close together, drawing

a line between them, and drawing a new sample at a point along that line. This filter
generates a random data set from the existing data using the oversampling technique.
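A sketch of how these filters could be chained with the WEKA Java API is shown below. SMOTE is distributed as a separate WEKA package and must be installed before the import resolves; the default filter parameters are used here for illustration only, and the snippet is not the exact pre-processing code of this work.

import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.supervised.instance.ClassBalancer;
import weka.filters.supervised.instance.SMOTE;               // from the separately installed "SMOTE" package
import weka.filters.unsupervised.attribute.InterquartileRange;
import weka.filters.unsupervised.instance.RemoveDuplicates;

public class PreprocessData {
    // Assumes the class index of "data" is set to the nominal air quality class.
    static Instances preprocess(Instances data) throws Exception {
        // Flag outliers and extreme values based on interquartile ranges
        // (the flagged instances can afterwards be dropped, e.g., with RemoveWithValues).
        InterquartileRange iqr = new InterquartileRange();
        iqr.setInputFormat(data);
        data = Filter.useFilter(data, iqr);

        // Drop exact duplicate instances.
        RemoveDuplicates dedup = new RemoveDuplicates();
        dedup.setInputFormat(data);
        data = Filter.useFilter(data, dedup);

        // Re-weight instances so that every class carries the same total weight.
        ClassBalancer balancer = new ClassBalancer();
        balancer.setInputFormat(data);
        data = Filter.useFilter(data, balancer);

        // Oversample minority classes with synthetic instances.
        SMOTE smote = new SMOTE();
        smote.setInputFormat(data);
        data = Filter.useFilter(data, smote);
        return data;
    }
}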

Select attribute / Attribute correlation analysis: Initially, many attributes were added to the data set, such as all the pollutants shown in Table 2.1, the timestamp, the postal code, etc., which made the model inconsistent and affected its performance. However, WEKA provides a supervised attribute filter that can be used to select the attributes. A
screenshot of the WEKA explorer is shown in Figure 5.6. It shows the option to se-
lect the attribute evaluation and the search method. In addition, it shows the list of
attributes and visualizations.

Figure 5.6: WEKA - Select attribute filter.

WEKA filters are very flexible and allow various search and evaluation methods to be combined [48]. Two options are provided: one selects an evaluator, and the other chooses a search method. The evaluator determines how the attribute subsets are evaluated, and the search option determines the search method. Several search methods are available, such as best first, ranker, and greedy stepwise.

Based on the performance of the attribute evaluator, the following filters are selected:
GainRatioAttributeEval and the Ranker search method [47]. These filters provide sev-
eral functions and algorithms to find the best-correlated attributes on the given data
set.

The gain ratio attribute evaluator is chosen as an evaluator. This filter evaluates the
importance of an attribute by measuring the gain ratio concerning the class. The gain
ratio attribute evaluator provides extra options such as specifying how to distribute
counts for missing values and enabling capabilities checks. Equation 5.1 shows how
the gain ratio is calculated for an attribute A by measuring the information gain con-
cerning class C and H represents the entropy [26].

GainR(C, A) = \frac{H(C) - H(C \mid A)}{H(A)} .     (5.1)

The Ranker search method is chosen as a search method. This method ranks all the
attributes based on the evaluation results. This method treats the missing values as a
separate value for the attribute. The Ranker search method provides extra options such
as generating constant ranking, specifying the number of attributes to retain, specify-
ing the start set, and threshold size.

One more attribute evaluator has been considered for this work, the Correlation Attribute Evaluator. It assesses the worth of an attribute by measuring the correlation between the attribute and the class [2]. In this evaluator, nominal attributes are considered on a value-by-value basis, and the overall correlation is a weighted average. However, the gain ratio gives a better evaluation than the correlation attribute evaluator. The final attributes are the ones shown in Listing 5.1.
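The following sketch illustrates how the described combination of the gain ratio evaluator and the Ranker search method can be run through the WEKA Java API. The file name is a placeholder, and the nominal air quality class (aqitype) is used as the class attribute here because the gain ratio evaluation expects a nominal class.

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.GainRatioAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RankAttributes {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("airlife.arff").getDataSet();     // placeholder file name
        data.setClassIndex(data.attribute("aqitype").index());            // nominal class for the evaluator

        AttributeSelection selection = new AttributeSelection();
        selection.setEvaluator(new GainRatioAttributeEval());
        selection.setSearch(new Ranker());
        selection.SelectAttributes(data);

        // Each row: attribute index and its gain ratio score, ordered from best to worst.
        double[][] ranked = selection.rankedAttributes();
        for (double[] entry : ranked) {
            int index = (int) entry[0];
            System.out.printf("%-15s %.4f%n", data.attribute(index).name(), entry[1]);
        }
    }
}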

5.4 Training, testing, and validation


This section describes the training approaches. The results of the training are described
in Chapter 6. Several experiments have been conducted to create a model based on
various training, testing, and validation options for this work. The training options are
as follows: cross-validation, percentage split, and supply test data set.

Cross-validation: This is one of the standard evaluation techniques and is also available in WEKA; it is also known as n-fold validation. The technique is a systematic way of running a repeated percentage split: based on the fold count n, the data set is divided into n parts. For example, with ten folds the data set is divided into ten parts; in each fold one part is held out for testing while the remaining parts are used for training, and the results are averaged. Each data point is used once for testing and nine times for training [18].

Percentage split: This approach is used to randomly split the data set into training
and testing during the evaluation of a model. This technique is one way of doing
repeated training and testing of the classifier. It enables the evaluation to be based
on the percentage split. For example, consider that the split percentage is 80%. Then
the data set will be divided into two parts: 80% for training and 20% for testing. In WEKA, re-running the split produces the same result; to obtain a different random split, the random seed can be changed via the corresponding option [18].

Supply test data set: This approach is based on a user-specified data set: the model trained on the provided training data is evaluated against a separate, user-supplied test set with the chosen algorithm [18]. It is a rather limited option, as it offers no further settings for conducting a comprehensive test.

Cross-validation and percentage split training methods have been experimented with,
and the detailed results are presented in the next chapter.
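
A minimal sketch of these two training options, using the WEKA Java API (version 3.8 assumed) with a random forest as an example classifier, is given below. It is not the thesis code; the file name airlife.arff and the use of the numeric AQI attribute as class are assumptions.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class EvaluationOptions {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("airlife.arff");  // hypothetical ARFF file
        data.setClassIndex(data.numAttributes() - 1);       // numeric AQI assumed as class

        RandomForest rf = new RandomForest();

        // 10-fold cross-validation
        Evaluation cv = new Evaluation(data);
        cv.crossValidateModel(rf, data, 10, new Random(1));
        System.out.println("Cross-validation RMSE: " + cv.rootMeanSquaredError());

        // 80/20 percentage split with a fixed random seed
        data.randomize(new Random(1));
        int trainSize = (int) Math.round(data.numInstances() * 0.8);
        Instances train = new Instances(data, 0, trainSize);
        Instances test = new Instances(data, trainSize, data.numInstances() - trainSize);
        rf.buildClassifier(train);
        Evaluation split = new Evaluation(train);
        split.evaluateModel(rf, test);
        System.out.println("Percentage-split RMSE: " + split.rootMeanSquaredError());
    }
}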

5.5 Machine learning techniques and tools


Technologies evolve continuously over time, bringing significant changes across the world. Machine learning methods have been developed for over 60 years and have achieved great success in various domains [33, 49] such as data science, medicine, production, manufacturing, defence, and research. In environmental research, machine learning may be used to better understand air quality and to predict near-future air quality.

In general, there are two major approaches to predict air quality: either the air quality
index (numeric) or the air quality class (nominal). Predicting the air quality index is a
regression problem, and predicting the air quality class is a classification problem. In
this thesis, the regression approach is selected to predict the air quality index. Later,
the air quality class is mapped to air quality index intervals (scales).

Several machine learning algorithms are chosen for this thesis and compared to derive the best fitting model (see Table 5.4). The main reason for choosing this set of algorithms is that related studies reported them as promising approaches with good performance and results on comparable data sets.

Algorithms WEKA Classifier


ZeroR Rules
RepTree Trees
Random Forest Trees
KNN Lazy

Table 5.4: List of algorithms chosen for generating the models.

ZeroR is a simple WEKA classifier used to obtain a baseline for the prediction. It predicts the mean (for a numeric class) or the mode (for a nominal class). This algorithm is also known as the zero-rules algorithm in WEKA. It is mainly used as a basic classifier against which the performance of other classifiers is measured; the accuracy of all other classifiers should be higher than that of the ZeroR classifier. For example, the ZeroR baseline on the Austrian data set resulted in an RMSE of about 39.76, whereas a Random Forest applied to the same data resulted in an RMSE of about 6.77. ZeroR is described in detail in [47, Chapter 11.4].

RepTree is a decision tree-based algorithm available in WEKA [47, Chapter 11.4]. In principle, it works by the divide-and-conquer method: based on the given data set and the information gain, it builds a regression/decision tree. It is also known as the fast decision tree learner, as it uses reduced-error pruning. Several parameters are available to improve the algorithm's performance; the applied parameters are as follows: minimum weight of the instances in a leaf node, maximum depth of the tree, enabling/disabling pruning, and the number of folds that determines the amount of data used for pruning.

Decision trees are among the most widely used classifiers; they are simple to understand and configure [38, 48, 50]. Over-fitting is one of the practical problems that arises when the algorithm builds a very deep decision tree, mainly because of irregularities in the data set. To overcome this problem, pre-pruning and post-pruning are used [18, 34, 38]. Pre-pruning stops the construction of the decision tree early, for example by not splitting a node if its goodness measure is below a threshold value. In contrast, post-pruning first builds the complete tree and afterwards removes branches that do not improve generalization [18]. The following equations are used to calculate the information gain of the decision tree.

H(Y) = − ∑_{i=1}^{k} P(Y = y_i) log P(Y = y_i).   (5.2)

H(Y | X) = − ∑_{i=1}^{l} P(X = x_i) log P(Y | X = x_i).   (5.3)

IG(Y; X) = H(Y) − H(Y | X).   (5.4)

Let Y and X be discrete variables taking the values y_1, y_2, . . . , y_k and x_1, x_2, . . . , x_l. The entropy of Y and the conditional entropy of Y given X are then calculated as shown in Equation (5.2) and Equation (5.3). From these, the information gain of X is calculated as shown in Equation (5.4).
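
To make Equations (5.2) to (5.4) concrete, the following small sketch (plain Java, not part of the thesis implementation) computes the entropy and the information gain of a nominal attribute for a toy AQI-class example; the attribute values and class labels are invented for illustration only.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InformationGain {

    // H(Y) = -sum_i P(Y = y_i) * log2 P(Y = y_i)
    static double entropy(String[] labels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String y : labels) counts.merge(y, 1, Integer::sum);
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / labels.length;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    // IG(Y; X) = H(Y) - H(Y | X): the entropy of each subset of Y is weighted
    // by the relative frequency of the corresponding value of X.
    static double informationGain(String[] x, String[] y) {
        Map<String, List<String>> split = new HashMap<>();
        for (int i = 0; i < x.length; i++)
            split.computeIfAbsent(x[i], k -> new ArrayList<>()).add(y[i]);
        double conditional = 0.0;
        for (List<String> subset : split.values()) {
            conditional += ((double) subset.size() / y.length)
                    * entropy(subset.toArray(new String[0]));
        }
        return entropy(y) - conditional;
    }

    public static void main(String[] args) {
        // Toy example: does a "windy" attribute help predict the AQI class?
        String[] windy = {"yes", "yes", "no", "no", "no", "yes"};
        String[] aqi   = {"moderate", "moderate", "good", "good", "moderate", "moderate"};
        System.out.printf("IG = %.3f%n", informationGain(windy, aqi));
    }
}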

Random forest is an ensemble machine learning algorithm [47, Chapter 11.4] that supports both classification and regression problems. It consists of many decision trees, and the final output is obtained by combining the outputs of the individual trees. It is also a decision tree-based algorithm available in WEKA. The algorithm has many advantages: for example, it handles missing values well, reduces over-fitting by combining multiple trees, and runs efficiently on larger data sets. Several parameters are available to improve its performance; the applied parameters are as follows: prediction batch size, maximum depth of the trees, number of randomly chosen attributes, number of iterations, and random number seed [18, 34, 38].
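
The sketch below illustrates how these parameters can be set through the WEKA Java API (version 3.8 assumed, where RandomForest is implemented as a bagging ensemble). It is not the thesis code; the file name and the concrete parameter values are assumptions.

import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RandomForestSetup {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("airlife.arff");  // hypothetical ARFF file
        data.setClassIndex(data.numAttributes() - 1);       // numeric AQI assumed as class

        RandomForest rf = new RandomForest();
        rf.setNumIterations(100);   // number of trees in the ensemble
        rf.setMaxDepth(0);          // 0 = unlimited tree depth
        rf.setNumFeatures(0);       // 0 = default number of randomly chosen attributes
        rf.setSeed(1);              // random number seed
        rf.setBatchSize("100");     // prediction batch size

        rf.buildClassifier(data);
        System.out.println(rf);
    }
}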

KNN (k-nearest neighbours) is a lazy, instance-based algorithm available in WEKA that supports both classification and regression. It operates by storing the whole training data set and querying it when making a prediction in order to locate the k most similar training patterns, i.e., the closest neighbours. In principle, it relies on a nearest-neighbour search algorithm and a distance function. Different search algorithms are available in WEKA to find the k closest neighbours. The most important component is the distance function, including Chebyshev (see Equation 5.5), Euclidean (see Equation 5.6), Manhattan (see Equation 5.7), and Minkowski (see Equation 5.8).

Having p and q as data points, d( p, q) is the distance between the two points [38, 48,
50].

d_Chebyshev(p, q) = max_i |p_i − q_i|.   (5.5)

d_Euclidean(p, q) = √( ∑_{i=1}^{n} (p_i − q_i)² ).   (5.6)

d_Manhattan(p, q) = ∑_{i=1}^{n} |p_i − q_i|.   (5.7)

d_Minkowski(p, q) = ( ∑_{i=1}^{n} |p_i − q_i|^m )^{1/m}.   (5.8)

Several parameters are available to improve the performance of the algorithm; however, only a few of them have been used in this work. The k value defines the number of neighbours used during the prediction; further parameters are the choice of the search algorithm and the distance function. One more essential parameter is distance weighting: the distances to the neighbours are summed for each class value, and a weighted voting value is obtained. The class value with the highest weighted voting value is assigned to the new observation. In WEKA, two different options are available for distance weighting, as shown in Equation (5.9) and Equation (5.10).

Weight_inverse = ∑_{i=1}^{N} 1 / √( d(p, q)² / n ).   (5.9)

Weight_similarity = ∑_{i=1}^{N} ( 1 − √( d(p, q)² / n ) ).   (5.10)

In these equations, N represents the number of instances in the training data set and n the number of attributes. With p and q as data points, d(p, q) is the distance between the two points.
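
The following sketch shows how the corresponding parameters can be set for WEKA's k-nearest-neighbour classifier IBk (WEKA 3.8 API assumed); it is not the thesis code, and the file name, the k value, and the choice of inverse-distance weighting are assumptions.

import weka.classifiers.lazy.IBk;
import weka.core.EuclideanDistance;
import weka.core.Instances;
import weka.core.SelectedTag;
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.neighboursearch.LinearNNSearch;

public class KnnSetup {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("airlife.arff");  // hypothetical ARFF file
        data.setClassIndex(data.numAttributes() - 1);       // numeric AQI assumed as class

        IBk knn = new IBk(5);  // k = 5 neighbours
        // Inverse-distance weighting; the alternative is 1 - distance (similarity) weighting
        knn.setDistanceWeighting(new SelectedTag(IBk.WEIGHT_INVERSE, IBk.TAGS_WEIGHTING));
        // Linear nearest-neighbour search with a Euclidean distance function
        LinearNNSearch search = new LinearNNSearch();
        search.setDistanceFunction(new EuclideanDistance());
        knn.setNearestNeighbourSearchAlgorithm(search);

        knn.buildClassifier(data);  // lazy learner: this mainly stores the training data
        System.out.println(knn);
    }
}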

Chapter 6

Experiments and results

This chapter presents the performance measures and a comparison of the various algo-
rithms. It also presents the real-time prediction results. Experimentation is carried out
using a data set containing 15680 instances.

6.1 Performance measures of the algorithm


The measures compare the following two values: p_i and a_i, where p_i represents the predicted value and a_i the actual value. n is the total number of observations in the data set.

Correlation coefficient: The correlation coefficient is used to measure the statistical


correlation between the actual and predicted values. A positive correlation coefficient shows that if one value increases, the other increases as well. A negative correlation coefficient refers to the opposite situation, i.e., an increase of one set of values corresponds to a decrease of the other. If the value is zero, there is no correlation between the two sets of values.

In this work, the correlation coefficient r is calculated as follows:

r = S_PA / √( S_P · S_A ), where   (6.1)

S_PA = ∑_i (p_i − p̄)(a_i − ā) / (n − 1),   (6.2)

S_P = ∑_i (p_i − p̄)² / (n − 1), and   (6.3)

S_A = ∑_i (a_i − ā)² / (n − 1).   (6.4)

In these equations, p̄ is the mean of the predicted values and ā is the mean of the actual
values.

Mean absolute error (MAE): This value expresses how close the predictions are to the actual values and provides an average of the absolute errors. All errors are treated evenly according to their magnitude [48, Chapter 5.9]. This error is calculated as

MAE = ( |p_1 − a_1| + ... + |p_n − a_n| ) / n.   (6.5)

Root mean squared error (RMSE): This widely used measure represents the standard deviation of the errors that occur when predictions are made on a data set. It is similar to MAE, but the squared errors are averaged and the square root of the result is taken when determining the accuracy of the model. It is calculated as

RMSE = √( ( (p_1 − a_1)² + ... + (p_n − a_n)² ) / n ).   (6.6)

Relative absolute error (RAE): Here the total absolute error is normalized by the total absolute error of a simple predictor that always predicts the average of the actual values [48, Chapter 5.9]. As shown in Equation 6.7, this value is calculated as

RAE = ( |p_1 − a_1| + ... + |p_n − a_n| ) / ( |a_1 − ā| + ... + |a_n − ā| ).   (6.7)

Root relative squared error (RRSE): This measure also relies on the simple predictor that predicts the average of the actual values from the training data, denoted by ā. The main property of RRSE is that it relates the squared errors of the predictor to the squared errors of this simple predictor. It is calculated as

RRSE = √( ( (p_1 − a_1)² + ... + (p_n − a_n)² ) / ( (a_1 − ā)² + ... + (a_n − ā)² ) ).   (6.8)
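
For illustration, the following plain-Java sketch (not part of the thesis implementation) computes MAE, RMSE, RAE, and RRSE for two aligned arrays of predicted and actual AQI values; the numbers are toy values.

public class ErrorMeasures {

    static double mean(double[] v) {
        double s = 0;
        for (double x : v) s += x;
        return s / v.length;
    }

    public static void main(String[] args) {
        double[] p = {73, 81, 55, 62};   // predicted AQI (toy values)
        double[] a = {92, 83, 50, 60};   // actual AQI (toy values)
        int n = p.length;
        double aBar = mean(a);

        double absErr = 0, sqErr = 0, absDev = 0, sqDev = 0;
        for (int i = 0; i < n; i++) {
            absErr += Math.abs(p[i] - a[i]);
            sqErr  += Math.pow(p[i] - a[i], 2);
            absDev += Math.abs(a[i] - aBar);
            sqDev  += Math.pow(a[i] - aBar, 2);
        }
        double mae  = absErr / n;               // Equation (6.5)
        double rmse = Math.sqrt(sqErr / n);     // Equation (6.6)
        double rae  = absErr / absDev;          // Equation (6.7)
        double rrse = Math.sqrt(sqErr / sqDev); // Equation (6.8)

        System.out.printf("MAE=%.2f RMSE=%.2f RAE=%.2f%% RRSE=%.2f%%%n",
                mae, rmse, rae * 100, rrse * 100);
    }
}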

6.2 Experiments and results of the training


In the following, the experimental approaches and the corresponding results are described. The first approach trains the model with 10-fold cross-validation on the entire data set. The second approach trains the model using the percentage split option. The last approach trains the model on subsets of the entire data set, i.e., the Austrian and the Indian data, using 10-fold cross-validation.

• The first training approach is performed on the entire data set using 10-fold cross-
validation. We have collected 15680 instances and 15 attributes for training the
model. Listing 5.1 shows the sample set of instances, and Figure 5.4 illustrates
the instance count based on countries.

The training results of the three different machine learning algorithms are shown in Table 6.1. 10-fold cross-validation uses the whole training data set: in each fold, 90% of the data is used for training and the remaining 10% for evaluation. Based on these results, the random forest algorithm has the highest positive correlation coefficient of 0.99 compared to the other two algorithms.

Measures RepTree Random Forest KNN


Correlation coefficient 0.92 0.99 0.98
MAE 7.45 4.46 1.23
RMSE 9.64 7.25 4.95
RAE 37.65% 5.18% 6.25%
RRSE 37.60% 7.07% 19.33%

Table 6.1: Performance measures with 10-fold cross validation training, entire data set.

The results of KNN are closest to those of the random forest; however, its RRSE of 19.33% is almost three times that of the random forest (7.07%).

Figure 6.1 visualizes the MAE and RMSE error rates and Figure 6.2 visualizes the
RAE and RRSE error rates comparing the three machine learning algorithms.

Figure 6.1: Comparison of MAE and RMSE using the 10-fold cross-validation method.

Figure 6.2: Comparison of RAE and RRSE using the 10-fold cross-validation method.

• The second training approach is performed using a percentage split on the entire data set. Percentage split is a common, well-known training option that is also available in WEKA. The data set is divided into two parts based on the split percentage, one part for training and the other for testing. For this approach, the split percentage is set to 80%; accordingly, 80% of the data is used for training and 20% for testing.

The training results of the three machine learning algorithms obtained with the percentage split method are shown in Table 6.2. Based on these results, the random forest algorithm has a higher positive correlation coefficient and smaller errors than RepTree and KNN. However, the RAE and RRSE error rates are higher than those measured with the 10-fold cross-validation method.

Measures RepTree Random Forest KNN


Correlation coefficient 0.92 0.97 0.95
MAE 8.14 4.14 3.143
RMSE 10.11 6.14 7.93
RAE 39.31% 20.02% 15.18%
RRSE 37.57% 22.81% 29.49%

Table 6.2: Performance measures with split percentage training.

Figure 6.3: Comparison of MAE and RMSE using the percentage split method.

Figure 6.3 visualizes the MAE and RMSE error rates and Figure 6.4 visualizes
the RAE and RRSE error rates from the second approach using a percentage split
with three machine learning algorithms.

Figure 6.4: Comparison of RAE and RRSE using the percentage split method.

• The third training approach is performed on specific data sets using 10-fold cross-validation, again with the three machine learning algorithms. For this approach, the Austrian and the Indian data are selected to train separate models. We first trained the model using the Austrian data set with 1826 instances from 2018-01 to 2020-01.

Table 6.3 shows the corresponding correlation coefficient results and error mea-
sures.

Measures RepTree Random Forest KNN


Correlation coefficient 0.95 0.98 0.97
MAE 9.30 3.64 2.22
RMSE 12.00 6.77 8.12
RAE 26.81% 10.50% 6.40%
RRSE 30.20% 17.03% 20.43%

Table 6.3: Training on the Austrian data set and its performance measures.

The random forest results show a correlation of about 0.98 with an MAE of 3.64. In comparison, KNN has a correlation of about 0.97 and an MAE of 2.22. However, the RMSE and RRSE of RepTree and KNN are higher than those of the random forest.

Figure 6.5 visualizes the classifier errors that occurred during the training of the model using the random forest algorithm with 10-fold cross-validation. Figure 6.5 is a screenshot captured from the WEKA machine learning tool.

Figure 6.5: Classifier error using the random forest algorithm.

The X-axis represents the actual AQI and the Y-axis the predicted AQI. Clicking on a data point shows its detailed information. The selected instance illustrates the classification of an Austrian data point: the predicted AQI is 81.37, whereas the actual AQI is 83, a small difference. Both values fall into the same air quality category, "Moderate". In general, a linear correlation between the actual and predicted values can be observed.

Figure 6.6 visualizes the results of sample prediction on the Austrian data set
using the random forest algorithm. The applied training option is 10-fold cross-
validation. We have selected 50 sample instances to see the results of the corre-
lation between the predicted and actual AQI. For example, the highlighted 31st
data point indicates that the predicted AQI value is 73 and the actual AQI value

is 92, where the difference is 19. Yet, the AQI class falls under the same category
as "Moderate" (see Figure 4.1).

Figure 6.6: The actual and the predicted AQI values are shown for each instance.

The following training was conducted on the Indian data set with 1734 instances
from 2018-01 to 2020-01. The same 10-fold cross-validation technique is applied
during this approach. The corresponding correlation coefficient results and error
measures are shown in Table 6.4.

Measures RepTree Random Forest KNN


Correlation coefficient 0.93 0.97 0.95
MAE 9.92 5.6275 1.80
RMSE 18.51 8.68 9.53
RAE 9.48% 5.37% 1.72%
RRSE 13.80% 7.22% 6.35%

Table 6.4: Training on Indian data set and its performance measures.

Figure 6.7 visualizes the results of sample prediction on the Indian data set us-
ing the random forest algorithm. The applied training option is 10-fold cross-
validation. We have selected 50 sample instances to see the results of the corre-
lation between the predicted and actual AQI. For example, the highlighted 4th

data point indicates that the predicted AQI value is 126 and the actual AQI value
is 144, where the difference is 18. Yet, the AQI class falls under the same category
as "Unhealthy for sensitive groups" (see Figure 4.1).

Figure 6.7: The actual and the predicted AQI values are shown for each instance.

All experiments with the given data set were conducted and the results recorded. A summary of the three approaches and their findings is as follows: the first two approaches train on the entire data set with three different algorithms and two training options. Notably, the combination of random forest with 10-fold cross-validation yielded the best-fitting correlations and the fewest errors, which led us to use the same combination for the third approach. The motivation for the third approach is that the Austrian and Indian AQI values differ considerably because of the different environments. This approach was therefore conducted on the country-specific data using the random forest algorithm with 10-fold cross-validation, and its results differ noticeably from those of the previous two approaches.

The primary distinction occurred when training on a specific data set. For example, when training on the Austrian data only, the MAE was 3.64 and the RMSE 6.77, which is lower than in the previous two approaches. Possible causes of this difference are the decision split conditions and the 10-fold cross-validation: the decision trees split on the most highly correlated attributes and their data, such as location, humidity, temperature, wind, and pressure.

6.3 Response time to build a model
Response time is the total amount of time it takes to respond to a request; here, it is measured as the time taken to generate the machine learning model. Tables 6.5 to 6.7 show the response times to generate the models, depending on the algorithm and its input. KNN is not included because, as a lazy learner, it does not build an explicit prediction model, as stated in Section 5.5.

Algorithms Instances Approach Response time (seconds)


RepTree 15680 10-fold cross-validation 0.11
Random Forest 15680 10-fold cross-validation 3.79

Table 6.5: The time taken to build a model with 15680 instances.

Algorithms Instances Approach Response time (seconds)


RepTree 1826 10-fold cross-validation 0.04
Random Forest 1826 10-fold cross-validation 0.46

Table 6.6: The time taken to build a model by using the Austrian data set of 1826 in-
stances.

Algorithms Instances Approach Response time (seconds)


RepTree 1734 10-Fold cross validation 0.01
Random Forest 1734 10-Fold cross validation 0.36

Table 6.7: The time taken to build a model by using the Indian data set of 1734 in-
stances.

The response time to build a machine learning model depends on the size of the data
instances and the training techniques. With an increase in the data set and instances,
the performance and response time may change.
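
The response times reported above correspond to the time spent building the classifier. A minimal sketch of how such a measurement can be taken with the WEKA Java API is shown below; it is not the thesis code, and the file name is an assumption.

import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BuildTime {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("airlife.arff");  // hypothetical ARFF file
        data.setClassIndex(data.numAttributes() - 1);

        RandomForest rf = new RandomForest();
        long start = System.currentTimeMillis();
        rf.buildClassifier(data);                           // time spent building the model
        long elapsed = System.currentTimeMillis() - start;
        System.out.println("Model built in " + elapsed / 1000.0 + " s");
    }
}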

6.4 Real-time results
The following gives a quick summary of air quality predictions in real-time using the
Android smartphone app Air Life.

To obtain real-time results, experiments are conducted for an Austrian and an Indian location. Training is based on actual air quality and weather data sets; the real-time tests are then based on weather forecasts and general data (see Table 5.1).

The "Air Life" Android application will prepare the input data, including weather fore-
casts and general attributes. The output is the prediction of the air quality index for the
next three days on a 24-hour cycle. For instance, the application prepares the weather
forecasting data for the next three days and passes it to the model.
@RELATION 'AirLife'
@ATTRIBUTE latitude numeric
@ATTRIBUTE longitude numeric
@ATTRIBUTE humidity numeric
@ATTRIBUTE temperature numeric
@ATTRIBUTE wind numeric
@ATTRIBUTE city {Budapest, Chengdu, Dhaka, Dubai, Kansk, Linz, london-bexley, Munich, New-Delhi, Paris}
@ATTRIBUTE country {Austria, Bangladesh, China, France, Germany, Hungary, India, Russia, UAE, UK}
@ATTRIBUTE pressure numeric
@ATTRIBUTE month numeric
@ATTRIBUTE recordeddate date yyyy-MM-dd
@ATTRIBUTE year numeric
@ATTRIBUTE day {MONDAY, TUESDAY, WEDNESDAY, THURSDAY, FRIDAY, SATURDAY, SUNDAY}
@ATTRIBUTE date numeric
@ATTRIBUTE aqitype {good, moderate, unhealthysensitive, unhealthy, veryunhealthy, hazardous}
@ATTRIBUTE aqi numeric
@DATA
48.29844,14.27967,76,19,18,Linz,Austria,1018,04,2021-04-06,2021,TUESDAY,06,?,?
48.29844,14.27967,83,11,7,Linz,Austria,1016,04,2021-04-07,2021,WEDNESDAY,07,?,?
48.29844,14.27967,76,6,26,Linz,Austria,1019,04,2021-04-08,2021,THURSDAY,08,?,?

Listing 6.1: Sample input instances of the AQI prediction (Linz, Austria).

Listing 6.1 shows the sample input instances that are passed to the model for prediction. It includes all 15 attributes and the instances for the next three days. The question mark symbol denotes a value that is to be predicted.
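
A minimal sketch of this prediction step is shown below: a previously trained model is loaded and applied to the unlabeled instances of Listing 6.1. It is not the app's actual code; the file names randomforest.model and airlife_forecast.arff are placeholders.

import weka.classifiers.Classifier;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils.DataSource;

public class AqiPredictor {
    public static void main(String[] args) throws Exception {
        // Load the trained model (e.g., exported from the WEKA explorer)
        Classifier model = (Classifier) SerializationHelper.read("randomforest.model");

        // Load the unlabeled instances prepared by the app (aqi is "?")
        Instances unlabeled = DataSource.read("airlife_forecast.arff");
        unlabeled.setClassIndex(unlabeled.numAttributes() - 1);  // aqi attribute as class

        for (int i = 0; i < unlabeled.numInstances(); i++) {
            double predictedAqi = model.classifyInstance(unlabeled.instance(i));
            System.out.printf("Day %d: predicted AQI = %.1f%n", i + 1, predictedAqi);
        }
    }
}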

Figure 6.8 shows the air quality classes and their index value boundaries. We have
numbered the classes from 1 to 6 for ease of understanding.

Figure 6.8: Air quality classes and value range.


Real-time experiments in Linz, Austria - The resulting time series of actual vs. pre-
dicted values is depicted in Figure 6.9.

Figure 6.9: Time series of actual vs. predicted air quality for "Linz, Austria".
Table 6.8 summarizes the corresponding AQI classes. A first observation is that the predicted values match the actual ones well. In particular, the trends of the actual AQI (increase/decrease) correspond to the trends of the predicted AQI. The result is even more encouraging when looking at the resulting classification of the AQI values: Table 6.8 shows that on only 3 out of 16 days there was a mismatch (class 2 predicted while the actual class was 1).

Date          Actual AQI class   Predicted AQI class

19-03-2021    2                  2
20-03-2021    2                  2
21-03-2021    2                  2
22-03-2021    2                  2
23-03-2021    2                  2
24-03-2021    2                  2
25-03-2021    1                  1
26-03-2021    2                  2
27-03-2021    2                  2
28-03-2021    1                  1
29-03-2021    2                  2
30-03-2021    1                  2
01-04-2021    2                  2
04-04-2021    1                  2
06-04-2021    1                  1
07-04-2021    1                  2

Table 6.8: Actual vs. predicted air quality classes of "Linz, Austria".
Table 6.9 contains the consolidated statistics of predicted and actual air quality indexes
for the location "Linz, Austria".

Mean Std CV (%) Min Max


Actual AQI 50 12.30 24.58 26 68
Predicted AQI 57 7.6 13.54 43 70

Table 6.9: Statistics of actual and predicted (Air Life) AQI.


The prediction is observed for 16 days with a 24-hour cycle, which is visualized in
Figure 6.10.

Figure 6.10: Overall results of a 16-day prediction for "Linz, Austria".

The overall results show a mean value of 50 for the actual AQI and 57 for the predicted AQI, an overall difference of 7. This difference is caused by the most highly correlated attributes, such as location, humidity, temperature, wind, and pressure, and by the way the decision trees split on them. The applied algorithm is random forest with 10-fold cross-validation.
Real-time experiments in New-Delhi, India - The resulting time series of actual vs.
predicted values in New Delhi is depicted in Figure 6.11.

Figure 6.11: Time series of actual vs. predicted air quality for "New-Delhi, India".

Table 6.10 summarizes the corresponding AQI classes. The experiments with the New Delhi data set also show promising results: the predicted values match the actual ones well. Table 6.10 shows that, in this data set as well, there was a mismatch on only 3 out of 13 days (class 3 predicted while the actual class was 4).
Date          Actual AQI class   Predicted AQI class

20-03-2021    4                  4
21-03-2021    4                  4
22-03-2021    4                  4
23-03-2021    4                  4
24-03-2021    4                  3
25-03-2021    3                  3
27-03-2021    3                  3
28-03-2021    3                  3
29-03-2021    4                  3
30-03-2021    4                  4
02-04-2021    3                  3
05-04-2021    3                  3
06-04-2021    4                  3

Table 6.10: Actual vs. predicted air quality classes of "New-Delhi, India".

Table 6.11 contains the consolidated statistics of predicted and actual air quality in-
dexes for the location "New-Delhi, India".

Mean Std CV (%) Min Max


Actual AQI 153 22.51 14.75 111 190
Predicted AQI 146 5.57 3.80 138 155

Table 6.11: Statistics of predicted (Air Life) and actual AQI.

The prediction is observed for 13 days with a 24-hour cycle, which is visualized in
Figure 6.12.

Figure 6.12: Overall results of a 13-day prediction for "New-Delhi, India".

The overall results show a mean value of 153 for the actual AQI and 146 for the predicted AQI, an overall difference of 7.

Based on the AQI values, the air quality classes (see Figure 6.8) can be derived. Com-
pared to the Austrian data, the Indian data comprises higher AQI values.

An overview confusion matrix is used to further analyze the classification outcome.


The predicted classes are represented in the columns of the matrix, whereas the actual
classes are in the rows of the matrix.

Figure 6.13: Confusion matrix (Linz, Austria).

The overview confusion matrix in Figure 6.14 and the summary in Table 6.12 show that the overall prediction accuracy for Linz, Austria is 81%. Based on the observed 16-day prediction, the True Positive (TP) count is 10 and the True Negative (TN) count is 3.

Figure 6.14: Overview confusion matrix for 16-day prediction (Linz, Austria).

Accuracy Precision Recall F1-Score Total Classified


81% 76% 100% 86% 16

Table 6.12: Classification summary of air quality classes (Linz, Austria).
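
The measures in Table 6.12 follow from the confusion-matrix counts. The sketch below (not part of the thesis implementation) recomputes them from the binary counts of the Linz experiment; the FP and FN values are inferred assumptions, so small rounding differences to the table are possible.

public class ClassificationSummary {
    public static void main(String[] args) {
        int tp = 10, tn = 3, fp = 3, fn = 0;  // assumed counts for the 16-day Linz prediction

        double accuracy  = (double) (tp + tn) / (tp + tn + fp + fn); // 13/16 = 0.81
        double precision = (double) tp / (tp + fp);                  // 10/13 ≈ 0.77
        double recall    = (double) tp / (tp + fn);                  // 10/10 = 1.0
        double f1        = 2 * precision * recall / (precision + recall);

        System.out.printf("Accuracy=%.2f Precision=%.2f Recall=%.2f F1=%.2f%n",
                accuracy, precision, recall, f1);
    }
}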

Figure 6.15: Confusion matrix (New-Delhi, India).

The confusion matrix in Figure 6.16 and the summary in Table 6.13 show that the overall prediction accuracy for New-Delhi, India is 76%. Based on the observed 13-day prediction, the True Positive (TP) count is 5 and the True Negative (TN) count is 5.

Figure 6.16: Overview confusion matrix for 13-day prediction (New-Delhi, India).

Accuracy Precision Recall F1-Score Total Classified


76% 62% 100% 76% 13

Table 6.13: Classification summary of air quality classes (New-Delhi, India).

The overall findings of the real-time prediction are as follows: the machine learning model for the Austrian data gives a correlation coefficient of 0.98, and the accuracy of the real-time prediction is 81%; that is, out of 16 predicted instances, 13 were correctly classified and 3 were not. For the Indian data, the correlation coefficient is 0.97 and the real-time prediction accuracy is 76%, i.e., 10 out of 13 instances were correctly classified.

Chapter 7

Conclusion

In this thesis, a system has been proposed that can predict air quality using weather
forecasts and machine learning. The main task was to create a machine learning model
to predict the air quality and implement it for real-time prediction using an Android
smartphone. Historical air quality and weather data have been used. Prediction is implemented by leveraging the WEKA machine learning tool.

Overall, 15,680 data instances have been collected for ten countries from 2018-01 to 2020-07. Several experiments have been conducted while training the model, including data cleanup, removal of data anomalies, training with different machine learning algorithms, correlation analysis between the attributes, and several training options. The trained model has then been imported into the Android application. The newly developed custom Android application can predict and visualize the air quality estimates based on the current location. The application is available in two geographical regions, Austria and India.

The following are the primary outcomes of this work: a trained machine learning model was successfully created for air quality index prediction. The results of the real-time prediction were observed and compared with the actual air quality index over a specific duration. Using a random forest algorithm, the overall prediction accuracy is 81% for Austria and 76% for India.

The scope of this thesis can be extended further to improve the reliability and precision of the predictions. The training can also be extended to other countries and regions, and hourly predictions can be envisioned for the next version of the application. In the future, the model could be downloaded from a server through the Android application. The app is intended to contribute to awareness about air pollution; yet, more awareness and further initiatives will be necessary to counteract the problems of air pollution and climate change in the future.

Appendix

A.1. Raw Data Source

Figure A.1: A screenshot of the raw data source - "Linz, Austria".

Figure A.1 shows the raw data source that was used for training the machine learning model. This screenshot was captured from the WEKA data source viewer. The data source consists of three categories: general data, weather data, and air quality data.

General data consists of values for the following attributes: city, country, latitude, lon-
gitude, recorded-date, year, day of the month, and date.

Air quality data consists of air quality information and an air quality class. The air quality index is divided into six classes: good, moderate, unhealthy for sensitive groups, unhealthy, very unhealthy, and hazardous.

The following are the air quality classes and their details.

1. "Good": Air quality is considered satisfactory, and air pollution poses little or no
risk.

2. "Moderate": Air quality is acceptable; however, for some pollutants there may be
a moderate health concern for a very small number of people who are unusually
sensitive to air pollution.

3. "Unhealthy for sensitive groups": Members of sensitive groups may experience


health effects. The general public is not likely to be affected.

4. "Unhealthy": Everyone may begin to experience health effects; members of sen-


sitive groups may experience more serious health effects.

5. "Very unhealthy": Health alert - everyone may experience more serious health
effects.

6. "Hazardous": Health warnings of emergency conditions. The entire population


is more likely to be affected.

Weather data consists of humidity, temperature, wind, and pressure.

A.2. Air Life Data Source

Figure A.2: A screenshot of the Air Life app’s cached data source.

Figure A.2 shows the database (cache) of the Air Life Android app. The data source consists of air quality and weather information. This cached information is used for further predictions and statistics.

Bibliography

[1] “WAQI” World Air Quality Index. May 10, 2021. URL: https://waqi.info/ (vis-
ited on 05/10/2021).
[2] “WEKA Wiki” Online courses and information. June 10, 2021. URL: https://waikato.github.io/weka-wiki/ (visited on 06/10/2021).
[3] “WEKA” Open-source machine learning tool based on Java language. May 10, 2021. URL: https://www.cs.waikato.ac.nz/ml/weka/ (visited on 05/10/2021).
[4] “WWO” World Weather Online. May 10, 2021. URL: https://www.worldweatheronline.com/ (visited on 05/10/2021).
[5] Air Quality Forecasting Models. URL: https://aqicn.org/forecast/models/ (visited on 08/06/2021).
[6] Air Quality Now-Cast and Instant-Cast. URL: https://aqicn.org/faq/2015-03-15/air-quality-nowcast-a-beginners-guide/ (visited on 08/06/2021).
[7] Air quality standards. Jan. 18, 2022. URL: https://ec.europa.eu/environment/air/quality/standards.htm (visited on 01/18/2022).
[8] Ambient (outdoor) air pollution. July 13, 2021. URL: https://www.who.int/news-room/fact-sheets/detail/ambient-(outdoor)-air-quality-and-health (visited on 07/13/2021).
[9] Android Privacy Changes. URL: https://developer.android.com/about/versions/10/privacy/changes (visited on 10/10/2021).
[10] N S Aruna Kumari et al. “Prediction of Air Quality in Industrial Area”. In: 2020
International Conference on Recent Trends on Electronics, Information, Communica-
tion Technology (RTEICT). 2020, pp. 193–198. DOI: 10.1109/RTEICT49044.2020.
9315660.
[11] Atmospheric dispersion modeling. URL: https://en.wikipedia.org/wiki/Atmospheric_dispersion_modeling (visited on 08/06/2021).
[12] Atmospheric dispersion modeling. URL: https://en.wikipedia.org/wiki/SILAM
(visited on 08/06/2021).

[13] Helotonio Carvalho. “Air pollution-related deaths in Europe - time for action”.
eng. In: Journal of global health 9.2 (2019), pp. 020308–020308. ISSN: 2047-2978.
[14] Yun Cheng et al. “AirCloud: a cloud-based air-quality monitoring system for
everyone”. eng. In: Proceedings of the 12th ACM Conference on embedded network
sensor systems. SenSys ’14. ACM, 2014, pp. 251–265. ISBN: 9781450331432.
[15] Climate change, air pollution and global challenges: understanding and perspectives from forest research. eng. Burlington, 2013. URL: https://www.sciencedirect.com/science/book/9780080983493.
[16] Converting air pollutants. Sept. 10, 2021. URL: https://www.breeze-technologies.de/de/blog/air-pollution-how-to-convert-between-mgm3-µgm3-ppm-ppb/ (visited on 09/10/2021).
[17] CPCB (Central Pollution Control Board). URL: https://app.cpcbccr.com/ccr (visited on 01/19/2022).
[18] Data Mining with Weka. June 17, 2021. URL: https://www.futurelearn.com/courses/data-mining-with-weka (visited on 06/17/2021).
[19] Arthur T DeGaetano and Owen M Doherty. “Temporal, spatial and meteorolog-
ical variations in hourly PM2.5 concentration extremes in New York City”. eng.
In: Atmospheric environment (1994) 38.11 (2004), pp. 1547–1558. ISSN: 1352-2310.
[20] Ranil Dhammapala, Clint Bowman, and Jill Schulte. “A Monte Carlo method for summing modeled and background pollutant concentrations”. In: Journal of the Air & Waste Management Association 67.8 (2017). PMID: 28278032, pp. 836–846. DOI: 10.1080/10962247.2017.1294546. URL: https://doi.org/10.1080/10962247.2017.1294546.
[21] Joy Dutta et al. “AirSense: Opportunistic crowd-sensing based air quality mon-
itoring system for smart city”. In: 2016 IEEE SENSORS. 2016, pp. 1–3. DOI: 10.
1109/ICSENS.2016.7808730.
[22] Hamdy K Elminir. “Dependence of urban air pollutants on meteorology”. eng.
In: The Science of the total environment 350.1 (2005), pp. 225–237. ISSN: 0048-9697.
[23] EPA Criteria Pollutants. URL: https://www.cdc.gov/air/pollutants.htm (vis-
ited on 10/11/2021).
[24] Gini Vs. Entropy. URL: https://quantdare.com/decision-trees-gini-vs-entropy/ (visited on 10/05/2021).
[25] Global Forecast System. URL: https://www.ncei.noaa.gov/products/weather-climate-models/global-forecast (visited on 10/05/2021).

[26] S. Gnanambal et al. “Classification Algorithms with Attribute Selection: an evaluation study using WEKA”. English. In: International Journal of Advanced Networking and Applications (2018). URL: https://www.proquest.com/scholarly-journals/classification-algorithms-with-attribute/docview/2059600908/se-2.
[27] Jiawei Han. Data mining: concepts and techniques. eng. 3rd ed. Morgan Kaufmann series in data management systems. Burlington: Elsevier Science, 2011, pp. 279–328. ISBN: 9780123814807. URL: https://www.sciencedirect.com/science/book/9780123814791.
[28] Daniel J. Jacob and Darrell A. Winner. “Effect of climate change on air quality”. In: Atmospheric Environment 43.1 (2009). Atmospheric Environment - Fifty Years of Endeavour, pp. 51–63. ISSN: 1352-2310. DOI: 10.1016/j.atmosenv.2008.09.051. URL: https://www.sciencedirect.com/science/article/pii/S1352231008008571.
[29] Hermann Jakobs et al. “A real-time forecast system for air pollution concentra-
tions”. In: (Jan. 2002).
[30] Java SE 7. June 20, 2021. URL: https://www.oracle.com/java/technologies/
javase/javase7-archive-downloads.html (visited on 06/20/2021).
[31] Patrick L. Kinney. “Climate Change, Air Quality, and Human Health”. In: American Journal of Preventive Medicine 35.5 (2008). Theme Issue: Climate Change and the Health of the Public, pp. 459–467. ISSN: 0749-3797. DOI: 10.1016/j.amepre.2008.08.025. URL: https://www.sciencedirect.com/science/article/pii/S0749379708006909.
[32] Xun Li et al. “Smartphone Evolution and Reuse: Establishing a More Sustain-
able Model”. In: 2010 39th International Conference on Parallel Processing Workshops.
2010, pp. 476–484. DOI: 10.1109/ICPPW.2010.70.
[33] Assar Lindbeck and Dennis J Snower. “Multitask Learning and the Reorgani-
zation of Work: From Tayloristic to Holistic Organization”. In: Journal of labor
economics 18.3 (2000), pp. 353–376.
[34] Machine Learning with Java. June 21, 2021. URL: https://www.codingame.com/
playgrounds/7163/machine-learning-with-java---part-6-random-forest
(visited on 06/21/2021).
[35] Usha Mahalingam et al. “A Machine Learning Model for Air Quality Predic-
tion for Smart Cities”. In: 2019 International Conference on Wireless Communica-
tions Signal Processing and Networking (WiSPNET). 2019, pp. 452–457. DOI: 10 .
1109/WiSPNET45539.2019.9032734.

[36] Ioannis Manisalidis et al. “Environmental and Health Impacts of Air Pollution:
A Review”. eng. In: Frontiers in public health 8 (2020), p. 14. ISSN: 2296-2565.
[37] L Natsagdorj, D Jugder, and Y.S Chung. “Analysis of dust storms observed in
Mongolia during 1937–1999”. eng. In: Atmospheric environment (1994) 37.9 (2003),
pp. 1401–1411. ISSN: 1352-2310.
[38] Simon Parsons. “Introduction to Machine Learning by Ethem Alpaydin, MIT Press,
0-262-01211-1”. eng. In: Knowledge engineering review 20.4 (2005), pp. 432–433. ISSN:
0269-8889.
[39] H. Rhys. Machine Learning with R, the tidyverse, and mlr. Manning, 2020. ISBN:
9781638350170. URL: https://books.google.at/books?id=qjszEAAAQBAJ.
[40] Praveen Kumar Sharma et al. “IndoAirSense: A framework for indoor air quality
estimation and forecasting”. In: Atmospheric Pollution Research 12.1 (2021), pp. 10–
22. ISSN: 1309-1042. DOI: https://doi.org/10.1016/j.apr.2020.07.027. URL:
https://www.sciencedirect.com/science/article/pii/S130910422030218X.
[41] Anish Singh, Raja Kumar, and Nitasha Hasteer. “Comparative Analysis of Clas-
sification Models for Predicting Quality of Air”. In: 2020 IEEE 5th International
Conference on Computing Communication and Automation (ICCCA). 2020, pp. 7–11.
DOI : 10.1109/ICCCA49541.2020.9250805.
[42] M Sofiev et al. “A dispersion modelling system SILAM and its evaluation against
ETEX data”. eng. In: 40.4 (2006), pp. 674–685. ISSN: 1352-2310.
[43] Mikhail Sofiev and Pilvi Siljamo. “Forward and Inverse Simulations with Finnish
Emergency Model Silam”. In: Air Pollution Modeling and Its Application XVI. Ed.
by Carlos Borrego and Selahattin Incecik. Boston, MA: Springer US, 2004, pp. 417–
425. ISBN: 978-1-4419-8867-6.
[44] Per Undén et al. “HIRLAM-5 Scientific documentation”. In: (2002).
[45] Anthony J Viera, Joanne M Garrett, et al. “Understanding interobserver agree-
ment: the kappa statistic”. In: Fam med 37.5 (2005), pp. 360–363.
[46] WHO Air quality guideline values. Sept. 10, 2021. URL: https://www.who.int/news-room/fact-sheets/detail/ambient-(outdoor)-air-quality-and-health (visited on 09/10/2021).
[47] I. H. (Ian H.) Witten. Data mining: practical machine learning tools and techniques. eng. 3rd ed. Morgan Kaufmann, 2011. ISBN: 9780123748560. DOI: 10.1016/C2015-0-02071-8. URL: https://www.sciencedirect.com/science/book/9780123748560.

[48] Ian H. Witten. Data mining: practical machine learning tools and techniques. eng. 4th ed. Morgan Kaufmann, an imprint of Elsevier, 2017. ISBN: 9780128042915. DOI: 10.1016/C2015-0-02071-8.
[49] Yang Yuan Zhou and Mantilla. “Predicting Traffic Accidents Through Hetero-
geneous Urban Data: A Case Study”. In: In Proceedings of the 6th International
Workshop on Urban Computing(UrbComp 2017) (2017).
[50] Metin Zontul et al. “Wind speed forecasting using reptree and bagging methods
in Kirklareli-Turkey”. eng. In: Journal of Theoretical and Applied Information Tech-
nology 56.1 (2013), pp. 17–29. ISSN: 1992-8645.
