Predicting PM2.5 on Bangalore historical data using various machine learning and deep
learning algorithms

ACKNOWLEDGMENT:

ABSTRACT:
Satellite aerosol data allow historical PM2.5 levels to be estimated in areas that have only
recently begun systematic PM2.5 monitoring. Most previous models lost accuracy when
estimating PM2.5 levels outside the model-training period. In this study, an ensemble
machine learning technique was used to produce trustworthy PM2.5 hindcasts. Multiple
imputation was used to fill in the missing satellite data, and a spatial clustering strategy
was used to compensate for any unobserved spatial variation in the modelling domain,
Bangor. For each region, a random forest, a generalised additive model, and an extreme
gradient boosting model were trained. A generalised additive ensemble model was then
devised to combine the predictions from these methods. With a cross-validation R² (RMSE)
of 0.79 (21 µg/m³), the ensemble forecast accurately described the spatiotemporal spread of
daily PM2.5. The cluster-based subregional models outperformed national models,
increasing the CV R² by about 0.05. Compared with prior research, our model gave more
accurate daily (R² = 0.58, RMSE = 29 µg/m³) and monthly (R² = 0.76, RMSE = 16 µg/m³)
out-of-range predictions. Historical PM2.5 levels may be accurately predicted using our
hindcast modelling system.
Predicting the quality of the air we breathe is a major challenge for experts and researchers
in India's fight against pollution. Short- or long-term exposure to PM2.5 particulate matter
harms the human cardiovascular system, and studies have shown that long-term exposure can
lead to sudden death. In general, there are two ways to predict PM2.5: chemical transport
models (CTMs) and data-based (statistical) models. Many researchers have already applied
CTMs and data-based models to various places throughout the world. Chemical-based models
calculate the emission rates of air contaminants from the relevant chemical processes.
Data-driven models such as time series regression analysis, random forest, SVM, and various
hybrids such as CNN and LSTM, ARIMA and ANN, and deep RNN models have been employed by
numerous researchers. This paper also proposes a data-driven method for air pollution
prediction, focusing on predicting PM2.5 AQI levels from meteorological and pollutant data.
Meteorological data for the Bellandur area in Bangalore, Karnataka, were sampled from
Tutiempo Network, S.L., and data on the air pollutant PM2.5 (a major air pollutant) were
acquired from Open Weather Ltd. for the same period. This study used standard statistical
algorithms with tuned hyperparameters: linear regression (LR), decision tree (DT), random
forest (RF), and XG-Boost. The Random Forest algorithm provided the most generalizable
results, with R² = 70%, performing above the other models. The performance of each model
was compared with that of the others to determine the most generalised model for PM2.5
prediction. To evaluate the models' performance, the generalisation error was determined
using RMSE, MSE, and MAE.

Contents
ABSTRACT
CHAPTER 1: INTRODUCTION
BACKGROUND
PROBLEM STATEMENT
AIMS AND OBJECTIVES
SCOPE OF THE STUDY
SIGNIFICANCE OF THE STUDY
STRUCTURE OF THE STUDY
SCOPE OF THE STUDY
CHAPTER 2: LITERATURE REVIEW
INTRODUCTION
AIR POLLUTION IN ATMOSPHERIC SCIENCES
PARTICULATE MATTER
SOURCE OF PARTICULATE MATTER
PM2.5 HEALTH EFFECTS
AIR QUALITY INDEX
METHOD OF PREDICTING PM2.5
CHEMICAL TRANSPORT MODELS
DATA BASED MODELS
TRADITIONAL STATISTICAL METHODS
DEEP LEARNING METHODS
DISCUSSION
SUMMARY
CHAPTER 3: RESEARCH METHODOLOGY
INTRODUCTION
RESEARCH METHODS
DATASET DESCRIPTION
DATA DICTIONARY
DATA PROCESSING
PROPOSED METHOD/MODELS (REGRESSION)
XG-BOOST
ANALYTICAL FRAMEWORK
SUMMARY
CHAPTER 6: CONCLUSIONS AND RECOMMENDATIONS
INTRODUCTION
DISCUSSION AND CONCLUSION
CONTRIBUTION TO KNOWLEDGE
FUTURE WORK
References

CHAPTER 1: INTRODUCTION
Bangalore, the capital of the state of Karnataka, is renowned as the "Silicon Valley of India"
because of its high concentration of technology companies. Bangalore's population has grown
dramatically in the last few decades as a result of both urbanisation and globalisation. There
are many multinational and start-up enterprises in Bangalore, which makes it an ideal
location for businesses. As a result, air quality is a more significant consideration. (Wang,
2020)
What is the AQI (Air Quality Index) for Bangalore? Not as good as it was a few decades ago.
According to a new study, PM10 and PM2.5 are the most harmful pollutants in the air.
Essentially, they are microscopic airborne particles that can wreak havoc on the
cardiorespiratory system by clogging airways and causing irritation. (Wang, 2020)
Historical data can be used to predict PM2.5 concentrations, so statistical and machine
learning models based on historical data are being used to forecast PM2.5 levels.
Decision-Tree Regressor, KNN-Regressor, Linear-Regressor, Random-Forest Regressor,
XG-Boost-Regressor, and Artificial Neural Network are some of the more commonly used
models for such predictions. All of the aforementioned models will be used to predict PM2.5
concentrations in this study. Accurately predicting PM2.5 concentrations from a variety of
meteorological parameters can help avoid harm to human life. (Madhuri.N, 2021)
The primary contributors to air pollution are (1) reactive gases and (2) very fine
particles. As the quantity of these particles rises, so too does the risk of health
complications, reduced visibility, and even airport delays or cancellations. Air pollution
concentrations and extent vary among sites because of the kinds of industry, transport, and
other activities present in the immediate neighbourhood of each location. Fine particles
(PM2.5) have emerged as among the most damaging pollutants for public health. Particles
smaller than 2.5 µm in diameter are referred to as "fine particles" or "airborne
particles." Short- or long-term exposure to high concentrations of PM2.5 can cause cancer,
heart disease, respiratory and metabolic illnesses, and obesity-related illnesses, as well
as sudden death. (Xiao, 2018)
Bangalore, the capital of the Indian state of Karnataka, has been plagued by high pollution
levels. Data analytics can therefore be used to forecast the city's PM2.5 concentrations.
Such analysis can help airport operators take pre-emptive action before financial losses
occur, and can raise public health awareness. The AQI is a single value that represents the
concentrations of air pollutants in the atmosphere. According to the standardized AQI chart
published by the Government, a region can be categorised as excellent, poor, or severely
polluted depending on its AQI readings. (Kumar, 2021)
The visibility at the airport has been impacted by the fine particulate matter in urban areas.
Predictive analytics will use the visibility forecasts as their future score. In the long run, this
serves to lessen the impact on air traffic. These forecasts are also helpful in limiting the
damage to the economy. New rules/policies and services will be introduced to the public as a
result of proposed visibility prediction models.
This environmental problem can be addressed with predictive analytics and modelling.
Chemical transport models and data-based models have been widely used to predict air
contaminants. Among data-based models, researchers have employed time series-based LR, RF,
ARMA, LSTM, and other hybrid models to predict PM2.5 values using various types of data.
(Patil, 2021)
This study proposes a similar approach to estimate PM2.5 levels in Bangalore, Karnataka's
capital city, using meteorological data and statistical models including linear regression,
decision trees, random forests, and extreme gradient boosting. By predicting the
concentration of PM2.5 from observed meteorological data, it can enable government and
other agencies to take preventative steps in advance to reduce the harm that haze causes to
human productivity and life, which is highly important for human health.
BACKGROUND
As a result of air pollution, particulates, biological air pollutants, and other toxic
emissions or substances have become mixed into Earth's atmosphere. This causes a wide range
of health problems and even death for people and other animals, and over a long period of
time it harms the natural ecosystem.
CO, VOCs, NO, SO, and particulates (PM), including PM2.5, are the main pollutants in the
air. These chemicals have a negative impact on human health and the environment. With the
tremendous rise in urbanisation and industrialisation, the emission of aerosols is likewise
increasing on a daily basis. According to the EPA, PM2.5 is one of the air pollutants with
the most detrimental impact on human life and the public at large. This demonstrates how
important air quality has become in urban places. (In Proceedings of the 30th Chinese
Processing and Decision Conference, 'Prediction of Urban PM2.5 Levels Using Wavelet Neural
Networks')
Concern over the quality of our planet's air has grown as air pollution has risen. The air we
breathe contains a wide variety of toxins. Government departments responsible for the
environment, climate, and people pay close attention to PM2.5, the atmospheric pollutant
responsible for the haze in our environment. Particles in the atmosphere with an aerodynamic
diameter less than or equal to 2.5 micrometres are referred to as Particulate Matter
(PM2.5). Even at low concentrations, PM2.5 has a significant impact on air quality and
visibility. PM particles are so small in comparison with other air pollutants that they are
difficult to detect. Power plants, industrial production, and vehicle emissions all
contribute to the high concentration of harmful and toxic PM2.5 in the air. The wind can
carry these tiny particles great distances, which has a significant effect.
(Akiladevi R, 2021)
The government generally informs the public about air quality in the form of an air quality
index (AQI). Using the concentration levels of pollutants such as PM2.5, oxides of sulphur
and nitrogen, ozone, carbon monoxide, and sulphur dioxide, it can be determined which
pollutant has the most influence on human health. PM2.5 is the most dangerous air pollutant
and has a direct impact on human health, hence estimating its concentration on an hourly or
24-hour basis is of great scientific importance.
According to recent studies, ozone, NO, and PM2.5 concentrations can be predicted using
climatic data, sources of air pollutants, and local topography. According to Seinfeld,
meteorological elements interact with pollutants and toxins, which are then transported to
other regions. As a result, meteorological conditions have a significant impact on air
pollution levels, and researchers are attempting to predict PM2.5 levels based solely on
meteorological data. (Shi, 2018)
Two approaches are used to predict PM2.5 levels: (1) chemical models and (2) data-driven
models. Chemical models are built from chemical formulae and chemical reactions, while
data-driven models are constructed from observational data and statistical approaches.
Monitoring stations have been constructed in a number of cities across the world that record
all weather data and the average concentration of air pollutants on an hourly, eight-hour,
or 24-hour basis. Models such as MLR and SVM, or deep learning-based models such as ANN and
CNN, are used to predict PM2.5 in recent research. Models are evaluated using the mean
squared error and root mean squared error.
Similar to previous work, this study proposes data-based models such as Multiple Linear
Regression, Decision Trees, Random Forest, and XG-Boost to generate accurate predictions of
PM2.5 concentration levels, and uses the MAE, MSE, and RMSE measures to evaluate the models'
performance. Finally, the most generalizable prediction model is selected. Pollution and
weather data for Bangalore, Karnataka's capital city, were compiled from numerous sources.
These PM2.5 forecasts will help determine the extent of haze in a given area based on a
variety of meteorological factors. They allow governmental and other authorities to take
preventative action in advance to minimise the damage that haze does to human productivity,
for the benefit of the people. (Ma, 2021)
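The model-selection step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's pipeline: the wind-speed and PM2.5 values are made up, the two candidate models (a mean baseline and a one-feature least-squares line) merely stand in for the regressors named in the text, and the helper names fit_mean, fit_line and cv_rmse are hypothetical.

```python
import math

def fit_mean(xs, ys):
    """Baseline model: always predicts the mean of the training targets."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """One-feature ordinary least squares; returns a predict function."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def cv_rmse(fit, xs, ys, k=4):
    """Average RMSE over k contiguous held-out folds."""
    n = len(xs)
    fold_scores = []
    for fold in range(k):
        test_idx = set(range(fold * n // k, (fold + 1) * n // k))
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        predict = fit(train_x, train_y)
        squared = [(ys[i] - predict(xs[i])) ** 2 for i in test_idx]
        fold_scores.append(math.sqrt(sum(squared) / len(squared)))
    return sum(fold_scores) / k

# Hypothetical data (not from the study): wind speed in km/h, PM2.5 in µg/m³
wind = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]
pm25 = [88.0, 81.0, 75.0, 66.0, 61.0, 52.0, 47.0, 39.0]

results = {
    "mean baseline": cv_rmse(fit_mean, wind, pm25),
    "linear regression": cv_rmse(fit_line, wind, pm25),
}
# The candidate with the lowest held-out error is treated as most generalizable.
best = min(results, key=results.get)
```

Scoring each candidate only on data it was not trained on is what makes the comparison a test of generalisation rather than of fit to the training set.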
PROBLEM STATEMENT
The problem addressed by this research is to apply supervised ML algorithms (Decision-Tree
Regressor, KNN-Regressor, Linear-Regressor, Random-Forest-Regressor, XG-Boost-Regressor) and
a deep learning algorithm (Artificial Neural Network) to predict the hourly PM2.5
concentration level for Bangalore city from 2010 to 2021. Different ML and DL models are
tried out, their results are compared, and the most generalizable model is chosen.
Numerous researchers have applied data-based models in this context, such as regression
analysis, LSTM, RF, and SVM, as well as hybrid models combining ARMA with CNN or NN. These
studies have incorporated various historical and meteorological data. To date, many studies
around the world have used ground-measured PM2.5 data, auxiliary records, APM site
geospatial data, temporal information such as period, season, and day of the week,
satellite sensor readings, timestamp data, and meteorological data to predict PM2.5 for a
variety of purposes, including air pollution monitoring. With the addition of weather data,
these studies are providing good results for PM2.5 prediction. Meteorological data, major
sources of air pollution, and local topography are reported to be the most important
factors in determining PM2.5 concentrations. Meteorological considerations therefore play a
significant role in predicting PM2.5 levels in the atmosphere.
The issue this research addresses is helping the authorities to implement preventive steps
to protect public health and to create new policies that improve air quality, which
eventually minimises the damage to the public from fine particulates (PM2.5). To that end,
this research uses meteorological data gathered from Tutiempo Network, S.L. by web scraping
for the Bellandur area, Bangalore, for 2013-2016; PM2.5 data were collected for much the
same period from a third-party AQI source. (Xiao, 2018)
To make accurate predictions of PM2.5 concentrations, this study uses solely meteorological
and PM2.5 AQI data and classic statistical models such as linear regression, DT, RF, and
XG-Boost. The models were compared with one another to determine which was the most
generalizable in its ability to predict PM2.5 concentrations. The models are evaluated using
performance measures such as MAE, MSE, and RMSE.
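For reference, the three measures named above can be computed as in the following sketch. MAE is the mean of the absolute residuals, MSE is the mean of the squared residuals, and RMSE is the square root of MSE. The function name regression_errors and the sample values are assumptions for illustration only, not the study's code or data.

```python
import math

def regression_errors(actual, predicted):
    """Return MAE, MSE and RMSE for paired observed and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    residuals = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(r) for r in residuals) / len(residuals)
    mse = sum(r * r for r in residuals) / len(residuals)
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse)}

# Hypothetical daily PM2.5 values in µg/m³ (illustrative, not study data)
observed = [40.0, 55.0, 62.0, 48.0]
forecast = [42.0, 50.0, 65.0, 48.0]
scores = regression_errors(observed, forecast)
```

RMSE is expressed in the same unit as the target (µg/m³) and penalises large misses more heavily than MAE, which is one reason the two are usually reported together.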
AIMS AND OBJECTIVES


Observational meteorological data from Bangalore, India (2013-2016) will be used to develop
a statistical approach to the PM2.5 air quality index. The aim of the project is to predict
PM2.5 values using meteorological data and to provide solutions that let climate and
environmental agencies take preventative action in advance, for the benefit of human health
and to reduce haze damage to human productivity and life. (UnjinPak, 2020)
The research objectives are formulated based on the aim of this study and are as follows:
1. To try different ML and DL approaches for modelling and predicting PM2.5.
2. To understand the pattern and relationship between historical data and air pollutant
PM2.5 concentration levels.
3. To propose various types of supervised ML and DL algorithms and select the simpler and
more generalized model to predict the PM2.5 level.
4. To evaluate the performance of machine learning and deep learning algorithms.
SCOPE OF THE STUDY
Traditional statistical models are being developed for the prediction of daily PM2.5 values
using meteorological data. There are numerous ways to meet the stated goals, but the scope
of this study is limited as follows.
1. Features or variables gathered from many sources can be combined in a single framework.
However, this study experiments with data from a single air monitoring station in
Bangalore, India, combining meteorological data with air pollutant (PM2.5) data.
2. These predictions can be made using a wide range of classical and deep learning
methods. This study looks only at classic models such as MLR, DT, RF, and XG-Boost,
which are well suited to small datasets and regression tasks.
3. This study is focused on Bangalore, Karnataka's capital city, because of the city's air
pollutants and high concentrations of PM2.5.
SIGNIFICANCE OF THE STUDY
India's continually growing population is a major contributor to the country's rising
pollution levels. The rapid rise of India's "Silicon Valley" and the growth of IT companies
are also factors in Bangalore's population growth. This results in a high reliance on
automobiles. Air quality is deteriorating as a result, as additional contaminants mix with
the fresh air. People's health and well-being are seriously harmed by fine particulate
matter (PM2.5).
Because of the tremendous increase in the global population, air pollution is becoming ever
more of an issue. The number of cars on the road and the amount of toxic pollution emitted
by factories are steadily rising, and the air is polluted as a result. Particulate matter
(PM2.5) is the most harmful of all air contaminants to human health. Using data from an air
monitoring centre (Bellandur, Bangalore), this study predicts PM2.5, the intensity level of
air pollution. (Xiao, 2018)
So, it is important to forecast PM2.5 concentration level in air using historical data collected
by monitoring systems.

STRUCTURE OF THE STUDY


This part explains the structure of the research. Chapter 1, Section 1.1 presents background
research on air pollution, including pollution sources, pollutants' impacts on human
health, and methodologies for predicting PM2.5 concentrations. The problem statement and
focus area of the study are discussed in Section 1.2, while Section 1.3 outlines the aims
and objectives of the study. Similarly, the study's scope and significance are discussed in
detail in Sections 1.4 and 1.5.
Section 2.2 of Chapter 2 describes the theoretical background of the domain and highlights
challenges in atmospheric science. It also covers particulate matter, its sources, and its
health impacts, and explains what air quality measurements are and why they are important
to know. Section 2.3 discusses data pre-treatment, feature engineering, and techniques.
Section 2.5 notes the research gaps revealed by the studies that preceded this one.
Section 2.6 concludes the review with a summary of the findings. (Zeng, 2016)
Chapter 3 provides a quick overview of the methodology. Section 3.2 describes the data
pre-treatment stages, such as addressing missing values, univariate and bivariate analysis,
multicollinearity, and feature selection/importance strategies. Section 3.3 reviews the
mathematical and theoretical underpinnings of the proposed methods, such as linear
regression, decision trees, random forests, and XG-Boost. The chapter then covers
performance indicators such as MAE, MSE, and RMSE, and finally the analytical framework of
the research and a summary.
The analyses and designs used to carry out this research are discussed in detail in
Chapter 4. Section 4.2 deals with univariate analysis and determining whether the variables
are normally distributed. In Section 4.3, bivariate analysis is performed using scatter
plots and pair plots to determine correlations between variables and detect
multicollinearity. Boxplots are used to identify outliers, and the interquartile range
approach is used to handle them. Section 4.4 uses tree-based models to select and
prioritise features. Section 4.5 discusses the configurations employed while optimising the
model parameters for model construction. Sections 4.6 and 4.7 describe the process design
and the tools and technology deployed for this project. (Sun, 2015)
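The interquartile range approach mentioned above could look like the sketch below. The text does not say how flagged outliers are handled, so clipping them to the boxplot fences (Q1 - 1.5 IQR and Q3 + 1.5 IQR) is an assumption made here for illustration; dropping the rows would be an equally plausible reading. The function names and the readings are hypothetical.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return the lower and upper boxplot fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def clip_outliers(values, k=1.5):
    """Clip values outside the fences to the nearest fence (assumed handling)."""
    low, high = iqr_bounds(values, k)
    return [min(max(v, low), high) for v in values]

# Hypothetical PM2.5 readings in µg/m³ with one implausible spike
readings = [38.0, 41.0, 45.0, 47.0, 50.0, 52.0, 55.0, 300.0]
cleaned = clip_outliers(readings)
```

Clipping keeps the series the same length, which matters when the rows must stay aligned with the meteorological features recorded for the same timestamps.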
Sections 5.1 and 5.2 of Chapter 5 cover, respectively, the findings produced by the proposed
models and the performance measures used to assess the models on the test data.
Finally, Chapter 6 explores the study's conclusions, its contribution to the field, and its
recommendations for further research.
SCOPE OF THE STUDY
To enable preventive action to manage air quality and, eventually, improve human life, this
study aims to inform the government of India and the public about the levels of PM2.5 that
should be expected. It thereby aims to improve people's quality of life and increase
awareness of the increasing concentration of PM2.5 in the air.

CHAPTER 2: LITERATURE REVIEW


INTRODUCTION
This chapter describes air quality and how much it affects human health from the
perspective of atmospheric science. Particulate matter air pollution and its sources are the
primary topics of discussion. The AQI, or air quality index, is also discussed, along with
how it relates to human health. Section 2.4 discusses the various methods for predicting air
pollutants, and the methods that researchers around the world use to analyse different types
of data. Finally, the chapter summarises its main points and discusses the knowledge gaps
that remain.
AIR POLLUTION IN ATMOSPHERIC SCIENCES
According to the World Health Organization (WHO), 9 out of 10 people in the world breathe
air with high levels of pollutants, and air pollution kills about 7 million people a year,
of which 4.2 million deaths are attributed to ambient (outdoor) air pollution. The effects
of air pollution are therefore widespread and severe, and atmospheric science researchers
are increasingly concerned about this issue.
Air pollution is a common type of pollution. Wind carries a variety of hazardous pollutants
and gases from one location in a city to another. Air pollution that has a negative impact
on human health can be characterised as excessive quantities of toxins or pollutants in the
air. Air pollution has become the most pressing environmental concern for human beings
around the globe, and many scientists around the world are working to reduce emissions and
improve air quality. (Zhong, 2019)
The two most common types of air pollution are (1) smog (ground-level ozone) and (2)
particulate matter, or soot. Smog is created when fossil fuel emissions react with
sunlight. Soot is created when chemicals, soil, smoke, and dust, in the form of gas or
solids, are carried in the air. Both smog and soot pollution are caused by the combustion of
fossil fuels in automobiles, lorries, factories, power plants, incinerators, and other
engines. In the city of Bangalore, organisations such as the Central Pollution Control
Board and the System of Air Quality and Weather Forecasting and Research (SAFAR) measure AQI
values for sulphur dioxide (SO2), nitrogen oxides (NOx), lead (Pb), ozone (O3), carbon
monoxide (CO), and particulate matter (PM).
This study focuses mostly on fine particulate matter (PM2.5) and its impact on public
health.
PARTICULATE MATTER
Liquid and solid particles in the atmosphere make up PM, a common air pollutant that may
be found throughout the world. Particulate matter (PM) is among the most dangerous
pollutants in the atmosphere because it contains many pollutant species in a wide range of
concentrations. Toxic gases released by coal-fired power stations and paper mills are the
primary sources of PM in the atmosphere. The physical and chemical features of PM change as
it spreads from one location to another. Sulphates, nitrates, and ammonium are only a few of
the inorganic ions that make up particulate matter. Particulate matter also contains
biological elements such as allergens and microbial substances. (Huang, 2015)
SOURCE OF PARTICULATE MATTER
Particulate matter can be divided into two categories based on particle size. Pollutants with
particle sizes of less than 10 micrometres are referred to as PM10, or the "Coarse
Fraction." Likewise, particles smaller than 2.5 micrometres are referred to as the "Fine
Fraction" and designated PM2.5. PM2.5 is the most dangerous pollutant to human health,
followed closely by PM10.
Diesel and gasoline cars are the principal contributors of particulate matter in the
atmosphere. Emissions are also produced by tobacco smoke and by the combustion of coal and
biomass in a variety of domestic and industrial settings. According to research, cement
plants, ceramics and brick industries, and mining operations are also major sources.
Emissions from volcanoes, dust storms, and woodland or grassland fires can travel hundreds
of kilometres from the originating area.
PM2.5 HEALTH EFFECTS
According to the WHO, Karnataka's capital city, Bangalore, is now one of the world's most
polluted cities. In India, air pollution kills an estimated 2 million people each year,
making it the country's fifth-leading cause of death. It is also reported that India has the
highest mortality rate in the world from respiratory and asthma-related conditions.
Specifically, 50% of students in the capital area are displaying serious lung damage as a
result of exposure to the pollutant.
PM2.5 particles are extremely small in diameter and can be inhaled, so they easily penetrate
the thoracic region of the respiratory system. A variety of health impacts, including
breathing and heart-related difficulties, can occur after short-term (hours to days) or
long-term (months to years) exposure, and long-term exposure can even result in sudden
death. Tiny airborne particles, whether liquid or solid, can enter the lungs and
bloodstream and cause bronchitis, which in turn can lead to heart problems. Children and
the elderly are particularly vulnerable to the harmful effects on their eyes, throats, and
lungs. Asthma and allergy sufferers are also particularly vulnerable, as fine pollutants
can exacerbate their symptoms. (Trung, 2021)
AIR QUALITY INDEX
The AQI is an index that reports the quality of the air on a daily basis. The index value
indicates whether the air is clean or contaminated and the accompanying health
implications for humans. Air pollution can cause a variety of health issues depending on
the AQI score. The recommended AQI value for each pollutant, in terms of public health, is
set using AQI calculations for the major air pollutants.
With the use of the Air Quality Index (AQI), regulatory and pre-emptive measures can be
taken to limit the health impacts of air pollution. It also serves as a means of educating
the public about the importance of reducing emissions from indoor and man-made sources. Air
quality index values above 70 indicate increased levels of pollution and a correspondingly
worsened quality of life for everyone. Real-time pollutant AQI values are associated with
colour schemes and images as well as a variety of category names and associated health
issues.
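The mapping from an AQI value to a category name can be sketched as a simple lookup. The bands below are an assumption: they follow the six categories of India's National Air Quality Index as commonly published, which are not stated in this text and differ from the informal "excellent, poor, severe" wording used earlier. The names AQI_BANDS and aqi_category are hypothetical.

```python
# Assumed bands (upper bound, label), modelled on India's National AQI scheme.
AQI_BANDS = [
    (50, "Good"),
    (100, "Satisfactory"),
    (200, "Moderately polluted"),
    (300, "Poor"),
    (400, "Very poor"),
    (500, "Severe"),
]

def aqi_category(aqi):
    """Return the category label for a non-negative AQI value."""
    if aqi < 0:
        raise ValueError("AQI cannot be negative")
    for upper, label in AQI_BANDS:
        if aqi <= upper:
            return label
    return "Severe"  # values beyond the last band are treated as Severe

# Hypothetical readings for illustration
labels = [aqi_category(v) for v in (35, 151, 450)]
```

Keeping the bands in a data table rather than in nested conditionals makes it straightforward to swap in a different national scheme.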
METHOD OF PREDICTING PM2.5
Predictions of air pollution can be made using two different methods: (1) chemical
transport models (CTMs) and (2) models based on statistical data. The CTM mechanism is used
to forecast how diverse chemical compositions may cause air pollution. Data-based
approaches estimate air quality without relying on theoretical models. These data models
are of two kinds: (1) traditional statistical methods and (2) models based on deep learning.
CHEMICAL TRANSPORT MODELS
CTM-based models are built on differential equations for chemical and physical processes.
They are predictive in nature: given the emission rates of the relevant air contaminants and
their precursors, they predict concentration values numerically. As stated by Manders, the
CTM processes include meteorology, transport and diffusion, and chemical transformation.
Even though the physicochemical processes are the same for all CTMs, the chemical composition
and particle size distribution vary from one CTM to another. EMEP, WRF-Chem, CMAQ, &
LOTOS-EUROS are just a few of the open-source CTMs now in use. The next paragraphs describe
studies based on these approaches. (Ma, 2021)
Only a limited amount of research has been done on air pollution in this region so far. An
early study used pollution data for the Bangalore region between 1990 and 2000 and employed
trajectory analysis to gauge air quality. The Emission Archive for Global Atmospheric Studies
and the Comprehensive Emission Registry for Transport Emissions were the sources of data for
this period, although both hold only a limited amount of information.
The researcher then built reliable databases of air pollutants on a decade-by-decade basis
for components such as CH, CO, NO2, SO, and TSP. According to the study, power plants and
transportation were the primary sources of CO and particulate matter between 1990 and 2000,
and particulate matter levels were higher than recommended for the period.
A subsequent study quantified vehicle emissions in Bangalore between 2000 and 2005 using
vehicle counts, utilisation factors, and emission factors. The information was gathered from
several sources, including the Central Research Institute's external traffic data (traffic
from outside the city, as well as traffic passing through the city). One of the study's goals
was to calculate road transport emission concentrations and trends. The emission factor was
calculated using an activity-based
technique. During the period 2002-2005, two-wheelers produced 40% of the TSP emissions,
followed by cargo vehicles (29%), buses (19%), cars (10%), and automobiles (8%). (2
percent). TSP levels rose from about 8 Gg to 10 Gg between 2000 and 2005 (an increase of
around 31 percent). Researchers began to pay more attention to air pollution prediction once
SPM concentrations were found to be elevated in comparison with previous studies. (Patil,
2021)
Bangalore's population and use of automobiles have grown rapidly in the last decade as a
result of urbanisation. Particulate matter emissions from road traffic in Bangalore were the
focus of Sindhwani and Goyal's 2014 study, which reported that particulate matter emissions
had risen after 2006 as a result of increased vehicle use. Using the emission analysis and
activity-based approach of the earlier work, the study examined trends in PM emissions from
various types of vehicle over the years 2006-2010 and found that two-wheelers (2W) emitted
60 percent more than estimated in other studies.
DATA BASED MODELS
Beyond CTM-based techniques such as trajectory analysis & emission factor analysis, air
pollution and its patterns can now be assessed using data mining techniques. Data-based
prediction of air quality can be done in two ways: 1. traditional statistical methods and
2. deep learning.
TRADITIONAL STATISTICAL METHODS
Recent research has used time-series analysis, regression analysis, SVM, Random Forest, and
ARMA as the more standard statistical methods. A few of these studies are described below.
Using data mining tools, one researcher examined current air pollution patterns and forecasts
for Bangalore. The study's data, covering the years 2011-2015, was collected from the CPCP.
In the pre-processing step the data was cleaned by removing noise and filling in missing
entries with appropriate values, and features were analysed and scaled in accordance with the
distribution of the data. RStudio and Tableau were used to visualise the prediction outcomes.
Descriptive analysis was used to assess and describe the essential characteristics of the
sample data. Predictive analysis, using statistics & machine learning approaches, was then
used to forecast a continuous or discrete variable from previous data. (Ma, 2021)
This researcher applied predictive analysis in the form of time-series regression on
time-stamped data, while Taneja proposed a multilayer perceptron together with linear
regression algorithms. (Kumar, 2021) The study determined that pollution levels in Bangalore
tend to rise during the winter and fall during the summer and monsoon seasons.
Since then, Shah, A. K. and Singh, A. B., Dahiya have suggested three distinct training
algorithms for predicting PM2.5 levels: Bayesian Regularization, Levenberg-Marquardt (LM),
and scaled conjugate gradient (SCG).
The climatic data was obtained from the IMD in Bangalore, and the target data (PM2.5) from
the US Embassy's website for Bellandur, Bangalore. The dataset has five columns, each with
6740 observations. The values were recorded at 3-hour intervals over the 29 months from
2016 to mid-2018 and comprise the hour of the day, temperature, wind speed, humidity &
rainfall. Missing values were filled with the average of the observations during data
preparation. In this study, 70 percent of the data was used for training, 15 percent for
testing, and the final 15 percent for validation.
The Bayesian regularised ANN is more robust to overfitting than standard back propagation and
reduces the need for time-consuming cross validation. It had a training accuracy of 95.64
percent and a test accuracy of 94.99 percent.
The Levenberg-Marquardt algorithm is mostly used for non-linear least-squares problems. It
combines the Gauss-Newton and gradient descent methods to reduce the sum of squared errors.
The LM algorithm's training and test accuracy were 95.64% and 91.86%, respectively.
Because the search is performed in conjugate directions and the step size is varied at each
iteration, the scaled conjugate gradient method converged fastest. This method showed a
regression accuracy of 92.06 per cent during both the test and training phases.
In another study, the authors sourced climate and PM2.5 data from the US State Department.
The data set contained hourly PM2.5 measurements from 2009 to 2016 for the city of Beijing.
The authors used first-order linear interpolation to fill in the gaps left by missing values
and modelled the data with ARMA, NN, and SVM. They found that the ARMA model performed well
in predicting short-term time-series variation but lost its capacity to forecast long-term
variation. Because the ARMA model assumes a linear link between the predictor and the
response variable, the NN and SVR experiments produced more accurate results than the ARMA
model did. RMSE (Root Mean Square Error) and SSE (Sum of Squared Error) were the performance
measures employed in this study. (Madhuri.N, 2021)
In a further study, data on a variety of air contaminants was collected from each monitoring
station in the country from Sept 2013 to July 2015. Temperature, relative humidity (RH), and
wind direction were taken from a platform operated by the China Meteorological Bureau
(http://en.weather.com.cn). The model was built from the significant prognostic and
meteorological variables, selected by computing a correlation coefficient for each
meteorological feature over its averaging period. Neural networks were used to produce
three-day forecasts of ground-level PM2.5 concentrations. The study evaluated model
performance in two ways: as regression, using accuracy measures, and as classification,
using critical threshold values. MSE and RMSE measured global accuracy, and TPR and FPR
served as performance indicators for classification. (Huang, 2015)
Random Forest, XG-Boost, and deep learning models have also been proposed for multi-source
sensing data. One study analysed ground-level PM2.5 AQI & meteorological data from 2015 to
2018, together with geographic information (longitude and latitude) and temporal features
(year, weekday, and season). The data was obtained from the Air Pollution Monitoring (APM)
station and the National Department of Environment (ICT website).
Interpolation was used to fill in missing values as part of the pre-processing stage. Because
the collected data was not in a machine-readable form, it was converted into a format
suitable for modelling and combined with the other data. Random Forest used the feature
importance approach to select the most essential features, while XG-Boost used recursive
feature elimination to remove features that led to a high MSE. The deep learning models used
the feature permutation method to choose features. The features were standardised by
subtracting the mean, and feature scaling was used to bring the data for the learning models
into the range [-1, 1]. (Patil, 2021)
The Random Forest model was fitted with hyperparameter tuning: a grid search with 10-fold
cross validation was run using R2 as the selection metric, and MAE & RMSE were used to gauge
performance. The hyperparameters of the Extreme Gradient Boosting model were likewise tuned
with Grid Search CV, using 10-fold cross validation and R2 as the metric. This model
outperformed the random forest model, with lower error values than the previous model (RF).
A Deep Neural Network (DNN) with a six-layer architecture was trained with the Adam
optimizer, and L2 and L1 regularisation were used to reduce overfitting. The DNN model
nevertheless had low R2 values and only a small decrease in final evaluation errors. The MSE
& RMSE metrics showed that the XG-Boost model performed better than the competition.
Using satellite remote sensing to estimate PM2.5 concentration levels, researchers created a
national-scale GWR framework for PM2.5 concentration forecasts in China (Ma and colleagues,
2014).
Sun Wei and Sun Jingyi proposed daily PM2.5 concentration forecasting based on principal
component analysis and a least squares support vector machine optimised by the cuckoo search
algorithm.
Another study's authors asserted that many parametric algorithms have been employed, whereas
the number of non-parametric algorithms is quite limited. The author therefore proposed using
Random Forests to predict daily PM2.5 concentrations in the conterminous United States.
In some cases, current methods for predicting air pollution are unable to account for
long-term dependencies and spatial correlations. For this reason, the author of one article
collected data from 12 Beijing air monitoring sites between January 1, 2014, and March 28,
2016, and proposed an LSTM NN model for predicting atmospheric contaminants. LSTM layers were
used to extract key features from the historical data, and meteorological & time stamp data
were integrated into the model (LSTME) to achieve high performance. The author compared LSTME
against the ARMA model, the SVM model, the time delay neural network, and the standard LSTM
model, and it produced superior results. Predictions performed well for 1-hour tasks, with a
MAPE of 11.93 percent, and satisfactorily for 3-to-24-hour tasks, with a MAPE of 31.47
percent.
DEEP LEARNING METHODS
More recently, authors have presented a number of deep learning approaches. The multi-layer
architecture in deep learning is used to extract inherent features from the data, layer by
layer, in order to find patterns. The most important deep learning studies are as follows.
Many climate-based Eulerian and Lagrangian models, as well as trajectory models, were found
to be the most commonly used models in Japan to estimate PM2.5 concentrations. One study
proposed deep recurrent neural networks (DRNN) with a pre-training approach (DynPT) that
uses auto encoders. According to its results, the proposed DRNN, a neural net with recurrent
connections, outperformed more conventional methods such as multiple linear regression or an
NN without such connections in predicting daily average air pollution. (Kumar, 2021)
Using data from 35 Beijing AQI monitoring stations, one author developed a Wavelet Artificial
Neural Network model for predicting PM2.5 levels. Interpolation and the random forest
algorithm were used to handle outliers and missing values as part of feature engineering. A
sliding window approach and high-dimensional temporal features were used to augment the
training data and improve prediction accuracy, and a feature integration technique was
applied to mine the features. Using the Light-GBM technique, the authors recommended
combining predicted data and historical information to build the final dataset for air
quality prediction. The proposed model was compared with XG-Boost, Cat-Boost, and DNN, and
the most generalizable model was identified by comparing the performance indicators MAE and
MSE. The work, on predicting urban PM2.5 concentrations with a Wavelet Neural Network,
appeared in the Proceedings of the 30th Chinese Control & Decision. (Shi, 2018)
Another author made extensive use of Shanghai's environmental data. The data set was gathered
manually over 3 years (2015 to 2017): the pollution and weather data from 2015 and 2016 was
used as the training set and the 2017 data as the test set. For PM2.5 prediction, CNN and
LSTM were combined in a new model, with the CNN extracting the spatial characteristics of air
pollution and the LSTM extracting the time-series features. The author found that the plain
LSTM model, which lacks spatial correlation features, predicted significantly worse than the
proposed model (CNN+LSTM). Incorporating time-series data together with spatial features
into a model could therefore result in better performance. The study concluded that back
propagation, CNN, RNN, and LSTM NN on their own were not acceptable for spatiotemporal
prediction due to low performance. Each model's performance was compared using measures such
as RMSE and CORR to see which one was the most generalizable, and regularisation techniques
were employed to avoid over-fitting.
A further study developed a hybrid model combining ARIMA and an Artificial Neural Network as
a means of improving PM10 prediction accuracy. The author claims that the hybrid model was
more accurate than either model used independently. However, more research is needed because
of the population's vulnerability to rising levels of air pollution and the need to identify
effective means of monitoring and controlling it.
DISCUSSION
Up until now, the air quality index for various air contaminants has been predicted using a
variety of statistical methodologies, including time series analysis, ARMA, linear
regression, RF & SVM. PM2.5 concentrations have been forecast from three-hourly averages of
time series data that include temperature, wind speed/direction, relative humidity and
rainfall. The data was then used to train models with Bayesian Regularization,
Levenberg-Marquardt (LM) & scaled conjugate gradient (SCG). (Huang, 2015)
However, the meteorological data used in that study is not hourly. Climate and PM2.5 data
for the years 2013-2016 were used in this study, which focused on Delhi, India's capital
city. The pace of urbanisation and industrialisation there is accelerating at an alarming
rate, which shows that additional research is needed to regulate PM2.5 AQI levels (which have
a greater influence on human health) in a way that protects human health.
Following this review of prior studies, PM2.5 values will be predicted in this study using
statistical models based solely on weather data. Bangalore, among the most polluted cities in
the world, has been studied using observational meteorological data from 2013 to 2016. Linear
regression, decision trees, random forests, XG-boost, and other methods were used to make the
predictions, and regression metrics were used to evaluate their accuracy.
SUMMARY
The aim of this document is to examine the available research on the accuracy of predicting
PM2.5 concentrations. Section 2.2 addresses the types of pollution and where they come from.
It then explains why the PM2.5 air pollutant was chosen as the focus of this investigation,
what the main sources of particulate matter are, how it affects people's health, what the
AQI assesses, and why it is computed for each pollutant. Section 2.3 describes the two
alternative strategies for estimating PM2.5 found in the literature: 1. chemical transport
models and 2. data-based models. Section 2.3.1 addresses how researchers implemented the
various processes of CTM-based models in their investigations. For the data-based models,
researchers have developed traditional approaches such as time series and linear regression
as well as random forest, SVM, and ARMA models. For deep learning, numerous researchers have
put forward models and hybrids such as back propagation, Deep RNN, Wavelet Neural Network,
CNN + LSTM and the ARIMA+ANN model. (Akiladevi R, 2021)
Among traditional statistical models, researchers have so far tried LR, RF, and SVM with
various climate, spatiotemporal, meteorological, sensor, and other types of data. A forecast
of PM2.5 concentration levels is possible using solely meteorological data gathered at an air
quality monitoring facility. In this study, standard statistical methods including linear
regression, decision trees, random forests, and XG-boost are tested and evaluated using
regression metrics.

CHAPTER -3 RESEARCH METHODOLOGY

INTRODUCTION
This chapter describes the research approach planned for this project, including the
description and preparation of the data. Data on meteorological conditions and PM2.5 AQI will
be collected from various sources, and AQI readings and their associated health implications
will also be mentioned. Data processing begins with checking for missing values, univariate
analysis, and checking for normality. Bivariate analysis, handling multicollinearity,
outlier detection, feature scaling & feature selection/importance approaches are discussed
next, and figures and tables related to this investigation are presented. Section 3.3
explores linear regression, decision trees, random forests, and XG-boost models. Section 3.4
discusses the performance measures MAE, MSE, and RMSE, which are used to compare the models
and identify the most suitable one for prediction. The framework of this research is
outlined in section 3.5, and the chapter is summarised in section 3.6.
RESEARCH METHODS
Researchers have so far predicted the concentration of air pollutants in the air using a
variety of methods. They collected historical data from a variety of public and commercial
sources and applied a variety of pre-processing approaches to it. They proposed many standard
statistical and machine learning models for in-depth analysis and precise PM2.5 predictions,
and model performance has typically been assessed using MSE and RMSE.
Similarly, in this study data will be gathered from two separate sources and combined to
generate the final dataset. The raw data will then be processed using relevant
pre-processing techniques. MLR, DT, RF, & XG-Boost algorithms will be used as the candidate
models, and each model's predictive ability will be examined in order to improve accuracy.
Model performance will then be evaluated using the MSE, MAE, and RMSE performance indicators.
The details of the proposed methodology are given below. (Kumar, 2021)
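The three error measures named above can be computed directly from paired observed and
predicted values. The sketch below uses the standard definitions; the function names and the
small example values are illustrative assumptions, not results from this study.

```python
from math import sqrt


def mae(y_true, y_pred):
    """Mean absolute error: average size of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error: average of the squared errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the MSE, in the units of y."""
    return sqrt(mse(y_true, y_pred))


# Illustrative values only (hypothetical observations and predictions).
observed = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]
print(mae(observed, predicted), mse(observed, predicted), rmse(observed, predicted))
```

Because MSE and RMSE square the errors, they penalise a few large misses more heavily than
MAE does, which is one reason to report them side by side.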
DATASET DESCRIPTION
The data required for this research is collected from Tutiempo Network S. L., a weather
forecast provider that covers the majority of cities in countries across the globe. The data
for Bangalore is scraped from this website using the requests library for Python; sample
data for the month of January 2021 is at
[https://en.tutiempo.net/climate/01-2021/ws-432950.html]. The data covers 2010 [Jan – Dec]
to 2021 [Jan – Dec] and has 8 predictor variables: T, TM, Tm, SLP, H, VV, V, VM. The target
variable, PM2.5 concentration level, is collected from [https://openweathermap.org]. The
final dataset contains 9 variables and 4320 observations.
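Combining the two sources amounts to joining the weather rows and the PM2.5 rows on their
date. A minimal sketch is given below, assuming each source has already been parsed into a
list of dictionaries with a "date" key. The function name merge_by_date and the dates are
hypothetical; the numeric values are taken from the sample rows of the dataset.

```python
def merge_by_date(weather_rows, pm25_rows):
    """Inner-join two lists of dicts on their 'date' key."""
    pm25_by_date = {row["date"]: row["PM2.5"] for row in pm25_rows}
    merged = []
    for row in weather_rows:
        if row["date"] in pm25_by_date:  # keep only days present in both sources
            merged.append({**row, "PM2.5": pm25_by_date[row["date"]]})
    return merged


# Dates are placeholders for illustration only.
weather = [
    {"date": "2021-01-01", "T": 12.3, "TM": 21.4, "Tm": 6.4},
    {"date": "2021-01-02", "T": 13.7, "TM": 22.0, "Tm": 7.0},
    {"date": "2021-01-03", "T": 14.9, "TM": 21.9, "Tm": 7.1},
]
pm25 = [
    {"date": "2021-01-01", "PM2.5": 152.554167},
    {"date": "2021-01-03", "PM2.5": 319.7375},
]

dataset = merge_by_date(weather, pm25)
for row in dataset:
    print(row)
```

An inner join drops days that are missing from either source; keeping them and filling the
gaps later is the alternative, and it interacts with the missing-value handling described in
the data processing steps.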
DATA DICTIONARY
The dataset used in the research has below attributes:
T: - Avg. temperature in °C.
TM: - Max. temperature in °C.
Tm: - Min. temperature in °C.
SLP: - Atmospheric pressure at sea level in hPa.
H: - Avg. humidity in %.
VV: - Avg. visibility in km.
V: - Avg. wind speed in km/h.
VM: - Max. sustained wind speed in km/h.
PM2.5: - Avg. PM2.5 concentration in µg/m³.

A sample of the dataset used in this research is shown below:

T     TM    SLP     H   VV   Tm    V     VM    PM2.5
12.3  21.4  1019.6  70  1.9  6.4   6.7   14.8  152.554167
13.7  22    1018.2  63  1.8  7     7.2   18.3  152.320833
14.9  21.9  1018.3  73  0.6  7.1   2     7.6   319.7375
15    24    1017.4  81  0.6  7.6   2.2   7.6   332.708333
16.3  24    1017.2  80  0.8  12    1.3   9.4   279.6
18.7  26.3  1015.3  64  1.6  9.5   1.7   7.6   179.116667
14.6  21.6  1018.4  72  2.7  9.4   9.4   18.3  54.7916667
14.7  21.7  1017.1  71  1.8  8.4   10.9  22.2  93.375
14.8  22.6  1016.8  67  2.1  8.5   9.3   18.3  103
17.4  25.4  1015.8  63  1.9  10.3  6.7   14.8  132.208333
18.1  26.7  1013.8  61  2.6  10.2  5     11.1  127.708333
18.3  23.2  1011.2  65  1.9  12.7  10    22.2  109.333333
DATA PROCESSING
Raw data received from various sources generally cannot be used for analysis until it has
been cleaned. The data must be prepared carefully for the applied models to produce
appropriate results, so preprocessing the data before feeding it into a model is a critical
part of data analysis.
1. Handling Null/Missing values:
The first step is to check for any null values in the dataset; these can then be handled in
two different ways:
- Identifying and removing any null values.
- Imputing them with the mean/median approach, using scikit-learn's imputer class.
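The two options above can be sketched in plain Python as follows. This is a stand-in for the
library imputer rather than the code used in this study; the function name impute is
hypothetical, and the gaps (None) in the humidity column are invented for illustration.

```python
from statistics import mean, median


def impute(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]


humidity = [70, 63, None, 81, 80, None]  # None marks a missing reading

dropped = [v for v in humidity if v is not None]  # option 1: remove nulls
filled_mean = impute(humidity)                    # option 2: mean imputation
filled_median = impute(humidity, "median")        # option 2: median imputation
print(dropped, filled_mean, filled_median)
```

Removing rows keeps only measured values but shortens the series, while imputation keeps
every day at the cost of introducing estimated values; the median is the less sensitive of
the two fills when a column contains extreme readings.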

2. Univariate Analysis:
This is the most basic data analysis method. It describes and summarises one variable at a
time in order to identify patterns. Examining the distribution of each feature is also a
requirement for quality assurance. Histograms, distribution graphs, and probability plots
will be used to do this.

3. Bivariate Analysis:
A quantitative analysis of two variables is performed in order to determine the relationship
between them, using scatter plots and correlation and causality analysis. This analysis is
carried out to discover the correlation between each dataset variable and PM2.5. Before a
regression line is fitted, scatter plots show the relationship between two numerical
variables in visual form. A strong linear association (linear correlation) between two
variables is visible in such graphs as points lying close to a straight line; when there is
no link between the two, a rise or drop in one quantity tells us nothing about the other.
The correlation coefficient r always lies between -1 and +1: -1 is a perfect negative
correlation, +1 is a perfect positive correlation, and 0 means that no linear correlation
exists.
(Kumar, 2021)
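As a small worked example of the correlation coefficient r, the sketch below computes the
Pearson correlation between visibility (VV) and PM2.5 for the twelve sample rows of the
dataset. The function name pearson_r is an assumption; the formula is the standard one.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# VV and PM2.5 columns of the sample rows.
vv = [1.9, 1.8, 0.6, 0.6, 0.8, 1.6, 2.7, 1.8, 2.1, 1.9, 2.6, 1.9]
pm25 = [152.554167, 152.320833, 319.7375, 332.708333, 279.6, 179.116667,
        54.7916667, 93.375, 103, 132.208333, 127.708333, 109.333333]

r = pearson_r(vv, pm25)
print(round(r, 3))
```

In these twelve rows, low visibility coincides with high PM2.5, so r comes out negative.
Twelve rows are far too few to draw conclusions from; the same function would be applied to
each predictor over the full dataset.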

4. Handling Multicollinearity:
Because the dataset contains many independent attributes, it must be checked for
multicollinearity. Multicollinearity occurs when two or more features are highly correlated
with one another. If the variables are heavily correlated, they contribute no unique or
independent information to the regression.
Multicollinearity can be identified using the following methods:
- Plotting a pair plot to compare two independent features.
- Using the correlation matrix and heat map to see how closely the different features are
linked.

5. Outlier Detection:
Outliers, which are extremely small or extremely large values, can distort the general
conclusions drawn from a data set. The interquartile range (IQR), visualised with a box plot,
is used to find outliers in a data set. By statistical convention, outliers are points that
fall more than 1.5 times the IQR above the third quartile or below the first quartile. With
domain-specific knowledge and a thorough understanding of the data, we can determine an
appropriate threshold value for which the data makes sense. (Huang, 2015)
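The 1.5 x IQR rule can be sketched as follows. The function name iqr_outliers and the example
readings are hypothetical, and quartiles are taken from Python's statistics.quantiles with
its default method; other tools use slightly different quartile conventions, so the fences
may differ a little between packages.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # first, second and third quartile
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Invented wind-speed readings with one implausibly large value.
readings = [10, 12, 11, 13, 12, 11, 95]
print(iqr_outliers(readings))
```

The multiplier k is exposed as a parameter because, as noted above, domain knowledge may
justify a wider or narrower threshold than the conventional 1.5.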

PROPOSED METHOD/MODELS (REGRESSION)


Predictive analytics in this study relies heavily on regression analysis. Four distinct
supervised models are used and tested to see how well they can predict PM2.5 values, and the
model with the higher degree of generalizability and better accuracy is selected. The models
are outlined in the following paragraphs.
Linear regression is a supervised machine learning model that represents the relationship
between one or more input features and a continuous output. It predicts continuous values
using the features of the data as input. Take the relationship between an input X and an
output Y as an example: in simple (univariate) regression a single independent feature is
used to predict the output, and a sloped straight line is fitted to the data in order to
establish this relationship.
Linear regression models rest on a few assumptions, such as:
- There should be a linear relationship between the dependent and the independent
variables.
- The error terms are assumed to be normally distributed with a mean of 0.
- Each error term is assumed to be independent of the rest, and the errors should have
constant variance.
"Prediction" here means calculating the scores of an outcome (dependent) variable from the
scores of predictor (independent) variables. An algorithm is used to construct a regression
line that passes as close as possible to the data points. The best-fit line is designed to
predict values as closely as feasible to the observed data, so its primary goal is to reduce
the discrepancy between predicted and actual values (residuals). This technique is known as
the least squares method: the quality of the regression line is improved by reducing the
total squared error. Univariate linear regression has a hypothesis of the form
y = b0 + b1 x, where b0 is the intercept and b1 is the slope.
In this linear regression, the task is to find the slope and intercept of the line that
minimise the cost. This cost can be described by the RSS, or residual sum of squares: the sum
of the squared deviations between the estimated and actual values of Y.
Multiple linear regression (MLR) extends this idea: it predicts the value of a dependent
variable (y), here PM2.5, from the values of several independent (predictor) variables (x).
The cost function measures the errors between predicted and actual values, and it must be
minimised to obtain the best-fit regression line. (Akiladevi R, 2021)
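For the univariate case, the line that minimises the RSS has a closed-form solution, which
the sketch below implements. The symbols b0 (intercept) and b1 (slope), the function names,
and the toy data are assumptions made for illustration; MLR generalises the same idea to
several predictors.

```python
def fit_line(x, y):
    """Least-squares intercept b0 and slope b1 for y = b0 + b1 * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sxy / sxx          # slope
    b0 = my - b1 * mx       # intercept: the line passes through the means
    return b0, b1


def rss(x, y, b0, b1):
    """Residual sum of squares of the line y = b0 + b1 * x."""
    return sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))


# Toy data lying exactly on y = 1 + 2x.
x = [0, 1, 2, 3]
y = [1, 3, 5, 7]
b0, b1 = fit_line(x, y)
print(b0, b1, rss(x, y, b0, b1))

# Any other line has a larger RSS on the same data.
print(rss(x, y, b0, b1 + 0.5))
```

Comparing the RSS of the fitted line with that of a perturbed line shows the sense in which
the least squares solution is the best fit.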

XG-BOOST
Boosting, like random forests, combines many decision trees, but there are two major
differences. Gradient boosting builds trees one at a time, each new tree correcting the
errors of the trees before it, whereas random forests build each tree independently. Random
forests combine results at the end of the process (by averaging for regression or "majority
rules" for classification), whereas gradient boosting combines results along the way.
Extreme Gradient Boosting (XG-Boost) is a machine learning method for predictive analytics
that can be applied to a variety of problems using tree boosting (a sequential ensemble). To
increase performance, it gradually combines weak learners into a more accurate model (a
strong learner). Gradient boosting uses the gradient descent technique, which iteratively
reduces the model's loss by updating the model in the direction given by the gradient of the
loss function. The loss function (LF) measures the difference between predicted and actual
values. Mean Squared Error (MSE) loss will be used for this regression problem.
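The sequential idea can be illustrated with a deliberately simplified sketch: plain gradient
boosting with one-split trees (stumps) on a single feature and squared-error loss. This is
not the XG-Boost library, which adds regularisation and other refinements; all function
names and the settings n_rounds and learning_rate are assumptions. The data are the VV and
PM2.5 columns of the sample rows.

```python
def fit_stump(x, residuals):
    """Fit a one-split regression tree (stump) to the residuals by least squares."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # split points that leave both sides non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("x needs at least two distinct values")
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def gradient_boost(x, y, n_rounds=50, learning_rate=0.1):
    """Boost stumps on squared-error loss; returns a prediction function."""
    base = sum(y) / len(y)  # initial model: the mean of the target
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # For squared-error loss the negative gradient is the residual.
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        preds = [pi + learning_rate * stump(xi) for xi, pi in zip(x, preds)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


vv = [1.9, 1.8, 0.6, 0.6, 0.8, 1.6, 2.7, 1.8, 2.1, 1.9, 2.6, 1.9]
pm25 = [152.554167, 152.320833, 319.7375, 332.708333, 279.6, 179.116667,
        54.7916667, 93.375, 103, 132.208333, 127.708333, 109.333333]

model = gradient_boost(vv, pm25)
baseline = sum(pm25) / len(pm25)
mse_baseline = mse(pm25, [baseline] * len(pm25))
mse_boosted = mse(pm25, [model(v) for v in vv])
print(mse_baseline, mse_boosted)
```

Each round fits a weak learner to what the current ensemble still gets wrong and adds a
scaled-down copy of it, so the training error shrinks step by step. The errors printed here
are training errors on twelve rows and say nothing about generalisation.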

ANALYTICAL FRAMEWORK
Business challenges can be tackled with a number of analytical frameworks; the CRISP-DM
framework has been used in this investigation. This approach applies a separate analysis at
each stage of the project life cycle. After the study's goals are defined, the data
understanding step lets researchers identify patterns in the data and handle data types so
that the data can be fed into a modelling programme.
Once the data has been prepared, it is used in the modelling stage, and the resulting models
are then evaluated. A variety of measures will be used to gauge each model's
generalizability, and the best model will be selected for final deployment and closely
monitored.
SUMMARY
This chapter's conclusion summarises the study's approach. There have been a variety of
methods employed by scientists to date in order to make predictions about pollution
concentrations in the air. Prediction of PM25 was analysed using a wide range of traditional
statistics and deep learning methods.
Additionally, weather observations from Tutiempo Networks SL were incorporated into this
study. Web scraping tools such as Beautiful Soup were used to extract the relevant table from
the raw data, which was obtained as HTML pages. Meteorological data was collected for Delhi
between January 2013 and April 2016, yielding eight distinct features. PM2.5 AQI data for the
same time period was then collected from Openweathermap.org. The total dataset was formed by
combining the meteorological and AQI data. Using pertinent plots and graphs, we explored the
handling of incomplete data, multivariate analysis, bivariate analysis, multicollinearity,
outlier detection, and feature selection/importance (Kumar, 2021).
Different classic statistical modelling methods are then described and assessed using metrics
such as MAE (mean absolute error), MSE (mean squared error), and RMSE (root mean squared
error). Lastly, we discussed the framework that we employed for our research.
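For reference, the three error metrics named above can be computed as in the following minimal
sketch. The observed and predicted values are made-up numbers used only to show the arithmetic;
they are not results from this study.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average size of the errors, in the units of the target."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error: average squared error, which penalises large errors more."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: square root of MSE, back in the units of the target."""
    return math.sqrt(mse(y_true, y_pred))


# Hypothetical PM2.5 values (illustration only).
observed = [100.0, 150.0, 200.0, 250.0]
predicted = [110.0, 140.0, 190.0, 270.0]

mae_value = mae(observed, predicted)    # errors of 10, 10, 10, 20 give 12.5
mse_value = mse(observed, predicted)    # squared errors 100, 100, 100, 400 give 175.0
rmse_value = rmse(observed, predicted)  # square root of 175.0

print(mae_value, mse_value, round(rmse_value, 2))
```

Because the errors are squared before averaging, the single error of 20 contributes more than
half of the MSE, which is why MSE and RMSE react more strongly than MAE to occasional large
misses.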

CHAPTER-6 CONCLUSIONS AND RECOMMENDATIONS


INTRODUCTION
Section 6.2 of this chapter discusses how the research objectives described in Section 1.3 are
used to develop conclusions. Sections 6.2 and 6.3 deal with the project's contribution to
knowledge and with its future direction and recommendations.
DISCUSSION AND CONCLUSION
To arrive at these conclusions, it is necessary to refer back to the research objectives set
out in Section 2.14. The objectives of this study were:
- To investigate the correlation between meteorological conditions and air pollutant
(PM2.5) AQI values using statistical approaches and graphs.
To achieve this goal, histograms and frequency plots were used to analyse each meteorological
variable and to determine where the data points are most concentrated and how dense they are
(Ma, 2021).
When the study examined the frequencies and percentages and compared the tails of each
distribution (kurtosis) with the normal distribution, it found platykurtic distributions. A
bivariate analysis was then used to see how each independent feature correlated with the
target feature, PM2.5. SLP and PM2.5 have a positive correlation, while Tm and T have a strong
negative correlation with PM2.5.
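The two statistics used for this objective can be sketched as follows. This is a hedged
illustration rather than the analysis code of this study: the feature values are invented, and
only the feature names T and SLP are taken from the text above. A negative excess kurtosis
corresponds to a platykurtic distribution, and the sign of the Pearson coefficient gives the
direction of the bivariate relationship.

```python
import math
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def excess_kurtosis(values):
    """Population excess kurtosis: 0 for a normal distribution, < 0 if platykurtic."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n  # second central moment
    m4 = sum((v - m) ** 4 for v in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3.0


# Hypothetical daily values (illustration only, not the study's data).
t = [12, 15, 18, 22, 26, 30, 33]                     # average temperature
slp = [1020, 1018, 1016, 1012, 1009, 1006, 1004]     # sea level pressure
pm25 = [210, 190, 160, 130, 100, 80, 60]             # target feature

r_t = pearson(t, pm25)      # negative: PM2.5 falls as temperature rises
r_slp = pearson(slp, pm25)  # positive: PM2.5 rises with sea level pressure
kurt_pm25 = excess_kurtosis(pm25)

print(f"r(T, PM2.5) = {r_t:.2f}, r(SLP, PM2.5) = {r_slp:.2f}")
print(f"excess kurtosis of PM2.5 = {kurt_pm25:.2f}")
```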
- To find a more generalised model for prediction by proposing various supervised
learning methods.
To forecast the PM2.5 level from meteorological data, supervised machine learning models such
as regression analysis, Decision Tree, Random Forest and XGBoost were developed. When the
models are compared, the Random Forest algorithm performs better than the others. It has been
fairly successful on both the training and the test data. As a result, it was chosen as one of
the most generalizable models for this study, with an R2 of 70%.
- To monitor the effectiveness of the regression methods using MAE, MSE, and RMSE as well
as residual analysis.
MAE, MSE, and RMSE were used to evaluate the presented models. As model complexity increases,
a small drop in the error terms is observed, which indicates that the model's performance
improves. MSE and RMSE are the most informative performance metrics for this evaluation,
because their values decrease most noticeably as a model becomes more complex.
CONTRIBUTION TO KNOWLEDGE
Recent studies on pollutant concentration prediction show that ozone, NO2, and PM2.5
concentrations are influenced by local topography, weather, and air pollution sources.
Meteorological conditions play an important role in determining the variance in air pollution
concentrations.
So far, this study has looked only at weather data. PM2.5 concentrations are highly correlated
with daily average temperature, sea level atmospheric pressure, and average visibility. The
findings of this study can be used by government bodies to develop new policies or
preventative actions based on India's standard air quality index (Huang, 2015).
FUTURE WORK
This study tested the forecasting of PM2.5 in New Delhi, India's capital city, using only
meteorological data over a short time period. The system's performance was then assessed
manually. Future studies will therefore consider using additional meteorological data to
address overfitting, applying deep learning techniques to improve performance, and validating
the models with real-time data. This work also serves as a foundation for research in the
areas of weather forecasting, catastrophe warning and policy formulation for the aviation and
agricultural industries.

References
Akiladevi R, N. D. B. N. K. V. N. P., 2021. Prediction and Analysis of Pollutant using.
[Online]
Available at: https://www.ijrte.org/wp-content/uploads/papers/v9i2/A2837059120.pdf
[Accessed 21 October 2021].
Huang, C., 2015. High-Resolution Spatiotemporal Modeling for Ambient PM2.5 Exposure
Assessment in China from 2013 to 2019. [Online]
Available at: https://pubs.acs.org/doi/full/10.1021/acs.est.0c05815?
casa_token=FhXEjX04_DsAAAAA%3AsiyFpqHMsva-
cCwlcLdGnYvGXEaoyY2sKUh5Edfpk1C0xrt3k4MYH96REtmPA7NQE4D2ZISdukiHiA
[Accessed 28 June 2020].
Kumar, S., 2021. A machine learning-based model to estimate PM2.5 concentration levels in
Delhi's atmosphere. [Online]
Available at: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7710640/
[Accessed 18 February 2021].
Madhuri.N, 2021. Prediction of Spatial-Temporal PM2.5 using Non-Parametric Technique.
[Online]
Available at: https://www.irjet.net/archives/V8/i8/IRJET-V8I8453.pdf
[Accessed 18 August 2021].
Ma, J., 2021. Application of the XGBoost Machine Learning Method in PM2.5 Prediction: A
Case Study of Shanghai. [Online]
Available at: https://aaqr.org/articles/aaqr-19-08-oa-0408
[Accessed 18 March 2021].
Patil, B. U., 2021. OPTIMIZATION OF HYPER PARAMETERS. [Online]
Available at: https://ijits-bg.com/contents/IJITS-2021-No3/2021-N3-07.pdf
[Accessed 18 October 2021].
Shi, J., 2018. Time Series Forecasting (TSF) Using Various. [Online]
Available at: https://arxiv.org/ftp/arxiv/papers/2204/2204.11115.pdf
[Accessed 18 January 2018].
Sun, H., 2015. Improvement of PM2.5 and O3 forecasting by integration of 3D numerical
simulation with deep learning techniques. [Online]
Available at: https://www.sciencedirect.com/science/article/pii/S2210670721006466?
casa_token=8dgF1ESToyAAAAAA:x98xxuYKFCkroG6ZWGvgFIWDUOjh0JOd7vVjM1E
fvnWcVj4f3vCB6DHvEA23cBhUPDwgUEyx
[Accessed 04 September 2018].
Trung, T., 2021. PM2.5 Forecast System by Using Machine Learning and WRF Model, A
Case Study: Ho Chi Minh City, Vietnam. [Online]
Available at: https://aaqr.org/articles/aaqr-21-05-oa-0108
[Accessed 5 March 2020].
UnjinPak, 2020. Deep learning-based PM2.5 prediction considering the spatiotemporal
correlations: A case study of Beijing, China. [Online]
Available at: https://www.sciencedirect.com/science/article/pii/S0048969719334813?
casa_token=AKLz6aGElkMAAAAA:yFRwUvhtFs8abgl7ftbiFLNhqok4ZPfgfU6K5I3SnvN
OEArkoctDRvAoDX52pxzJ8FRQowwK
[Accessed 25 June 2019].
Wang, H.-P., 2020. PM2.5 Prediction Model Based on Combinational Hammerstein
Recurrent Neural Networks. [Online]
Available at:
https://www.researchgate.net/publication/347408054_PM25_Prediction_Model_Based_on_C
ombinational_Hammerstein_Recurrent_Neural_Networks
[Accessed 18 December 2020].
Xiao, F., 2018. An improved deep learning model for predicting daily PM2.5 concentration.
[Online]
Available at: https://www.nature.com/articles/s41598-020-77757-w
[Accessed 28 May 2018].
Xiao, Q., 2018. An Ensemble Machine-Learning Model To Predict Historical PM2.5
Concentrations in China from Satellite Data. [Online]
Available at: https://pubs.acs.org/doi/abs/10.1021/acs.est.8b02917
[Accessed 19 January 2018].
Xiao, Q., 2018. An Ensemble Machine-Learning Model To Predict Historical PM2.5
Concentrations in China from Satellite Data. [Online]
Available at: https://pubs.acs.org/doi/full/10.1021/acs.est.8b02917?
casa_token=xAaSO2rSzoYAAAAA%3ACFYkPJB7nbnpsfc8TwztrM_DsQ3nVc-
CmDwIgRh0kMLgQ46aGbLvtpA7VJ9orKywNyJ7qMeQeLb0-A
[Accessed 28 July 2019].
Zeng, Q., 2016. Deep Learning Architecture for Estimating Hourly Ground-Level PM2.5
Using Satellite Remote Sensing. [Online]
Available at: https://ieeexplore.ieee.org/abstract/document/8685685?casa_token=mjvB0IWg-
lQAAAAA:mHAsA0XnQXV92CMFf16qvdF6tdjNpacFJPI6VS4DaTj6rNUh9aVxnOr0xt6Ip
ld3MgrB4BOU
[Accessed 29 August 2018].
Zhong, J., 2019. Relatively weak meteorological feedback effect on PM2.5 mass change in
Winter 2017/18 in the Beijing area: Observational evidence and machine-learning
estimations. [Online]
Available at: https://www.sciencedirect.com/science/article/pii/S0048969719302414?
casa_token=hvnZWJR0620AAAAA:2MYxTz5VVcihzJXQJ7TkbAFeiD_k358-
etL4eTvoI5rzQwJ-JjDzcexiuSct6u8HoGvjmGYT
[Accessed 18 June 2020].
