(ESE1008)
Project Report
Declaration of Originality
I understand that Plagiarism is the act of taking and using the whole or
any part of another person’s work and presenting it as my own without
proper acknowledgement.
Class: PE06
Content Page
1. Pre-Project Plan
2. Monitor
3. Introduction
4. Data Cleaning
5. Exploratory Data Analysis
6. Further Insights
7. Data Modeling
8. Conclusion
9. Reflection
1. Pre-Project Plan
Goal Setting
I shall check the project rubric to ensure all items are done before submission.
levels?
6. How do the SO2 and NO2 levels vary with rainfall with time?
Monitor
3. Introduction
Today, the world is developing rapidly. Urbanisation is the process of increasing
the proportion of people living in urban areas. [1] This has been clearly visible in
Beijing over the past 30 years, with many croplands being converted to urban
areas. By the end of 2017, Beijing had a population of 21.7 million and an
urbanisation rate of 86.5%. In the past 30 years or so, the city’s population has
nearly tripled, and its built-up area has grown nearly eightfold. [2] Although
urbanisation brings many advantages to the economy of a country, it comes with
many adverse effects on air quality and on other environmental
factors like temperature, pressure, rainfall, and dewpoint. Poor air quality can
affect the health of people of all ages living in the region.
Some short-term effects include irritation of the eyes, nose, and throat
and shortness of breath, which can aggravate asthma and other respiratory
conditions. Extended periods of inhaling polluted air can affect the lungs, heart and
cardiovascular system, leading to lung cancer and heart disease. [3]
The Concatenate node was used to join the data of each station into one single
dataset.
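As a rough illustration of what the Concatenate step does, the sketch below stacks two station tables row-wise in plain Python. The rows and the helper name `concatenate` are hypothetical and only stand in for the real station files; this is not the KNIME node itself.

```python
# Hypothetical miniature station tables (the real files have many more
# rows and columns); the values are made up for illustration only.
aotizhongxin = [
    {"station": "Aotizhongxin", "year": 2013, "month": 3, "PM2.5": 4.0},
    {"station": "Aotizhongxin", "year": 2013, "month": 3, "PM2.5": 8.0},
]
dingling = [
    {"station": "Dingling", "year": 2013, "month": 3, "PM2.5": 4.0},
]


def concatenate(*tables):
    """Stack tables row-wise, requiring every row to have the same columns."""
    columns = set(tables[0][0].keys())
    combined = []
    for table in tables:
        for row in table:
            if set(row.keys()) != columns:
                raise ValueError("column mismatch between tables")
            combined.append(dict(row))  # copy so the inputs stay untouched
    return combined


combined = concatenate(aotizhongxin, dingling)
print(len(combined), "rows in the combined table")
```

The row count of the combined table is simply the sum of the row counts of the inputs, which is a quick check that no station was dropped.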
The Concatenated table allows us to learn more about the variables in the data
There are 70128 observations in total across both locations
There are 17 variables
Two Rule Engine nodes were used to append a longitude and a latitude column
to the dataset using the ‘Location_GPS’ data given.
Latitude
Longitude
The classified values now show the Longitude and Latitude of the stations.
However, they are stored as strings.
The String to Number node was used to transform the latitude and longitude
values into numbers (double).
The transformed table now shows that the latitude and longitude values are
identified as numbers.
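The two steps above (a rule that assigns coordinates by station, then a string-to-number conversion) can be sketched as follows. The coordinate strings are approximate placeholders chosen for illustration, not the values from the Location_GPS data, and the function name is hypothetical.

```python
# Placeholder coordinates stored as strings, mimicking the rule output.
# These are illustrative values only, NOT taken from the Location_GPS data.
STATION_GPS = {
    "Aotizhongxin": ("39.98", "116.40"),
    "Dingling": ("40.29", "116.22"),
}


def add_coordinates(rows):
    """Append Latitude/Longitude columns and convert them to floats."""
    result = []
    for row in rows:
        lat_text, lon_text = STATION_GPS[row["station"]]  # rule by station
        new_row = dict(row)
        new_row["Latitude"] = float(lat_text)    # string -> number (double)
        new_row["Longitude"] = float(lon_text)
        result.append(new_row)
    return result


rows = add_coordinates([{"station": "Dingling"}, {"station": "Aotizhongxin"}])
print(rows[0])
```

Keeping the coordinates numeric matters later, because a map or scatter plot needs numbers rather than text for its axes.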
4. Data Cleaning
The Statistics node was used to show how many missing values there are in the
dataset.
Name of the variable    Number of missing values
year                    0
month                   0
day                     0
hour                    0
PM2.5                   1704
PM10                    1374
SO2                     1665
NO2                     2257
CO                      3788
O3                      2933
TEMP                    73
PRES                    70
DEWP                    73
RAIN                    71
WSPM                    57
Latitude                0
Longitude               0
The Missing Value node was used to remove all the rows that contain missing
values.
The output table from the missing value node shows that the number of rows has
reduced from 70128 to 63275. 6853 rows were lost due to missing values. There
are now 19 variables compared to the previous 17 due to the addition of the
longitude and latitude columns.
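A minimal sketch of the row-removal step, assuming missing values are represented as `None` (an assumption made for this example; KNIME has its own missing-value marker). The sample rows are made up, and the last line only repeats the arithmetic from the figures reported above.

```python
def drop_incomplete_rows(rows):
    """Keep only rows in which no column is missing (None)."""
    return [row for row in rows if all(v is not None for v in row.values())]


# Made-up sample rows for illustration.
rows = [
    {"PM2.5": 4.0, "CO": 300.0},
    {"PM2.5": None, "CO": 300.0},   # missing PM2.5 -> removed
    {"PM2.5": 6.0, "CO": None},     # missing CO -> removed
    {"PM2.5": 7.0, "CO": 400.0},
]
clean = drop_incomplete_rows(rows)
removed = len(rows) - len(clean)

# Same bookkeeping applied to the row counts quoted in the report.
rows_lost_in_report = 70128 - 63275
print(removed, rows_lost_in_report)
```

Note that a row is dropped if any one of its columns is missing, so the number of rows lost can be smaller than the sum of the per-variable missing counts when several gaps fall in the same row.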
The Statistics node was used again to verify that there is no more missing data.
The statistics view now shows that there are no missing values in any of the
variables, which implies that the data is now clean.
The CSV Writer node was used to export the clean data into a new CSV file.
5. Exploratory Data Analysis
5.2. PM2.5
The box and whisker plot shows that the PM2.5 levels are positively skewed
and a significant amount of data lies beyond the upper whisker (>234), which
suggests that there are potential outliers.
5.3. PM10
The box and whisker plot shows that the PM10 levels are positively skewed
and a significant amount of data lies beyond the upper whisker (>295), which
suggests that there are potential outliers.
5.4. Sulfur Dioxide
Interquartile range = 18 – 2 = 16
The box and whisker plot shows that the SO2 levels are positively skewed and
a significant amount of data lies beyond the upper whisker (>41), which
suggests that there are potential outliers.
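The link between the interquartile range and the whisker can be sketched as below, assuming the common 1.5 × IQR rule (an assumption; the report does not state which rule the plotting tool applied). The quartiles 2 and 18 are the SO2 values quoted above; the sample readings and function names are hypothetical.

```python
def whisker_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences under the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, q1, q3, k=1.5):
    """Return the values that fall outside the fences."""
    low, high = whisker_fences(q1, q3, k)
    return [v for v in values if v < low or v > high]


# SO2 quartiles from the report: Q1 = 2, Q3 = 18, so IQR = 16.
low, high = whisker_fences(2, 18)

# Made-up readings for illustration only.
outliers = flag_outliers([1, 5, 18, 40, 43, 120], 2, 18)
print(low, high, outliers)
```

Under this rule the upper fence is 18 + 1.5 × 16 = 42. A whisker is usually drawn at the most extreme observation inside the fence rather than at the fence itself, which may be why the plot shows 41.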
Interquartile range = 64 – 15 = 49
The box and whisker plot shows that the NO2 levels are positively skewed and
a significant amount of data lies beyond the upper whisker (>137), which
suggests that there are potential outliers.
5.6. Carbon Monoxide
The box and whisker plot shows that the CO levels are positively skewed, and
a significant amount of data lies beyond the upper whisker (>2600), which
suggests that there are potential outliers.
5.7. Ozone
Interquartile range = 88 – 17 = 71
The box and whisker plot shows that the O3 levels are normally distributed, and
a significant amount of data lies beyond the upper whisker (>194), which
suggests that there are potential outliers.
5.8. Windspeed
50% of the windspeed levels in both locations are below 1.4 m/s.
Windspeed should be treated as a continuous data type to give meaningful
insight.
The box and whisker plot shows that the windspeed levels are positively
skewed and a significant amount of data lies beyond the upper whisker (>4.2),
which suggests that there are potential outliers.
5.9. Temperature
50% of the temperature levels in both locations are below 14.6 degrees Celsius.
Temperature should be treated as a continuous data type to give
meaningful insight.
Temperature levels range from -16.8 degrees Celsius to 41.4 degrees Celsius.
The box and whisker plot shows that the temperature levels are negatively
skewed, and none of the data lies beyond the upper and lower whiskers (>41.4
and <-16.8), which suggests that there are no outliers.
5.10. Pressure
The box and whisker plot shows that the pressure levels are normally
distributed, and none of the data lies beyond the upper and lower whiskers
(>1042 and <982.8), which suggests that there are no outliers.
5.11. Dewpoint
50% of the dewpoint levels in both locations are below 2.8 degrees Celsius.
Dewpoint should be treated as a continuous data type to give meaningful
insight.
Dewpoint levels range from -35.3 degrees Celsius to 28.5 degrees Celsius.
The box and whisker plot shows that the dewpoint levels are normally
distributed, and none of the data lies beyond the upper and lower whiskers
(>28.5 and <-35.3), which suggests that there are no outliers.
5.12. Rain
The box and whisker plot shows the upper whisker, upper quartile, median,
lower quartile and lower whisker all at 0 because the majority of the rain
data is 0, so any rain value greater than 0 is treated as an outlier.
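To see why a mostly-zero variable collapses the whole box onto 0, the sketch below computes quartiles for a made-up rain sample using Python's `statistics.quantiles`. The sample values are hypothetical and the quartile method may differ from the one used by the plotting tool.

```python
import statistics

# Made-up hourly rain sample: almost every hour is dry.
rain = [0.0] * 18 + [0.3, 2.5]

# quantiles(..., n=4) returns the three cut points Q1, median, Q3.
q1, median, q3 = statistics.quantiles(rain, n=4)
iqr = q3 - q1

# With IQR = 0 both fences sit at 0, so every wet hour is flagged.
upper_fence = q3 + 1.5 * iqr
wet_hours_flagged = [value for value in rain if value > upper_fence]
print(q1, median, q3, wet_hours_flagged)
```

This is a property of the box plot rule rather than evidence that rainy hours are measurement errors, so these points should not be removed as bad data.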
5.13. Wind Direction
From the correlation matrix, we can tell which variables have a positive, negative or
no correlation. We can also tell if these correlations are strong or weak.
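Each cell of a correlation matrix is a pairwise coefficient. The sketch below computes a Pearson coefficient by hand and turns it into a verbal label. The cut-offs of 0.1 and 0.5 are assumptions chosen for illustration, not thresholds taken from this report, and the sample numbers are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def describe(r, weak=0.1, strong=0.5):
    """Label a coefficient; the thresholds are illustrative assumptions."""
    if abs(r) < weak:
        return "no correlation"
    strength = "strong" if abs(r) >= strong else "weak"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} correlation"


r = pearson([1, 2, 3], [2, 4, 6])  # made-up, perfectly linear data
print(r, describe(r), describe(-0.3), describe(0.02))
```

The coefficient only measures linear association, so a pair labelled "no correlation" can still be related in a non-linear or seasonal way.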
PM10 – DEWP
SO2 – NO2
SO2 – CO
SO2 – PRES
NO2 – PRES
CO – PRES
O3 – TEMP
O3 – DEWP
O3 – WSPM
PRES – WSPM
DEWP – RAIN
No correlation
PM2.5 – PRES
PM2.5 – RAIN
PM10 – PRES
PM10 – RAIN
SO2 – RAIN
NO2 – DEWP
NO2 – RAIN
CO – RAIN
O3 – RAIN
TEMP – RAIN
TEMP – WSPM
RAIN – WSPM
Weak negative correlation
PM2.5 – O3
PM2.5 – TEMP
PM2.5 – WSPM
PM10 – O3
PM10 – TEMP
PM10 – WSPM
SO2 – O3
SO2 – TEMP
SO2 – DEWP
SO2 – WSPM
NO2 – O3
NO2 – TEMP
NO2 – WSPM
CO – O3
CO – TEMP
CO – DEWP
CO – WSPM
O3 – PRES
PRES – RAIN
DEWP – WSPM
Scatter plots were used to support the correlation claims and to view them
graphically.
5.15. Calculated Fields
6. Further Insights
against the months of the year. The bar graph represents the rain while the
line graph represents the temperature. The second graph is the same as
the first but using the rain and temperature values from Dingling instead.
The last graph shows the average air pressure against the months of the
year. The three graphs are stacked on top of one another in a single dashboard for
clearer viewing of the correlation between the variables. From the graphs,
we can see that in general both locations have the same trend of rain and
also increases. This increase in rain and temperature usually occurs only in the
middle of the year, between June and August, before reducing again
towards the end of the year. Furthermore, we can see that air pressure
levels fall during the hot rainy season and increase as the temperature
After further analysis of the data, I can deduce that, in general, the
maximum rain and temperature occur in July for both locations. After
looking into the days of the month of July, I plotted another bar and line
We can see that both locations experience almost exactly the same rainfall
and temperatures across the days of the month. The
temperatures in both locations do not vary much through the month and
stay within 25 to 30 degrees Celsius. However, we can see that the rain in
both locations is not consistent. For the most part, little to no rain occurs
selected days.
I researched the reasons why the correlation between temperature, rainfall
to the Earth’s axis being tilted at 23.4 degrees. Due to the geographical
location of China, from June to September, the rays from the sun hit China
at a steep angle causing more energy to reach China during that period.
Together with longer hours of daylight during summer, this causes higher
and Dingling are greater in the middle of the year compared to the start and
it experiences summer at this time where the air in the region heats up,
expands, and rises. This forms a region of low pressure over the area. At
the same time, places like Australia that are in the southern hemisphere are
experiencing winter. The low temperatures cause the air to be cold and
dense, creating an area of high pressure. As air tends to move from areas
of high pressure to areas of low pressure, this forms winds blowing from
Australia towards China as southwest monsoon winds. This wind is dry and
cold. As the winds pass the equator, the Coriolis effect occurs causing the
and picks up moisture as it passes over the Pacific Ocean and brings heavy
rains to China.
general trend of higher rainfall in the middle of the year, peaking around July
and August. Furthermore, the air pressure line graph confirms this theory
as it can be seen that the air pressure decreases during the summer period
and temperature. The first graph shows the average ozone levels against
the relative humidity levels. The second graph shows the average ozone
levels against the temperature and the last graph shows the average ozone
against the months in a year. The graphs were stacked on each other in a
single dashboard for better comparison and convenience for the viewer to
From the graphs we can see that in general, ozone levels rise in the middle
of the year and are lower at the start and end of the year. We can also see
that temperature and ozone levels have a direct relationship where when
relative humidity also affects the ozone levels inversely. This means that
temperatures in the middle of the year which explains why ozone levels are
pollutants from sources like fossil fuel combustion, vehicles, landfills and
more. This reaction can be catalysed in the presence of heat and sunlight
Relative humidity is the ratio of the actual amount of water vapour present in
the air to the total amount of water vapour the air can hold at a given
temperature. At higher temperatures, the air can hold more water vapour,
causing the relative humidity to decrease. This means that at high relative
humidity, temperatures are lower, so this is the reason why at higher relative
humidity can also allow for the rehydration of desiccated pathogens, which
As seen from the map, which I plotted using the latitude and longitude values of each
Southeast of Dingling.
To compare these two locations and prove which pollutant is affected by wind, I
plotted a bar graph of the average amount of each pollutant in each wind direction for
each location.
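The averaging behind such a bar graph is a group-by-and-mean. A minimal sketch follows, with made-up rows and a hypothetical wind-direction column named `wd` (the column name is an assumption for this example).

```python
from collections import defaultdict


def mean_by_group(rows, group_key, value_key):
    """Average value_key within each distinct value of group_key."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[group_key]] += row[value_key]
        counts[row[group_key]] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Made-up readings for one station; "wd" is an assumed column name.
rows = [
    {"station": "Dingling", "wd": "NE", "CO": 400.0},
    {"station": "Dingling", "wd": "NE", "CO": 600.0},
    {"station": "Dingling", "wd": "NNW", "CO": 300.0},
]
averages = mean_by_group(rows, "wd", "CO")
print(averages)
```

Running the same function once per station and per pollutant gives the set of bars to compare; directions with very few observations produce less reliable averages.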
From the graph, we can see that all the pollutants are affected by wind. This
is evident as from the graph, we can see that carbon monoxide, ozone,
nitrogen dioxide, PM10 and PM2.5 and sulfur dioxide increase and
the pollutants may be affected by the wind, they vary by differing amounts.
Northeast winds is an increase of 613 µg/m³ while the difference in SO2
I conducted some research to find out more about why this occurs.
The reason for this is the density of each pollutant. Carbon monoxide,
PM10 and PM2.5 are less dense than air, causing a slight wind to cause a
Ozone is denser than air so it requires a greater wind speed to change the
Another way to see that wind affects the pollutant levels is to compare the
pollutant levels in each location when wind is present and when it is not. For
the example below, I compared the level of pollutants in each location when
there was wind coming from the North-Northwest direction and when
there was no wind.
The bar graph below shows the average level of each pollutant in Aotizhongxin
This is the bar graph showing the average level of each pollutant in Aotizhongxin when
This is the bar graph showing the average level of each pollutant in Dingling when
The bar graph below shows the average level of each pollutant in Dingling when there
is no wind
wind no wind
O3 50 65.2 26 58.1
It can be seen that in both locations, most of the pollutants decrease in level when
wind is present. This is because when there is no wind, the stagnant conditions can
cause pollution levels to build up and become more concentrated, while winds can
disperse the pollutants and dilute their concentrations. One likely reason for the rise in
ozone levels when wind is present is that the wind speeds in the North-Northwest
direction are very low, causing little to no dispersal of the ozone due to its
density compared to other pollutants. Another possibility is that the
winds were blowing ozone in from a different region of China with high ozone
levels, causing the ozone levels in both locations to rise when wind was present.
6.4. What are the conditions of the air quality in each location?
the level of air pollution in the country. Different countries have different air
pollutants can pose a risk to public health, the air quality index acts as an
Air Quality Index are PM10, PM2.5, NO2, O3, CO and SO2. However,
sometimes NH3 and Pb levels are also taken into account to determine the
AQI.
Category AQI Range Definition
considered
no risk
pose a moderate
population of people
sensitive to air
pollution.
general public is not
likely to be affected
experience health
of sensitive groups
may experience
emergency conditions.
likely to be affected.
For my data, I will be determining the AQI levels in each location using
PM10, PM2.5, NO2, O3, CO and SO2 levels only as I do not have NH3 and
Pb data.
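Assigning each reading to a category and then computing the share of readings per category (the quantity a pie chart displays) can be sketched as below. The category names follow those used in this report, but the numeric limits are placeholders for illustration only and are NOT the breakpoints from the table above; the readings are made up.

```python
import bisect
from collections import Counter

CATEGORIES = ["Good", "Satisfactory", "Moderate", "Poor", "Very Poor", "Severe"]

# Placeholder upper limits for the first five categories (assumed values,
# not the report's breakpoints); anything above the last limit is "Severe".
PLACEHOLDER_LIMITS = [30, 60, 90, 120, 250]


def categorise(value, limits=PLACEHOLDER_LIMITS):
    """Return the category whose range contains value."""
    return CATEGORIES[bisect.bisect_left(limits, value)]


def category_shares(values, limits=PLACEHOLDER_LIMITS):
    """Fraction of readings that fall into each category."""
    counts = Counter(categorise(v, limits) for v in values)
    return {name: counts[name] / len(values) for name in CATEGORIES}


shares = category_shares([12, 45, 130, 260])  # made-up readings
print(shares)
```

Each pollutant needs its own list of limits, so in practice `limits` would be looked up per pollutant before calling `categorise`.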
PM2.5
The pie chart shows that in both locations, the PM2.5 levels fall within the
very poor and severe categories. This can pose an extreme risk to
public health. PM2.5 are particles that are less than 2.5 micrometers in
penetrate deeply into the lungs, irritate and corrode the alveolar wall, and
diffusion affecting the heart, kidneys and other organs. The causes of
these high PM2.5 levels are most likely from the emissions of motorised
vehicles and factories. In China, there are about 253.76 million vehicles
and 399.4 thousand industrial enterprises. With China being the world's most
populous country and a manufacturing giant for the world, this results in
PM10
The pie chart depicts that in both locations, more than 50% of the time, the
PM10 levels are within the very poor and severe categories. PM10 are
these particles can penetrate deep into the lungs. Exposure to high
blood pressure, heart attack, strokes and premature death. The sources of
NO2
The pie chart shows that on average, the NO2 levels in Aotizhongxin lie
under the very poor category while the NO2 levels in Dingling lie under the
Poor category. This means that the NO2 levels in Dingling are lower than
Aotizhongxin and Dingling. Road traffic and gas is the main cause for NO2
city, a lot of traffic runs through Aotizhongxin. Dingling on the other hand,
through there, NO2 levels will not be as high. NO2 can cause irritation of
eyes, nose and throat and when inhaled might cause lung irritations and
O3
The pie chart shows that the O3 levels in both locations are very similar, ranging
from the good to the very poor category. This means that the ozone levels in
the locations have been well maintained. The only concern would be
CO
The CO levels in both locations are also very similar, ranging from the
good to moderate category. This is a good sign which the general public
SO2
The SO2 levels in both locations do not differ much and range between
satisfactory to moderate category. This is also a good sign and should not
It seems that the PM2.5 and PM10 levels are of the greatest concern, with
the rest of the pollutants being within safe levels. Overall, the average AQI of both
locations is good and the air quality is good for the most part.
matter in the atmosphere. From the graph we can see that Dingling has a
greater level of good pollutant levels. This is most likely due to the
levels?
I plotted the two line graphs above to view the correlation between
temperature and average CO levels. The first graph shows the average
carbon monoxide levels against the months of a year. The second graph
China experiences summer with high temperatures in the middle of the year
which explains why the CO levels are lower in the middle of the year.
I did some further research to understand why the CO levels are higher at
lower temperatures.
The main cause of carbon monoxide is the incomplete combustion of fuel.
As mentioned before, there are about 253.76 million motorised vehicles in
China with Beijing ranking the highest of any Chinese city with 5.4 million
vehicles. As seen from the line graph, carbon monoxide levels are highest
at the start and end of the year, when it is winter in China and
temperatures are low. It is also at this time of the year that many people
use gas heaters that burn natural gas to keep an area warm during the
winter. At these low temperatures, the air is denser so more air is
introduced into the combustion process. This causes the air to fuel ratio to
be very high, causing a leaner mixture. If the mixture is too lean, it will not
burn effectively, causing incomplete combustion. In normal combustion,
the fuel contains carbon and hydrogen while the air contains oxygen.
During combustion, the carbon and hydrogen combine with the oxygen to
produce carbon dioxide and water, but if incomplete combustion occurs,
the carbon is not completely oxidised, forming soot and carbon monoxide
in the process.
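The difference between the two cases can be written as balanced equations. Methane is used here as a stand-in for natural gas, which is an assumption made for illustration; real fuels are mixtures and incomplete combustion can also produce soot (carbon) alongside carbon monoxide.

```latex
\begin{align*}
\text{Complete combustion:}\quad
  & \mathrm{CH_4 + 2\,O_2 \rightarrow CO_2 + 2\,H_2O} \\
\text{Incomplete combustion:}\quad
  & \mathrm{2\,CH_4 + 3\,O_2 \rightarrow 2\,CO + 4\,H_2O}
\end{align*}
```

In the incomplete case each carbon atom ends up bonded to one oxygen atom instead of two, which is why carbon monoxide appears in place of carbon dioxide.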
6.6. How do the SO2 and NO2 levels vary with rainfall with time?
I plotted the line graphs above to view the correlation of SO2 and NO2
levels with rainfall. The first graph shows the average NO2 and SO2 levels
against the months of a year. The purple line represents SO2, and the
brown line represents NO2. The second graph shows the average rain
against the months of the year. I stacked these graphs in a single
dashboard to view their correlation better and with greater convenience.
The second graph shows that the rainfall increases from the start to the
middle of the year and decreases again towards the end of the year. This
has been explained in question 1 as to why the rainfall is higher during the
middle of the year. From the first graph we can see that the NO2 and SO2
levels drop during the middle of the year when rainfall is highest.
The main sources of nitrogen dioxide are gasoline vehicles, power plants,
diesel-powered industrial equipment and industrial boilers. These sources
emit large amounts of nitrogen oxide gas, which is oxidised by oxygen in the
air to produce nitrogen dioxide.
The main sources of sulfur dioxide are from the burning of fossil fuels by
power plants and petroleum refineries. These processes release sulfur into
the environment as a by-product which combines with oxygen in the air to
form sulfur dioxide gas. Sulfur dioxide can also be produced naturally from
volcanic activity and biological decay, but this does not contribute much to
the sulfur dioxide levels in Aotizhongxin and Dingling.
The nitrogen dioxide gas is soluble in water, so it combines with water in the
air to produce nitric acid and nitrogen oxide gas. When rainfall occurs, the
raindrops are acidic due to the nitric acid dissolved within it, causing acid rain.
This contributes to 25% of the acidity in rainwater. Therefore, when rainfall
occurs, the nitrogen dioxide will dissolve in it which explains the drop in NO2
when rainfall increases.
As for the sulfur dioxide, it is also soluble in water and combines with water in
the air to produce sulfuric acid. When rainfall occurs, the raindrops are acidic
due to the sulfuric acid dissolved within it, causing acid rain. Furthermore, as
sulfuric acid is a strong acid, it can further dissociate to give hydrogen ions
and sulfate ions, causing the concentration of hydrogen ions in the sulfuric
acid to increase substantially, lowering the pH of rainwater even more, making
the acid rain more detrimental. This contributes to the other 75% of the acidity
in rainwater. Therefore, when rainfall occurs, the sulfur dioxide will dissolve in
it which explains the drop in SO2 when rainfall increases.
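The reactions described in the two paragraphs above can be summarised as balanced equations. The sulfur dioxide line is written as a net reaction and assumes atmospheric oxygen as the oxidant, since forming sulfuric acid from sulfur dioxide requires an oxidation step in addition to water.

```latex
\begin{align*}
\mathrm{3\,NO_2 + H_2O} &\rightarrow \mathrm{2\,HNO_3 + NO} \\
\mathrm{2\,SO_2 + O_2 + 2\,H_2O} &\rightarrow \mathrm{2\,H_2SO_4} \\
\mathrm{H_2SO_4} &\rightarrow \mathrm{2\,H^+ + SO_4^{2-}}
\end{align*}
```

The last line shows why sulfuric acid lowers the pH so strongly: each molecule can release two hydrogen ions.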
Acid rain can cause many environmental and infrastructure problems in the
world. It causes lakes to become so acidic that fish cannot survive, it releases
toxins like aluminum ions into our water supply, kills trees and crops and can
deteriorate stone buildings and structures. Furthermore, acid rain can also
occur in far away areas from where the SO2 and NO2 are emitted due to wind
currents.
7. Data Modeling
The partitioning node was used to divide the data so that 70% will be the training set
used by linear regression while the other 30% will be the test set that will be used by
the regression predictor. This was done to check the accuracy of the linear
regression learner without adding in new data.
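A 70/30 split can be sketched as follows. Shuffling before the cut is an assumption here (the report does not say whether the rows were drawn randomly or taken from the top), and the seed and function name are hypothetical.

```python
import random


def partition(rows, train_fraction=0.7, seed=42):
    """Shuffle the rows and split them into (training set, test set)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed -> repeatable split
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# 100 dummy row identifiers stand in for the cleaned dataset.
train, test = partition(range(100), train_fraction=0.7)
print(len(train), len(test))
```

Every row lands in exactly one of the two sets, so the model is scored on rows it never saw during training.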
The regression predictor node was used to append a prediction column in the data to
show what the linear regression model predicts the PM2.5 level is going to be.
The Numeric Scorer node was used to calculate the Root Mean Square Error
(RMSE) to check the accuracy of the linear regression model. The RMSE it
produced was 30.021 and the R-squared value it produced was 0.847. R-squared
is the proportion of the variance in the actual PM2.5 values that the model
explains; it usually lies between 0 and 1 and is often stated as a percentage from
0% to 100%. This means that the model explains about 84.7% of the variance in
PM2.5 on the test set.
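For reference, both scores can be computed directly from the actual and predicted columns. The sketch below uses the standard definitions (RMSE as the square root of the mean squared error, R-squared as one minus the ratio of residual to total sum of squares) on made-up numbers; it is not the Numeric Scorer node's own code.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(actual, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up values for illustration only.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
error = rmse(actual, predicted)
score = r_squared(actual, predicted)
print(round(error, 3), round(score, 3))
```

RMSE is in the same units as PM2.5, so it reads as a typical prediction error, while R-squared is unitless and compares the model against simply predicting the mean.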
To try to improve the accuracy of my model, I configured the Partitioning node to
use 80% of the data for the training set used by linear regression, while the other
20% is the test set used by the regression predictor.
After executing the linear regression model, I obtained an R-squared value of
0.853 and an RMSE of 29.457. This suggests the predictions are slightly more
accurate, although the two results were scored on different test sets.
To visually view the correlation between the predicted PM2.5 value and the actual
one, I used the CSV writer node to save the new dataset with the predicted PM2.5
values in my computer to plot a scatterplot of the predicted PM2.5 against the actual
PM2.5.
The scatterplot shows a strong relationship between the predicted PM2.5
and the actual PM2.5, which is consistent with the R-squared value. There
are some outliers at lower levels of actual PM2.5.
To view the accuracy of the linear regression model further, I plotted a residual plot
which plots the residuals against the predicted PM2.5 level. The residuals are
basically the difference between the actual PM2.5 level and predicted PM2.5 level.
The more accurate the prediction, the closer the residuals will be to zero.
We can see that the residual plot is heteroscedastic, although the trend line shows
that most of the residuals lie close to 0. The red line shows the outliers.
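A residual plot is built from the differences between actual and predicted values, and heteroscedasticity means the spread of those differences changes with the predicted level. The sketch below computes residuals and compares their spread in the lower and upper halves of the predictions. This split-in-half comparison is a crude illustrative check, not a formal test, and the numbers are made up.

```python
import statistics


def residuals(actual, predicted):
    """Residual = actual value minus predicted value."""
    return [a - p for a, p in zip(actual, predicted)]


def spread_by_half(predicted, resid):
    """Standard deviation of residuals for low vs high predictions."""
    pairs = sorted(zip(predicted, resid))  # order by predicted value
    half = len(pairs) // 2
    low = [r for _, r in pairs[:half]]
    high = [r for _, r in pairs[half:]]
    return statistics.pstdev(low), statistics.pstdev(high)


# Made-up values: errors grow as the predicted level grows.
predicted = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
actual = [11.0, 19.0, 31.0, 30.0, 62.0, 48.0]
resid = residuals(actual, predicted)
low_spread, high_spread = spread_by_half(predicted, resid)
print(resid, low_spread, high_spread)
```

A clearly larger spread in one half than the other points towards heteroscedastic residuals, which matters because it means the model's errors are not equally reliable across the range of PM2.5.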
8. Conclusion
I learnt how important analysing air pollutants is, as it can help ensure the
safety of people and our planet. Analysing such data and gaining insights can
also allow us to tackle the problem better and find ways to reduce its effects.
I learnt that many of our everyday activities contribute a large share of the air
pollutants without us even realising. Now that I know about these pollutants and
their ill effects, I know what to avoid in order to reduce air pollution.
I learnt that China has higher temperatures in the middle of the year and lower
temperatures at the start and end of the year. This affects the rainfall, air
pressure, ozone levels and carbon monoxide levels. I also learnt about how wind
can affect the level and types of air pollutants, how governments determine the
safety level of the country’s air quality and how rainfall occurs in China and how
some pollutants can make even rain dangerous.
9. Reflection
In terms of software, I now have a deeper understanding of how Tableau and
KNIME work and have learnt to appreciate how they can turn complicated data into
simple visual charts from which to gain insights and make predictions.
I feel that I have now grasped the information learnt in class better through
this project.
One problem I faced during this project was juggling it with other
schoolwork. I also struggled to find data that correlates well enough to form
sophisticated questions with deep meaning behind them. I feel my time
management is not the best, which is why I had these challenges during the
project.
10. References
[2] What Can the Growth of the Beijing Metropolitan Area Teach Us About Cities?
| Peak Urban (peak-urban.org)