
Data Visualisation and Analytics

(ESE1008)

Project Report

Declaration of Originality

I am the originator of this work and I have appropriately acknowledged
all other original sources used in this work.

I understand that Plagiarism is the act of taking and using the whole or
any part of another person’s work and presenting it as my own without
proper acknowledgement.

I understand that Plagiarism is an academic offence
and if I am found to have committed or abetted the offence of plagiarism
in relation to this submitted work, disciplinary action will be enforced.

Submitted By Student’s Signature

Jashan Deepak Nandwani ___ ___

Class: PE06

AY2021/2022 OCT SEMESTER

Content Page

1. Pre-Project Plan

2. Monitor

3. Introduction

4. Data Cleaning

5. Exploratory Data Analysis

6. Further Insights

7. Data Modeling

8. Conclusion

9. Reflection

10. References (if any)

1. Pre-Project Plan

Goal Setting

I aim to complete my project by 20/01/2022.

I shall take initiative to find out the information needed.

I shall check the project rubric to ensure all items are done before submission.

My data sets are _PRSA_Data_Aotizhongxin_20130301-2017022,
PRSA_Data_Dingling_20130301-20170228 and Location_GPS.

My preliminary questions that I will answer from my data set:

1. Is there a correlation between temperature, rainfall, and pressure?

2. Do temperature and relative humidity affect ozone levels?

3. Are any of the pollutants affected by the wind?

4. What are the conditions of the air quality in each location?

5. Is there a correlation between temperature and Carbon Monoxide levels?

6. How do the SO2 and NO2 levels vary with rainfall over time?

2. Monitor

Task/Milestone | By When | Actual Completed Date | Comment (on-time/delay/early)
Download the data. Understand the rows and columns. | 21 Nov 2021 | 20 Nov 2021 | early
Background research of delivery mode, function of eCommerce. | 28 Nov 2021 | 23 Nov 2021 | early
Perform data cleaning. | 3 Dec 2021 | 23 Nov 2021 | early
Perform data transformation. | 8 Dec 2021 | 23 Nov 2021 | early
Exploratory Data Analysis | 17 Dec 2021 | 3 Jan 2022 | late
Submit Report 1 | 24 Dec 2021 | 24 Dec 2022 | on-time
Answer my preliminary questions | 9 Jan 2022 | 15 Jan 2022 | late
Data modeling | 10 Jan 2022 | 16 Jan 2022 | late
Final report conclusion and reflection | 14 Jan 2022 | 23 Jan 2022 | late
Create Dashboard | 17 Jan 2022 | 17 Jan 2022 | on-time

3. Introduction

Today, the world is developing rapidly. Urbanisation is the process of increasing
the proportion of people living in urban areas. [1] This has been seen clearly in
Beijing over the past 30 years, with much cropland being converted to urban
areas. By the end of 2017, Beijing had a population of 21.7 million and an
urbanisation rate of 86.5%. In the past 30 years or so, the city’s population has
nearly tripled, and its built-up area has grown nearly eightfold. [2] Although
urbanisation brings many advantages to the economy of a country, it also has
adverse effects on air quality and on other environmental factors such as
temperature, pressure, rainfall, and dewpoint. Poor air quality can affect the
health of people of all ages living in the region. Short-term effects include
irritation of the eyes, nose, and throat and shortness of breath, which can
aggravate asthma and other respiratory conditions. Extended periods of inhaling
polluted air can affect the heart and cardiovascular system and can lead to lung
cancer and heart disease. [3]

As urbanisation is an ongoing process that leads to further deterioration in air
quality and the environment, data was collected at Aotizhongxin and Dingling to
measure the levels of pollutants in the air and some other environmental
variables at two different locations within Beijing. This was done to check that
the air quality in Beijing is at an acceptable level and to deduce trends and
gather insights into these variables, in the hope of combating the growing issue
in the near future.

The Concatenate node was used to join the data of each station into one single
dataset.

The Concatenated table allows us to learn more about the variables in the data

There are 70128 observations in the concatenated table (both locations combined)
and 17 variables.

Name of variable | Data type
year | Numerical, discrete, from 2013 to 2017
month | Numerical, discrete, from 1 to 12
day | Numerical, discrete, from 1 to 31
hour | Numerical, discrete, from 0 to 23
PM2.5 | Numerical, continuous, range from 4 to 271
PM10 | Numerical, continuous, range from 7 to 310
SO2 | Numerical, continuous, range from 2 to 67
NO2 | Numerical, continuous, range from 2 to 126
CO | Numerical, continuous, range from 200 to 3900
O3 | Numerical, continuous, range from 2 to 207
TEMP | Numerical, continuous, range from -5.2 to 32
PRES | Numerical, continuous, range from 992.4 to 1028.7
DEWP | Numerical, continuous, range from -21.8 to 22.8
RAIN | Numerical, continuous, range from 0 to 0.2
wd | Categorical: N, NE, E, SE, S, SW, W, NW
WSPM | Numerical, continuous, range from 0.1 to 5.1
station | Categorical: Aotizhongxin and Dingling

Two Rule Engine nodes were used to append a longitude and a latitude column
to the dataset using the ‘Location_GPS’ data given.

[Figures: Rule Engine nodes for Latitude and Longitude]

The classified values now show the Longitude and Latitude of the stations.
However, they are stored as strings.

The string to number node was used to transform the latitude and longitude
values into Number (double).

The transformed table now shows that the latitude and longitude values are
identified as numbers.
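
The two steps above (a rule that maps each station to its coordinates, followed by a string-to-number conversion) can be sketched in Python. This is only an illustration of the idea, not the KNIME workflow itself; the coordinate strings below are placeholder values assumed for the example, and the real ones come from the Location_GPS file.

```python
# Placeholder coordinates (assumed for illustration only); the real values
# are taken from the Location_GPS data set.
STATION_GPS = {
    "Aotizhongxin": ("39.982", "116.397"),
    "Dingling": ("40.292", "116.220"),
}


def add_coordinates(rows):
    """Append Latitude and Longitude to each row based on its station."""
    enriched = []
    for row in rows:
        # Rule step: look up the coordinate strings for this station.
        lat_text, lon_text = STATION_GPS[row["station"]]
        # Conversion step: turn the strings into floating-point numbers.
        enriched.append(
            {**row, "Latitude": float(lat_text), "Longitude": float(lon_text)}
        )
    return enriched


rows = [
    {"station": "Aotizhongxin", "PM2.5": 4.0},
    {"station": "Dingling", "PM2.5": 7.0},
]
result = add_coordinates(rows)
```

Keeping the lookup in one dictionary means a new station only needs one new entry.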

4. Data Cleaning

The Statistics node was used to show how many missing values there are in the
dataset.

Name of the variable Number of missing values
year 0
month 0
day 0
hour 0
PM2.5 1704
PM10 1374
SO2 1665
NO2 2257
CO 3788
O3 2933
TEMP 73
PRES 70
DEWP 73
RAIN 71
WSPM 57
Latitude 0
Longitude 0

The Missing Value node was used to remove all the rows that contain missing
values.

The output table from the missing value node shows that the number of rows has
reduced from 70128 to 63275. 6853 rows were lost due to missing values. There
are now 19 variables compared to the previous 17 due to the addition of the
longitude and latitude columns.
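
As a rough illustration of what removing rows with missing values does, the sketch below drops any row that has an empty field and counts how many rows were lost. It is a minimal stand-in written for this report, not the KNIME node; the sample rows are made up.

```python
def drop_incomplete_rows(rows):
    """Keep only rows in which no field is missing (None or empty string)."""
    return [
        row for row in rows
        if all(value is not None and value != "" for value in row.values())
    ]


# Made-up sample: the second and third rows each have one missing value.
sample = [
    {"PM2.5": 4.0, "CO": 300.0},
    {"PM2.5": None, "CO": 300.0},
    {"PM2.5": 8.0, "CO": ""},
]
clean = drop_incomplete_rows(sample)
rows_lost = len(sample) - len(clean)

# The same subtraction applied to the row counts reported above.
report_rows_lost = 70128 - 63275
```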

Statistics node was used again to verify that there is no more missing data.

The statistics view now shows that there are no missing values in all the variables
which implies that the data is now clean.

CSV Writer node was used to export the clean data into a new CSV file.

5. Exploratory Data Analysis

5.1. Statistical Result

Column Min Max Mean Standard Deviation Median
PM2.5 3 713 74.3076 76.9894 49
PM10 2 948 96.9551 88.0727 72
SO2 0.5712 229 14.5687 19.71 7
NO2 2 290 43.2842 35.8586 33
CO 100 10,000 1,089.1751 1,094.0104 700
O3 0.2142 500 62.7076 56.6036 54
TEMP -16.8 41.4 13.6489 11.4122 14.6
PRES 982.8 1,042 1,009.7823 10.5128 1,009.1
DEWP -35.3 28.5 2.326 13.7858 2.9
RAIN 0.0 52.1 0.0653 0.8068 0.0
WSPM 0.0 11.2 1.7815 1.2532 1.4

5.2. PM2.5

50% of the PM2.5 levels in both locations are below 49ug/m3.

PM2.5 should be interpreted as a continuous data type to give meaningful
insight.

PM2.5 levels range from 3ug/m3 to 713ug/m3.

Interquartile range = 104 – 17 = 87

The box and whisker plot shows that the PM2.5 levels are positively skewed
and that a significant amount of data lies beyond the upper whisker (>234),
which indicates potential outliers.
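
The interquartile range and the whisker limit can be checked with a short calculation. The sketch below assumes the conventional 1.5 x IQR rule for the whisker fences; the report does not state which rule the plotting tool used, so this is an assumption, although it agrees with the reported upper whisker for PM2.5.

```python
def iqr(q1, q3):
    """Interquartile range: upper quartile minus lower quartile."""
    return q3 - q1


def upper_fence(q1, q3, k=1.5):
    """Largest value not treated as an outlier under the k x IQR rule."""
    return q3 + k * iqr(q1, q3)


def lower_fence(q1, q3, k=1.5):
    """Smallest value not treated as an outlier under the k x IQR rule."""
    return q1 - k * iqr(q1, q3)


# Quartiles for PM2.5 as reported in this section.
pm25_q1, pm25_q3 = 17, 104
pm25_iqr = iqr(pm25_q1, pm25_q3)
pm25_upper = upper_fence(pm25_q1, pm25_q3)
pm25_lower = lower_fence(pm25_q1, pm25_q3)
```

Because PM2.5 cannot be negative, the lower fence falls below the minimum, so potential outliers appear only on the upper side.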

5.3. PM10

50% of the PM10 levels in both locations are below 72ug/m3.

PM10 should be interpreted as a continuous data type to give meaningful
insight.

PM10 levels range from 2ug/m3 to 948ug/m3.

Interquartile range = 137 – 31 = 106

The box and whisker plot shows that the PM10 levels are positively skewed
and that a significant amount of data lies beyond the upper whisker (>295),
which indicates potential outliers.

5.4. Sulfur Dioxide

50% of the SO2 levels in both locations are below 7ug/m3.

SO2 should be interpreted as a continuous data type to give meaningful
insight.

SO2 levels range from 0.5712ug/m3 to 229ug/m3.

Interquartile range = 18 – 2 = 16

The box and whisker plot shows that the SO2 levels are positively skewed and
that a significant amount of data lies beyond the upper whisker (>41), which
indicates potential outliers.

5.5. Nitrogen Dioxide

50% of the NO2 levels in both locations are below 33ug/m3.

NO2 should be interpreted as a continuous data type to give meaningful
insight.

NO2 levels range from 2ug/m3 to 290ug/m3.

Interquartile range = 64 – 15 = 49

The box and whisker plot shows that the NO2 levels are positively skewed and
that a significant amount of data lies beyond the upper whisker (>137), which
indicates potential outliers.

5.6. Carbon Monoxide

50% of the CO levels in both locations are below 700ug/m3.

CO should be interpreted as a continuous data type to give meaningful insight.

CO levels range from 100ug/m3 to 10000ug/m3.

Interquartile range = 1300 – 400 = 900

The box and whisker plot shows that the CO levels are positively skewed and
that a significant amount of data lies beyond the upper whisker (>2600), which
indicates potential outliers.

5.7. Ozone

50% of the O3 levels in both locations are below 54ug/m3.

O3 should be interpreted as a continuous data type to give meaningful insight.

O3 levels range from 0.2142ug/m3 to 500ug/m3.

Interquartile range = 88 – 17 = 71

The box and whisker plot shows that the O3 levels are roughly symmetric within
the box, but a significant amount of data lies beyond the upper whisker (>194),
which indicates potential outliers and a long upper tail.

5.8. Windspeed

50% of the windspeed levels in both locations are below 1.4m/s.

Windspeed should be interpreted as a continuous data type to give meaningful
insight.

Windspeed levels range from 0m/s to 11.2m/s.

Interquartile range = 2.3 – 1 = 1.3

The box and whisker plot shows that the windspeed levels are positively
skewed and that a significant amount of data lies beyond the upper whisker
(>4.2), which indicates potential outliers.

5.9. Temperature

50% of the temperature levels in both locations are below 14.6 degrees Celsius.

Temperature should be interpreted as a continuous data type to give
meaningful insight.

Temperature levels range from -16.8 degrees Celsius to 41.4 degrees Celsius.

Interquartile range = 23.3 – 3.2 = 20.1

The box and whisker plot shows that the temperature levels are negatively
skewed and that none of the data lies beyond the upper and lower whiskers
(>41.4 and <-16.8), which indicates that there are no outliers.

5.10. Pressure

50% of the pressure levels in both locations are below 1009.1hPa.

Pressure should be interpreted as a continuous data type to give meaningful
insight.

Pressure levels range from 982.8hPa to 1042hPa.

Interquartile range = 1018.1 – 1001.28 = 16.82

The box and whisker plot shows that the pressure levels are normally
distributed and that none of the data lies beyond the upper and lower whiskers
(>1042 and <982.8), which indicates that there are no outliers.

5.11. Dewpoint

50% of the dewpoint levels in both locations are below 2.8 degrees Celsius.

Dewpoint should be interpreted as a continuous data type to give meaningful
insight.

Dewpoint levels range from -35.3 degrees Celsius to 28.5 degrees Celsius.

Interquartile range = 15 – (-9.3) = 24.3

The box and whisker plot shows that the dewpoint levels are normally
distributed and that none of the data lies beyond the upper and lower whiskers
(>28.5 and <-35.3), which indicates that there are no outliers.

5.12. Rain

Rain should be interpreted as a continuous data type to give meaningful
insight.

Rain levels range from 0mm to 52.1mm.

The box and whisker plot shows the upper whisker, upper quartile, median,
lower quartile and lower whisker all at 0 because the majority of the rain
data is 0, so any rain value greater than 0 can be considered an outlier.

5.13. Wind Direction

Wind direction should be interpreted as a categorical data type to give meaningful
insight. As seen from the pie charts of the average wind directions over the 5 years,
Aotizhongxin experiences more wind from the Northeast and East-Northeast
directions while Dingling experiences more wind from the North-Northwest direction.

5.14. Correlation between variables

From the correlation matrix, we can tell which variables have a positive, negative or
no correlation. We can also tell if these correlations are strong or weak.

Strong positive correlation
• PM2.5 – PM10
• PM2.5 – CO
• NO2 – CO
• TEMP – DEWP

Weak positive correlation
• PM2.5 – SO2
• PM2.5 – NO2
• PM2.5 – DEWP
• PM10 – SO2
• PM10 – NO2
• PM10 – CO
• PM10 – DEWP
• SO2 – NO2
• SO2 – CO
• SO2 – PRES
• NO2 – PRES
• CO – PRES
• O3 – TEMP
• O3 – DEWP
• O3 – WSPM
• PRES – WSPM
• DEWP – RAIN

No correlation
• PM2.5 – PRES
• PM2.5 – RAIN
• PM10 – PRES
• PM10 – RAIN
• SO2 – RAIN
• NO2 – DEWP
• NO2 – RAIN
• CO – RAIN
• O3 – RAIN
• TEMP – RAIN
• TEMP – WSPM
• RAIN – WSPM

Weak negative correlation
• PM2.5 – O3
• PM2.5 – TEMP
• PM2.5 – WSPM
• PM10 – O3
• PM10 – TEMP
• PM10 – WSPM
• SO2 – O3
• SO2 – TEMP
• SO2 – DEWP
• SO2 – WSPM
• NO2 – O3
• NO2 – TEMP
• NO2 – WSPM
• CO – O3
• CO – TEMP
• CO – DEWP
• CO – WSPM
• O3 – PRES
• PRES – RAIN
• DEWP – WSPM

Strong negative correlation
• TEMP – PRES
• PRES – DEWP

These are the scatter plots that support the correlation claims and show them
graphically.
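
The grouping of variable pairs into strong, weak and no correlation can be sketched as follows. The Pearson coefficient is computed directly from its definition; the cut-off values (0.7 for "strong" and 0.1 for "no correlation") are assumptions chosen for this example, since the report does not state the thresholds used to read the correlation matrix.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def classify(r, strong=0.7, negligible=0.1):
    """Label a coefficient; both thresholds are illustrative assumptions."""
    if abs(r) < negligible:
        return "no correlation"
    strength = "strong" if abs(r) >= strong else "weak"
    sign = "positive" if r > 0 else "negative"
    return f"{strength} {sign} correlation"


# Made-up values in which the second series is exactly double the first.
r_example = pearson([1, 2, 3], [2, 4, 6])
label_example = classify(r_example)
```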

5.15. Calculated Fields

I created a variable called relative humidity with dewpoint and temperature
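
One common way to derive relative humidity from temperature and dewpoint is the Magnus approximation, sketched below. The report does not say which formula was used for the calculated field, so the formula and its constants (17.625 and 243.04 degrees Celsius) are assumptions made for this illustration.

```python
import math


def relative_humidity(temp_c, dewpoint_c, a=17.625, b=243.04):
    """Approximate relative humidity (%) using the Magnus formula.

    temp_c and dewpoint_c are in degrees Celsius. The constants a and b
    are one commonly used parameter set, assumed here for illustration.
    """
    saturation = math.exp(a * temp_c / (b + temp_c))
    actual = math.exp(a * dewpoint_c / (b + dewpoint_c))
    return 100.0 * actual / saturation


# When the dewpoint equals the temperature, the air is saturated.
rh_saturated = relative_humidity(20.0, 20.0)
# A dewpoint well below the temperature gives a much lower humidity.
rh_dry = relative_humidity(25.0, 10.0)
```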

6. Further Insights

6.1. Is there a correlation between temperature, rainfall, and pressure?

To view the correlation, I plotted 3 graphs. The first graph shows the average
rain and the average temperature in Aotizhongxin against the months of the year.
The bars represent the rain while the line represents the temperature. The
second graph is the same as the first but uses the rain and temperature values
from Dingling instead. The last graph shows the average air pressure against the
months of the year. The 3 graphs are stacked on top of one another in a single
dashboard so that the correlation between the variables is easier to see. From
the graphs, we can see that in general both locations follow the same trend in
rain and temperature levels: when temperature increases, rain also increases.
This increase in rain and temperature usually occurs only in the middle of the
year, between June and August, before falling again towards the end of the year.
Furthermore, we can see that air pressure falls during the hot rainy season and
rises as the temperature and rainfall levels drop.
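
The monthly averages plotted in these graphs amount to grouping the hourly rows by month and taking the mean of each group. A minimal sketch is shown below; the column names follow the data set, but the sample values are made up for the example.

```python
from collections import defaultdict


def monthly_mean(rows, column):
    """Return {month: mean of column} for the given rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row["month"]] += row[column]
        counts[row["month"]] += 1
    return {month: totals[month] / counts[month] for month in sorted(totals)}


# Made-up hourly readings for two months.
sample = [
    {"month": 1, "TEMP": -2.0, "RAIN": 0.0},
    {"month": 1, "TEMP": 0.0, "RAIN": 0.0},
    {"month": 7, "TEMP": 26.0, "RAIN": 0.4},
    {"month": 7, "TEMP": 28.0, "RAIN": 0.0},
]
mean_temp = monthly_mean(sample, "TEMP")
mean_rain = monthly_mean(sample, "RAIN")
```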

After further analysis of the data, I can deduce that in general, the maximum
rain and temperature occur in July for both locations. I then looked at the
individual days of July and plotted another bar and line chart to see the
distribution of rain and temperature within this month.

We can see that both locations experience almost exactly the same rainfall and
temperatures across the days of the month. The temperatures in both locations
do not vary much through the month and stay within 25 to 30 degrees Celsius.
However, the rain in both locations is not consistent. For the most part, little
to no rain occurs throughout the month, with sudden extreme rainfall on a few
days.

I researched the reasons why the correlation between temperature, rainfall
and pressure occurs in Aotizhongxin and Dingling.

From June to September, China experiences its summer season. This is due to the
Earth’s axis being tilted at 23.4 degrees. Because of China’s geographical
location, from June to September the rays from the sun hit China at a steep
angle, causing more energy to reach China during that period. Together with
longer hours of daylight during summer, this causes higher temperatures in China
from June to September.

According to the line graph, it is evident that temperatures in Aotizhongxin
and Dingling are greater in the middle of the year compared to the start and
end, peaking around June to September.

During this June to September period, another phenomenon occurs, known as the
southeast monsoon. As China is in the Northern Hemisphere, it experiences
summer at this time, when the air in the region heats up, expands, and rises.
This forms a region of low pressure over the area. At the same time, places like
Australia that are in the Southern Hemisphere are experiencing winter. The low
temperatures cause the air to be cold and dense, creating an area of high
pressure. As air tends to move from areas of high pressure to areas of low
pressure, this forms winds blowing from Australia towards China as southwest
monsoon winds. This wind is dry and cold. As the winds cross the equator, the
Coriolis effect deflects them so that they become southeast monsoon winds. This
wind heats up and picks up moisture as it passes over the Pacific Ocean and
brings heavy rains to China.

According to the bar charts, the southeast monsoon is evident as there is a
general trend of higher rainfall in the middle of the year, peaking around July
and August. Furthermore, the air pressure line graph supports this explanation,
as the air pressure decreases during the summer period and increases during
winter.

6.2. Do temperature and relative humidity affect ozone levels?

I plotted 3 graphs to show the correlation between humidity, ozone levels and
temperature. The first graph shows the average ozone levels against the relative
humidity levels. The second graph shows the average ozone levels against the
temperature, and the last graph shows the average ozone against the months in a
year. The graphs were stacked in a single dashboard so that the viewer can
compare them easily and see the correlation between the variables.

From the graphs, we can see that in general, ozone levels rise in the middle of
the year and are lower at the start and end of the year. We can also see that
temperature and ozone levels have a direct relationship: when temperature
increases, ozone levels increase. Lastly, we can see that relative humidity
affects the ozone levels inversely: when relative humidity increases, ozone
levels decrease. However, a change in temperature causes a greater change in
ozone levels than a change in relative humidity. As I learnt from question 1,
China experiences higher temperatures in the middle of the year, which explains
why ozone levels are higher in the middle of the year.

I conducted some research as to why this correlation occurs.

Ground-level ozone is formed through chemical reactions between precursor
pollutants from sources like fossil fuel combustion, vehicles, landfills and
more. These reactions are driven by heat and sunlight, so during summer, when
the hours of sunlight are longer and temperatures are higher, a higher amount of
ground-level ozone is produced.

Relative humidity is the ratio of the actual amount of water vapour present in
the air to the total amount of water vapour the air can hold at a given
temperature. At higher temperatures, the air can hold more water vapour, causing
the relative humidity to decrease. This means that at high relative humidity,
temperatures tend to be lower, which is why the level of ground-level ozone
decreases at higher relative humidity. An increase in relative humidity can also
allow for the rehydration of desiccated pathogens, which decreases their
resistance to ozone action.

6.3. Are any of the pollutants affected by the wind?

As seen from the map, which I plotted using the latitude and longitude values of each
location, Dingling is North-Northwest of Aotizhongxin and Aotizhongxin is
South-Southeast of Dingling.

To compare these two locations and show which pollutants are affected by wind, I
plotted a bar graph of the average amount of each pollutant in each wind direction for
each location.

From the graph, we can see that all the pollutants are affected by wind. This is
evident because carbon monoxide, ozone, nitrogen dioxide, PM10, PM2.5 and sulfur
dioxide all increase and decrease when subjected to the same wind directions.
However, although the pollutants are all affected by the wind, they vary by
differing amounts. For example, in Aotizhongxin, the difference in CO levels
between North and Northeast winds is an increase of 613ug/m3, while the
difference in SO2 levels between North and Northeast winds is an increase of
8ug/m3.

I conducted some research to find out more about why this occurs.

The reason for this is the density of each pollutant. Carbon monoxide, PM10 and
PM2.5 are less dense than air, so a slight wind causes a larger difference in
pollutant levels. Nitrogen dioxide, sulfur dioxide and ozone are denser than
air, so a greater wind speed is required to change the levels of these
pollutants.

Another way to see that wind affects the pollutant levels is to compare the
pollutant levels in each location when wind is present and when it is not. For
the example below, I compared the level of pollutants in each location when
there was wind coming from the North-Northwest direction and when there was no
wind.

The bar graph below shows the average level of each pollutant in Aotizhongxin
when there is no wind.

This is the bar graph showing the average level of each pollutant in Aotizhongxin when
there is wind coming from the North-Northwest direction.

This is the bar graph showing the average level of each pollutant in Dingling when
there is wind coming from the North-Northwest direction.

The bar graph below shows the average level of each pollutant in Dingling when there
is no wind.

Pollutant | Aotizhongxin, no wind | Aotizhongxin, NNW wind | Dingling, no wind | Dingling, NNW wind
CO | 3400 | 753.1 | 1966 | 862.5
O3 | 50 | 65.2 | 26 | 58.1
SO2 | 13 | 8.6 | 8 | 8.6
NO2 | 197 | 36.2 | 44 | 23.1
PM10 | 256 | 77.3 | 117 | 72.2
PM2.5 | 229 | 38 | 110 | 54.8

It can be seen that in both locations, most of the pollutants decrease in level when
wind is present. This is because when there is no wind, the stagnant conditions can
cause the pollution levels to build up and become more concentrated, while winds can
disperse the pollutants and dilute the concentrations. One likely reason for the rise
in ozone levels when wind is present is that the wind speeds in the North-Northwest
direction are very low, causing little to no dispersal of the ozone because of its
density compared to other pollutants. Another possibility is that the winds were
carrying ozone from a different region of China with high ozone levels, causing the
ozone levels in both locations to rise when wind was present.

6.4. What are the conditions of the air quality in each location?

The air quality index (AQI) is an index used by government agencies to determine
the level of air pollution in the country. Different countries have different
air quality indexes according to their national air quality standards. As air
pollutants can pose a risk to public health, the air quality index acts as an
indicator of what measures the public and government have to take to minimise
casualties. The pollutants normally measured to determine the AQI are PM10,
PM2.5, NO2, O3, CO and SO2. However, NH3 and Pb levels are sometimes also taken
into account.

This is the Air Quality Index for China.

Category | AQI Range | Definition
Good | 0-50 | The air quality is considered satisfactory, and air pollution poses little to no risk.
Satisfactory | 51-100 | The air quality is acceptable but can pose a moderate health concern for a population of people sensitive to air pollution.
Moderate | 101-200 | Members of sensitive groups may experience health effects. The general public is not likely to be affected.
Poor | 201-300 | Everyone may begin to experience health effects while members of sensitive groups may experience more serious health effects.
Very Poor | 301-400 | Health alert. Everyone may experience serious health effects.
Severe | 401-500 | Health warning of emergency conditions. The entire population is likely to be affected.

For my data, I will be determining the AQI levels in each location using
PM10, PM2.5, NO2, O3, CO and SO2 levels only, as I do not have NH3 and
Pb data.
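
Once an AQI value is known, assigning it to one of the categories in the table is a simple range lookup, sketched below. The sketch covers only this lookup step; how pollutant concentrations are converted into an AQI value is not described in this report and is not shown. Treating values above 500 as Severe is an assumption made for the example.

```python
# Upper bound of each AQI band, taken from the table above.
AQI_BANDS = [
    (50, "Good"),
    (100, "Satisfactory"),
    (200, "Moderate"),
    (300, "Poor"),
    (400, "Very Poor"),
    (500, "Severe"),
]


def aqi_category(aqi):
    """Return the category name for a non-negative AQI value."""
    if aqi < 0:
        raise ValueError("AQI cannot be negative")
    for upper_bound, name in AQI_BANDS:
        if aqi <= upper_bound:
            return name
    # Assumption: anything above the last band is still reported as Severe.
    return "Severe"


categories = [aqi_category(value) for value in (35, 100, 101, 350, 480)]
```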

PM2.5

The pie chart shows that in both locations, the PM2.5 levels fall in the Very
Poor and Severe categories. This can pose an extreme risk to public health.
PM2.5 refers to particles that are less than 2.5 micrometres in diameter. When
inhaled, these particles are dangerous as they can penetrate deeply into the
lungs, irritate and corrode the alveolar wall, and consequently impair lung
function. They can also enter the bloodstream by diffusion, affecting the heart,
kidneys and other organs. The causes of these high PM2.5 levels are most likely
the emissions of motorised vehicles and factories. In China, there are about
253.76 million vehicles and 399.4 thousand industrial enterprises. With China
being the most populous country and a manufacturing giant for the world, this
results in very high PM2.5 levels.

PM10

The pie chart shows that in both locations, the PM10 levels are within the Very
Poor and Severe categories more than 50% of the time. PM10 refers to particles
that are less than 10 micrometres in diameter. When inhaled, these particles can
penetrate deep into the lungs. Exposure to high concentrations of PM10 can
result in a number of health impacts, ranging from coughing and wheezing to
asthma attacks and bronchitis to high blood pressure, heart attack, strokes and
premature death. The sources of PM10 include fugitive dust, wildfires and
industrial sources.

NO2

The pie chart shows that on average, the NO2 levels in Aotizhongxin fall in the
Very Poor category while the NO2 levels in Dingling fall in the Poor category.
This means that, on average, NO2 levels in Aotizhongxin are worse than those in
Dingling. The reason for this result is the locations of Aotizhongxin and
Dingling. Road traffic and gas are the main causes of NO2 emissions, as NO2 is a
byproduct of burning fuel. As Aotizhongxin is in the city, a lot of traffic runs
through it. Dingling, on the other hand, is in a mountainous area surrounded by
villages. As not much traffic runs through there, NO2 levels will not be as
high. NO2 can cause irritation of the eyes, nose and throat and, when inhaled,
might cause lung irritation and decreased lung function. In areas with higher
levels of nitrogen dioxide, there is a greater chance of asthma attacks and
respiratory issues.

O3

The pie chart shows that the O3 levels in both locations are very similar,
ranging from the Good to the Very Poor category. This means that the ozone
levels in the locations have been well maintained. The only concern would be
for people who are sensitive to ozone.

CO

The CO levels in both locations are also very similar, ranging from the Good to
the Moderate category. This is a good sign, and the general public should not be
concerned about it.

SO2

The SO2 levels in both locations do not differ much and range between the
Satisfactory and Moderate categories. This is also a good sign and should not be
a major concern for the general public.

Overall air quality

As it seems that the PM2.5 and PM10 levels are of the greatest concern, with the
rest of the pollutants being within safe levels, the average AQI of both
locations is good and the air quality is good overall for the most part. People
should still be concerned about the hazardous levels of particulate matter in
the atmosphere. From the graph, we can see that Dingling has a greater share of
good pollutant levels. This is most likely due to the different environments
that Aotizhongxin and Dingling are in. Overall, it would be safer to travel to
Dingling.

6.5. Is there a correlation between temperature and Carbon Monoxide levels?

I plotted the two line graphs above to view the correlation between temperature
and average CO levels. The first graph shows the average carbon monoxide levels
against the months of a year. The second graph shows the average CO levels
against temperature. We can see that CO levels are higher at lower temperatures.
As we learnt in question 1, China experiences summer with high temperatures in
the middle of the year, which explains why the CO levels are lower in the middle
of the year.

I did some further research to understand why the CO levels are higher at
lower temperatures.

The main cause of carbon monoxide is the incomplete combustion of fuel.
As mentioned before, there are about 253.76 million motorised vehicles in
China, with Beijing ranking the highest of any Chinese city with 5.4 million
vehicles. As seen from the line graph, carbon monoxide levels are highest
at the start and end of the year, when it is winter in China and
temperatures are low. It is also at this time of the year that many people
use gas heaters that burn natural gas to keep an area warm during the
winter. At these low temperatures, the air is denser, so more air is
introduced into the combustion process. This causes the air-to-fuel ratio to
be very high, giving a leaner mixture. If the mixture is too lean, it will not
burn effectively and will cause incomplete combustion. In normal combustion,
the fuel contains carbon and hydrogen while the air contains oxygen.
During combustion, the carbon and hydrogen combine with the oxygen to
produce carbon dioxide and water, but if incomplete combustion occurs,
the carbon is not completely oxidised, forming soot and carbon monoxide
in the process.

6.6. How do the SO2 and NO2 levels vary with rainfall over time?

I plotted the line graphs above to view the correlation of SO2 and NO2
levels with rainfall. The first graph shows the average NO2 and SO2 levels
against the months of a year. The purple line represents SO2, and the
brown line represents NO2. The second graph shows the average rain
against the months of the year. I stacked these graphs in a single
dashboard to view their correlation better and with greater convenience.
The second graph shows that the rainfall increases from the start to the
middle of the year and decreases again towards the end of the year. This
has been explained in question 1 as to why the rainfall is higher during the
middle of the year. From the first graph we can see that the NO2 and SO2
levels drop during the middle of the year when rainfall is highest.

I performed some research as to why this occurs.

The main sources of nitrogen dioxide are gasoline vehicles, power plants,
diesel-powered industrial equipment and industrial boilers. These sources
emit large amounts of nitrogen oxide gas, which is oxidised by oxygen in the
air to produce nitrogen dioxide.

The main sources of sulfur dioxide are the burning of fossil fuels by
power plants and petroleum refineries. These processes release sulfur into
the environment as a by-product, which combines with oxygen in the air to
form sulfur dioxide gas. Sulfur dioxide can also be produced naturally from
volcanic activity and biological decay, but this does not contribute much to
the sulfur dioxide levels in Aotizhongxin and Dingling.

Nitrogen dioxide gas is soluble in water, so it combines with water in the
air to produce nitric acid and nitrogen oxide gas. When rainfall occurs, the
raindrops are acidic due to the nitric acid dissolved within them, causing acid
rain. This contributes to 25% of the acidity in rainwater. Therefore, when
rainfall occurs, the nitrogen dioxide dissolves in it, which explains the drop
in NO2 when rainfall increases.

As for sulfur dioxide, it is also soluble in water. It is oxidised in the air
and combines with water to produce sulfuric acid. When rainfall occurs, the
raindrops are acidic due to the sulfuric acid dissolved within them, causing
acid rain. Furthermore, as sulfuric acid is a strong acid, it dissociates to
give hydrogen ions and sulfate ions, which substantially increases the
concentration of hydrogen ions in the rainwater and lowers its pH even more,
making the acid rain more detrimental. This contributes to the other 75% of the
acidity in rainwater. Therefore, when rainfall occurs, the sulfur dioxide
dissolves in it, which explains the drop in SO2 when rainfall increases.
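
The chemistry described in the two paragraphs above can be summarised with simplified overall reactions. These equations are standard textbook chemistry rather than something taken from the sources I read, and they leave out the intermediate steps that occur in the real atmosphere.

```latex
\begin{align*}
3\,\mathrm{NO_2} + \mathrm{H_2O} &\rightarrow 2\,\mathrm{HNO_3} + \mathrm{NO} \\
2\,\mathrm{SO_2} + \mathrm{O_2} &\rightarrow 2\,\mathrm{SO_3} \\
\mathrm{SO_3} + \mathrm{H_2O} &\rightarrow \mathrm{H_2SO_4} \\
\mathrm{H_2SO_4} &\rightarrow 2\,\mathrm{H^+} + \mathrm{SO_4^{2-}}
\end{align*}
```

The first line shows nitrogen dioxide forming nitric acid and nitrogen oxide, and the last line shows why each unit of sulfuric acid can release two hydrogen ions.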

Acid rain can cause many environmental and infrastructure problems around the
world. It makes lakes so acidic that fish cannot survive, releases toxins
like aluminum ions into our water supply, kills trees and crops and can
deteriorate stone buildings and structures. Furthermore, acid rain can also
fall in areas far from where the SO2 and NO2 are emitted, because wind
currents carry the gases away.

7. Data Modeling

The file reader node was used to read my data.


The linear regression node is used to predict the particulate matter smaller than 2.5
micrometres in diameter (PM2.5) levels from the particulate matter smaller than 10
micrometres in diameter (PM10), sulfur dioxide (SO2), nitrogen dioxide (NO2),
carbon monoxide (CO), ozone (O3), temperature (TEMP), pressure (PRES),
dewpoint (DEWP), rain (RAIN) and windspeed (WSPM).
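
To illustrate what a linear regression learner does, here is a minimal sketch of ordinary least squares in plain Python. It is not KNIME's implementation, which this report does not describe. The functions `fit_linear_regression` and `predict` are hypothetical names, only two predictors are used instead of the ten listed above, and the data is made up so that the true coefficients are known.

```python
def fit_linear_regression(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    X is a list of feature rows; an intercept column of 1s is added here.
    Returns [intercept, coef_1, ..., coef_k].
    """
    rows = [[1.0] + list(r) for r in X]
    n = len(rows[0])
    # Build the augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


def predict(coefs, row):
    """Apply the fitted intercept and coefficients to one feature row."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


# Made-up data generated from target = 5 + 0.5 * x1 + 2 * x2 (no noise),
# where x1 and x2 stand in for two predictors such as PM10 and CO.
X = [[20.0, 1.0], [40.0, 3.0], [60.0, 2.0], [80.0, 5.0], [100.0, 4.0]]
y = [5 + 0.5 * x1 + 2 * x2 for x1, x2 in X]

coefs = fit_linear_regression(X, y)
print([round(c, 6) for c in coefs])
print(round(predict(coefs, [50.0, 2.0]), 6))
```

Because the made-up data has no noise, the fitted coefficients should recover the intercept and slopes used to generate it. Real pollutant data is noisy, so the fit would only be approximate.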

The partitioning node was used to divide the data so that 70% became the training set
used by the linear regression learner, while the other 30% became the test set used by
the regression predictor. This was done so that the accuracy of the linear regression
model could be checked on data it had not been trained on, without adding in new data.
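
A train/test split like this can be sketched in a few lines of Python. This assumes the rows are drawn at random, which is only one of the ways a partitioning step can be configured. The function `partition` and its `seed` parameter are hypothetical names for illustration.

```python
import random


def partition(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100))  # stand-in for 100 rows of data
train, test = partition(rows, train_fraction=0.7)
print(len(train), len(test))
```

Changing `train_fraction` to 0.8 would give an 80/20 split instead.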

The regression predictor node was used to append a prediction column in the data to
show what the linear regression model predicts the PM2.5 level is going to be.

The numeric scorer node was used to calculate the Root Mean Square Error
(RMSE) to check the accuracy of the linear regression model. The RMSE it
produced was 30.021 and the R-squared value it produced was 0.847. The
R-squared value usually ranges from 0 to 1 and is often stated as a percentage from
0% to 100%. It measures the proportion of the variance in the actual PM2.5 values
that the regression model explains. This means that the model explains about 84.7%
of the variation in PM2.5 in the test set.
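
The two scores can also be computed by hand. The sketch below uses the standard definitions of RMSE and R-squared, on the assumption that the numeric scorer node follows them. The actual and predicted values are made up for illustration.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between actual and predicted values."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


def r_squared(actual, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up values for illustration only.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

print(round(rmse(actual, predicted), 3))
print(round(r_squared(actual, predicted), 3))
```

A lower RMSE and an R-squared closer to 1 both indicate predictions that are closer to the actual values.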

To improve the accuracy of my model, I configured the partitioning node to use 80% of
the data for the training set used by the linear regression learner, while the other
20% became the test set used by the regression predictor.

After executing the linear regression model again, I obtained an R-squared value of
0.853 and an RMSE of 29.457. Both scores are slightly better than before, which
suggests that the predictions are now slightly more accurate.

To visualise the correlation between the predicted PM2.5 value and the actual
one, I used the CSV writer node to save the new dataset with the predicted PM2.5
values on my computer and then plotted a scatterplot of the predicted PM2.5 against
the actual PM2.5.

We can see that the scatterplot shows a strong relationship between the predicted PM2.5
and the actual PM2.5, which is consistent with the R-squared value. We can also see
that there are some outliers at lower levels of actual PM2.5.

To view the accuracy of the linear regression model further, I plotted a residual plot
which plots the residuals against the predicted PM2.5 level. The residuals are
basically the difference between the actual PM2.5 level and predicted PM2.5 level.
The more accurate the prediction, the closer the residuals will be to zero.

We can see that the residual plot is heteroscedastic, meaning the spread of the
residuals is not constant across the predicted values. However, the trend line shows
that most of the residuals lie close to 0. The red line shows the outliers.
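
As a rough sketch of how the residuals and their spread could be checked numerically, the Python below computes the residuals and compares their standard deviation for the lower and upper half of the predicted values. The helper `spread_by_half` is a hypothetical, informal check rather than a formal statistical test, and the numbers are made up.

```python
from statistics import pstdev


def residuals(actual, predicted):
    """Residual = actual value minus predicted value."""
    return [a - p for a, p in zip(actual, predicted)]


def spread_by_half(predicted, resid):
    """Standard deviation of residuals for the lower and upper half of predictions."""
    pairs = sorted(zip(predicted, resid))
    mid = len(pairs) // 2
    low = [r for _, r in pairs[:mid]]
    high = [r for _, r in pairs[mid:]]
    return pstdev(low), pstdev(high)


# Made-up values for illustration only.
actual = [11.0, 19.0, 32.0, 38.0, 60.0, 95.0, 110.0, 170.0]
predicted = [10.0, 20.0, 30.0, 40.0, 70.0, 80.0, 130.0, 150.0]

resid = residuals(actual, predicted)
low_spread, high_spread = spread_by_half(predicted, resid)
print(resid)
print(round(low_spread, 2), round(high_spread, 2))
```

If the two spreads differ a lot, as they do in this made-up example, the residuals are not evenly scattered, which is what a heteroscedastic residual plot looks like.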

8. Conclusion

I learnt how important it is to analyse air pollutants, as doing so can help
ensure the safety of people and our planet. Analysing such data and gaining
insights can also allow us to tackle the problem better and find ways to reduce
its effects.

I learnt that many of our everyday activities contribute a large share of air
pollutants without us even realising it. Now that I know about these pollutants
and their ill effects, I know what to avoid in order to reduce them.

I learnt that China has higher temperatures in the middle of the year and lower
temperatures at the start and end of the year. This affects the rainfall, air
pressure, ozone levels and carbon monoxide levels. I also learnt how wind can
affect the levels and types of air pollutants, how governments determine whether
a country’s air quality is safe, how rainfall occurs in China and how some
pollutants can make even rain dangerous.

9. Reflection

In terms of software, I now have a deeper understanding of how Tableau and
KNIME work and have learnt to appreciate how they can turn complicated data into
simple visual charts to gain insights from and to make predictions.

I feel that I have now grasped the information learnt in class better through
this project.

One problem I faced during this project was juggling it with other
schoolwork. I also struggled to find data that correlates well enough to form
sophisticated questions with deep meaning behind them. I feel my time
management is not the best, which is why I had these challenges during the
project.

10. References

[1] Influence of Urban-Growth Pattern on Air Quality in China: A Study of 338 Cities (nih.gov)

[2] What Can the Growth of the Beijing Metropolitan Area Teach Us About Cities? | Peak Urban (peak-urban.org)

[3] Who’s at Risk - Air Pollutants and Health Effects (sparetheair.org)

[4] Acid Rain (wustl.edu)

[5] Monsoon of South Asia - Wikipedia

[6] Incomplete Combustion - an overview | ScienceDirect Topics
