Pollution in Seoul Project Report R Project

POLLUTION IN SEOUL
Lexi Hanna
May 1st, 2020
Abstract
Seoul, South Korea, ranks 37th among the world’s most polluted cities. Air pollution is caused
by many factors, including (but not limited to): use/burning of fossil fuels, mass agriculture, factory/industry
exhaust, mining, and household sources. Burning fossil fuels, for example in transportation and factory
operations, emits sulfur dioxide (SO2) and carbon monoxide (CO). Mass agriculture releases ammonia (NH3)
into the air and into water sources. Exhaust from factories and industries releases carbon monoxide (CO)
and other pollutants that degrade air quality. Mining can expose a plethora of chemicals that pollute the
air. Indoor air pollution is mainly caused by cleaning and painting products. Particulate matter is produced
in part by many of these sources, but mainly by dust and combustion.
Air pollution can cause respiratory issues, global warming, acid rain, eutrophication, and depletion of the
ozone layer. Due to these negative effects, reducing the amount of pollution is essential for continued
development. In order to do that, however, it is first necessary to identify the most prevalent pollutants in
the air. In this project, I examine the levels of different compounds that cause pollution from a statistical
viewpoint. The data used in this project were collected hourly from 2017 to 2019 by the Air Quality Analysis
Center in Seoul, South Korea. I started my analysis by removing outliers and data points outside the realm of
possibility. I then checked for missing values. After that, I created a random sample to slim the data down
to a workable set of observations, and I created a new variable that shows the severity of the pollution,
using the legend provided in a separate file. I then made scatterplots to visualize the
relationship between the various pollutants and time.
Table of Contents
Introduction
Data Description
Filters and Subsets
Main Pollutants
Numerical Charts
Categorical Charts
Conclusions and Further Studies
References
Introduction
After World War II, South Korea began to move from an agrarian society to an industrial one. The
Korean War greatly accelerated the country’s industrialization. During the 1980s and 1990s, the economy of
South Korea grew 10% each year, and the country is now an international industrial powerhouse. During the 1970s,
the environment took a backseat to the nation’s economic development. Since then, the country has
enacted many environmentally focused laws, but its air quality has remained among the worst
globally. In 2017, coincidentally the first year of data collection for this dataset, South Korea was named the
second worst for air quality among the advanced nations of the Organization for Economic Cooperation and
Development (Smith). In this project I analyzed a dataset of pollutant levels collected hourly from 2017 to
2019 from 25 different districts in Seoul, South Korea. The analysis is organized as follows. The Data Description
section covers the variables in the data, what each of them means, and what each
observation represents. The Filters and Subsets section explains how I trimmed the data down to the
sample I used for the analysis. The Main Pollutants section highlights the most prevalent pollutants, the
Numerical Charts section shows relevant numerical graphs, and the Categorical Charts section shows relevant
categorical charts. Lastly, the Conclusions and Further Studies section presents the findings as well as topics to
delve into further.
Data Description
Dataset: This dataset, called Air Pollution in Seoul, was downloaded from Kaggle. It was
collected every hour between 2017 and 2019 at 25 different locations and provides readings for multiple
pollutants.
Variables: There are 11 variables, described as follows:
Measurement Date: Provides the date and time the measurement was taken.
Station Code: Identifies which of the 25 stations is being sampled.
Address: Identifies the location of the station being sampled.
Latitude: In degrees, exact latitude of the address.
Longitude: In degrees, exact longitude of the address.
SO2: In ppm, the average value of sulfur dioxide over the hour. Blue on the legend is
0.020, green on the legend is 0.050, yellow on the legend is 0.150, and red on the
legend is 1.000. Data are given to 3 decimal places.
NO2: In ppm, the average value of nitrogen dioxide over the hour. Blue on the legend is
0.030, green on the legend is 0.060, yellow on the legend is 0.200, and red on the
legend is 2.000. Data are given to 3 decimal places.
O3: In ppm, the average value of ozone over the hour. Blue on the legend is 0.030,
green on the legend is 0.090, yellow on the legend is 0.150, and red on the
legend is 0.500. Data are given to 3 decimal places.
CO: In ppm, the average value of carbon monoxide over the hour. Blue on the legend
is 2.000, green on the legend is 9.000, yellow on the legend is 15.000, and red on
the legend is 50.000. Data are given to 1 decimal place.
PM10: In microgram/m3, the average value of particulate matter less than 10 μm over the hour. Blue on
the legend is 30.000, green on the legend is 80.000, yellow on the legend is
150.000, and red on the legend is 600.000. Data are given with no decimal places.
PM2.5: In microgram/m3, the average value of particulate matter less than 2.5 μm over the hour. Blue on
the legend is 15.000, green on the legend is 35.000, yellow on the legend is
75.000, and red on the legend is 500.000. Data are given with no decimal places.
Observations: The dataset originally contained 647,511 rows of observations. Each row is an hourly
measurement of air pollution at one of 25 stations in different districts of Seoul, South Korea, taken
between 2017 and 2019.
Filters and Subsets

The only variables that needed filtering were the numerical pollutant variables. Each pollutant
had its own range of valid observations, but one boundary was common to all of them: a reading cannot be
negative, so negative observations had to be excluded. Missing values would cause issues for
reporting, so any that turned up were to be recorded and thrown out. Outliers were also assessed and thrown out.
Nonsense Variables: Because a pollutant reading cannot be negative, negative values had to be deleted
from the dataset. I first counted how many there were per pollutant to see how many
observations would be thrown out; they accounted for less than 4% of the data. Those observations were
then thrown out.
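The report does not include its code, so the following is only a minimal Python sketch of this step (the project itself was done in R). It assumes each observation is a dictionary keyed by pollutant name and that a row is dropped if any of its readings is negative; the sample values are made up for illustration.

```python
POLLUTANTS = ["SO2", "NO2", "O3", "CO", "PM10", "PM2.5"]


def count_negatives(rows):
    """Count, per pollutant, how many readings are below zero."""
    return {p: sum(1 for row in rows if row[p] < 0) for p in POLLUTANTS}


def drop_negative_rows(rows):
    """Keep only rows in which every pollutant reading is non-negative."""
    return [row for row in rows if all(row[p] >= 0 for p in POLLUTANTS)]


# Tiny illustrative sample (made-up values, not taken from the dataset).
sample = [
    {"SO2": 0.004, "NO2": 0.059, "O3": 0.002, "CO": 1.2, "PM10": 73, "PM2.5": 57},
    {"SO2": -1.0, "NO2": -1.0, "O3": -1.0, "CO": -1.0, "PM10": -1, "PM2.5": -1},
    {"SO2": 0.005, "NO2": 0.040, "O3": 0.010, "CO": 0.9, "PM10": -1, "PM2.5": 30},
]

negatives = count_negatives(sample)      # how many readings would be lost
clean = drop_negative_rows(sample)       # rows that survive the filter
share_removed = 1 - len(clean) / len(sample)
```

Counting before dropping mirrors the order described above: the share of rows removed can be checked before anything is thrown out.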
Missing Values: I searched the dataset for missing values and found none, so I moved on to the
outlier assessment.
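A missing-value check of this kind might look like the Python sketch below. The definition of "missing" used here (None, an empty string, or NaN) is an assumption, since the report does not say how missing values were detected, and the rows are made-up examples.

```python
import math


def is_missing(value):
    """Treat None, empty strings, and NaN as missing (an assumed definition)."""
    if value is None or value == "":
        return True
    return isinstance(value, float) and math.isnan(value)


def count_missing(rows, columns):
    """Count missing entries per column across all rows."""
    return {c: sum(1 for row in rows if is_missing(row.get(c))) for c in columns}


# Made-up rows for illustration only.
rows = [
    {"SO2": 0.004, "PM10": 73},
    {"SO2": None, "PM10": 40},
    {"SO2": 0.006, "PM10": float("nan")},
]
missing = count_missing(rows, ["SO2", "PM10"])
```

If every count is zero, as the report found for this dataset, no rows need to be removed at this stage.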
Outliers: For this section, I split the dataset into one subset per pollutant. I did this in order to preserve
the other pollutants’ observations that were not outliers, as an outlier in one pollutant does not invalidate the
correct reading of another. I then identified and excluded outliers within each subset.
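The report does not state which rule was used to identify outliers. The Python sketch below assumes the common boxplot rule (values more than 1.5 times the interquartile range beyond the quartiles) and applies it to each pollutant separately, so a spike in one pollutant does not discard the other readings taken in the same hour. The function names and readings are hypothetical.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def trim_outliers_per_pollutant(rows, pollutants):
    """Build one list of readings per pollutant and trim each independently."""
    trimmed = {}
    for p in pollutants:
        values = [row[p] for row in rows]
        low, high = iqr_bounds(values)
        trimmed[p] = [v for v in values if low <= v <= high]
    return trimmed


# Made-up readings: the PM10 value in the last row is a spike, its SO2 value is not.
rows = [{"SO2": 0.003 + 0.001 * i, "PM10": 10 * (i + 1)} for i in range(8)]
rows.append({"SO2": 0.011, "PM10": 900})
result = trim_outliers_per_pollutant(rows, ["SO2", "PM10"])
```

In this example the last row loses its PM10 reading but keeps its SO2 reading, which is the reason for splitting by pollutant before trimming.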
Random Sampling: Because the dataset is so large, I decided that sampling was necessary in order
to analyze the data effectively. I took a random sample of 10,000 observations from each of the individual
datasets resulting from the previous filtering, which gave me 6 separate sets of data with 10,000
observations each to work with.
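As an illustration, sampling without replacement from each per-pollutant set could be sketched in Python as follows. The fixed seed and the stand-in data are assumptions added here so the draw can be repeated; the report does not say whether a seed was set.

```python
import random


def sample_each(sets_by_pollutant, k=10_000, seed=2020):
    """Draw up to k observations without replacement from each pollutant's set."""
    rng = random.Random(seed)  # fixed seed so the same sample can be drawn again
    return {
        pollutant: rng.sample(values, min(k, len(values)))
        for pollutant, values in sets_by_pollutant.items()
    }


# Stand-in data: integers take the place of the filtered readings.
filtered = {"PM10": list(range(50_000)), "SO2": list(range(30_000))}
samples = sample_each(filtered)
```

Each pollutant ends up with its own sample of 10,000 observations, matching the one-set-per-pollutant layout described above.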
Color Variable: The legend provided with the dataset gives each pollutant its own threshold values for low to
high levels of pollution. Blue is considered good, green normal, yellow bad, and red very bad.
Each observation in the random samples was assigned a color based on this scale.
Pollutant   Unit of measurement   Good (Blue)   Normal (Green)   Bad (Yellow)   Very bad (Red)
SO2         ppm                   0.02          0.05             0.15           1
NO2         ppm                   0.03          0.06             0.2            2
CO          ppm                   2             9                15             50
O3          ppm                   0.03          0.09             0.15           0.5
PM10        Microgram/m3          30            80               150            600
PM2.5       Microgram/m3          15            35               75             500
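The color assignment can be sketched in Python as a threshold lookup. The limits are copied from the legend table above. Treating each limit as an inclusive upper bound, and labeling anything above the yellow limit as red, are assumptions about how the legend is meant to be read.

```python
from bisect import bisect_left

# Upper limits for Blue, Green, and Yellow, copied from the legend table.
LEGEND = {
    "SO2": (0.02, 0.05, 0.15),
    "NO2": (0.03, 0.06, 0.2),
    "CO": (2, 9, 15),
    "O3": (0.03, 0.09, 0.15),
    "PM10": (30, 80, 150),
    "PM2.5": (15, 35, 75),
}
COLORS = ("Blue", "Green", "Yellow", "Red")


def color_for(pollutant, value):
    """Return the legend color for one reading of the given pollutant."""
    # bisect_left counts how many limits lie strictly below the value,
    # so a reading equal to a limit stays in the lower category.
    return COLORS[bisect_left(LEGEND[pollutant], value)]


label = color_for("PM2.5", 57)  # 57 is above 35 and at most 75
```

For example, a PM2.5 reading of 57 falls between the green limit (35) and the yellow limit (75), so it is labeled Yellow.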
Main Pollutants
PM10: After I examined the frequencies of the pollutant levels, PM10 ranked second among the pollutants most
prevalent in the air at the highest levels. Particulate matter in the atmosphere comes from many different
sources, but mainly from combustion. The difference between PM10 and PM2.5 is size: PM10 particles are larger. This
pollutant had a couple of yellow readings and a lot of green.
PM2.5: After I examined the frequencies of the pollutant levels, PM2.5 ranked first among the pollutants most
prevalent in the air at the highest levels. PM2.5 particles are smaller than PM10 particles and are therefore considered
more dangerous, as they can more easily enter the body. This pollutant had mostly yellow readings and a few green.
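One way to turn the color labels into a ranking is to compare, for each pollutant, the share of readings labeled yellow or red. The Python sketch below does that. The ranking criterion is an assumption, since the report does not spell out how "most prevalent at the highest levels" was measured, and the label lists are invented for illustration.

```python
from collections import Counter

COLORS = ("Blue", "Green", "Yellow", "Red")


def color_shares(colors):
    """Return the fraction of readings that fall in each color category."""
    counts = Counter(colors)
    total = len(colors)
    return {c: (counts[c] / total if total else 0.0) for c in COLORS}


def rank_by_bad_share(colors_by_pollutant):
    """Order pollutants by their combined share of Yellow and Red readings."""
    def bad_share(pollutant):
        shares = color_shares(colors_by_pollutant[pollutant])
        return shares["Yellow"] + shares["Red"]
    return sorted(colors_by_pollutant, key=bad_share, reverse=True)


# Invented labels, not results from the report.
labels = {
    "SO2": ["Blue"] * 10,
    "PM10": ["Green"] * 8 + ["Yellow"] * 2,
    "PM2.5": ["Yellow"] * 7 + ["Green"] * 3,
}
ranking = rank_by_bad_share(labels)
```

With these invented labels the ranking places PM2.5 first and PM10 second, the same ordering the report describes.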
Numerical Charts
I first made scatterplots of the data, with the Nonsense Variables removed, to see what I was
working with.
I then made boxplots of the data with the outliers still included to see how the dataset would change
once they were removed.
Categorical Charts
These are categorical charts for the results of this analysis.
Conclusions and Further Studies
In this project, I examined 647,511 observations and concluded that both types of particulate matter
(PM10 and PM2.5) were the pollutants most prevalent in the air at dangerous levels. I reached this
conclusion through filtering, sampling, and the creation of a new variable. The filtering slimmed
down the data by throwing out Nonsense Variables, Outliers, and Missing Values. The sampling slimmed the
data further by taking a random sample. The new variable provided an insightful way of
understanding the pollutant level observations, and through frequency box plots I was able to conclude
which of the pollutants were at the most harmful levels.
This project could be extended by analyzing the dates on which the data fluctuate and by combining this
dataset with an outside source of temperature readings in Seoul at the 25 stations for those specific hours to
see whether temperature has any correlation with the pollution levels. A simple scatterplot shows that the
levels fluctuate over time, which could have something to do with the temperature at that time.
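As a rough sketch of that follow-up, pollution and temperature readings could be matched on station and hour and then summarized with a Pearson correlation coefficient. Everything in the Python example below is hypothetical: the station code, timestamps, and values are made up, and no temperature dataset is named in this report.

```python
from math import sqrt


def join_on_station_hour(pollution, temperature):
    """Pair readings that share the same (station code, timestamp) key."""
    return [(pollution[key], temperature[key]) for key in pollution if key in temperature]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up PM2.5 readings and temperatures keyed by (station code, hour).
pm25 = {
    (101, "2017-01-01 00:00"): 57,
    (101, "2017-01-01 01:00"): 59,
    (101, "2017-01-01 02:00"): 61,
    (102, "2017-01-01 00:00"): 40,  # no matching temperature, so it is skipped
}
temp_c = {
    (101, "2017-01-01 00:00"): -1.0,
    (101, "2017-01-01 01:00"): 0.0,
    (101, "2017-01-01 02:00"): 1.0,
}

pairs = join_on_station_hour(pm25, temp_c)
r = pearson([p for p, _ in pairs], [t for _, t in pairs])
```

A coefficient near 1 or -1 would suggest a strong linear relationship; it would not by itself show that temperature causes the change in pollution.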
References
“Air Quality and Pollution City Ranking.” IQAir, 10 Apr. 2020, www.iqair.com/us/world-air-quality-ranking.
Bappe. “Air Pollution in Seoul.” Kaggle, 3 Apr. 2020, www.kaggle.com/bappekim/air-pollution-in-seoul.
Rinkesh. “Causes, Effects and Solutions of Air Pollution.” Conserve Energy Future, 13 Apr. 2019,
www.conserve-energy-future.com/causes-effects-solutions-of-air-pollution.php.
“Seoul Air Pollution: Real-Time Air Quality Index (AQI).” Aqicn.org, 11 Apr. 2020,
aqicn.org/city/seoul/.
Smith, Brett. “South Korea: Environmental Issues, Policies and Clean Technology.”
AZoCleantech.com, AZO Cleantech, 8 Aug. 2019, www.azocleantech.com/article.aspx?ArticleID=552.
“열린데이터 광장 댓글 입력 [Open Data Plaza Comment Input].” 서울열린데이터광장 [Seoul Open Data Plaza],
data.seoul.go.kr/dataList/OA-15515/S/1/datasetView.do#AXexec.