
UN Data Analysis Pandas and Matplotlib

Overview
Introduction
UN Datasets
Download CO2 Dataset
Launch Jupyter Notebook
Create Markdown Headings and Text
Import Pandas and Matplotlib
Pandas Read CSV
Analysis Using Pandas
  df.head()
  df.tail()
  df
  df Series
  max
  min
  mean
  Standard Deviation
  count
  df.describe()
  df.dtypes
  df.info()
  Conditional Expressions – Filtering Rows
  shape
  isin
Plotting with Matplotlib
  Plotting the ozdata DataFrame
  Plotting the above_5000000 DataFrame
  Creating and Plotting the ozusdata DataFrame
  df.to_excel
  pd.read_excel

Overview
The exercise will incorporate the following three tools with Python using Anaconda.

1. Pandas: Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation
tool, built on top of the Python programming language. It comes integrated with Anaconda but can also
be downloaded separately at https://pandas.pydata.org/. This website also provides excellent notes and
tutorials. When we read data into Pandas from a CSV file, we have many options for controlling how
the data is imported and preformatted. For more information on each option,
visit: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

2. NumPy: NumPy (Numerical Python) is an open-source Python library that’s used in almost every
field of science and engineering. It’s the universal standard for working with numerical data in Python,
and it’s at the core of the scientific Python and PyData ecosystems. It comes integrated with Anaconda
but can also be downloaded separately. An additional support resource for NumPy is located at
https://numpy.org/doc/stable/index.html. This website also provides excellent notes and tutorials.

3. Matplotlib: Matplotlib is a cross-platform data visualization and graphical plotting library
for Python and its numerical extension NumPy. Matplotlib provides a comprehensive library for
creating static, animated, and interactive visualizations in Python. It comes integrated with Anaconda
but can also be downloaded separately. An additional support resource for Matplotlib is located at
https://matplotlib.org/. This website also provides excellent notes and tutorials.
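
To give a flavour of the read_csv options mentioned under Pandas above, here is a minimal sketch. The sample rows are made up for illustration and only mimic the three-column layout of the UN file; the options shown (usecols, dtype, nrows) are a small selection from the documentation page linked above.

```python
import io
import pandas as pd

# Made-up sample rows in the same three-column layout as the UN file.
csv_text = """Country or Area,Year,Value
Australia,2019,400000.5
Australia,2018,410000.0
Austria,2019,68000.25
"""

# io.StringIO lets read_csv treat the string as if it were a file on disk.
sample = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["Country or Area", "Year", "Value"],  # keep only these columns
    dtype={"Year": "int64", "Value": "float64"},   # set the column types explicitly
    nrows=2,                                       # read only the first 2 data rows
)
print(sample)
```

With a real file, the first argument would simply be the file name instead of the io.StringIO object.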

Introduction
The following exercise provides a complete step-by-step introduction to an integrated exercise
using all three tools. Please follow it carefully. Please note that comments beginning with a hash (#) will be
used as usual to describe what is happening in the Python code.

UN Datasets
We are going to work with a dataset from the UN DataMart: http://data.un.org/Explorer.aspx

Download CO2 Dataset


The specific dataset we are interested in is located under the Greenhouse Gas Inventory Data category and it is
called:

Carbon dioxide (CO2) Emissions without Land Use, Land-Use Change and Forestry (LULUCF), in kiloton
CO2 equivalent: http://data.un.org/Data.aspx?d=GHG&f=seriesID%3aCO2

You can see from the preview that you have one data point (value) per year for each country, and that the
countries are listed alphabetically with each country’s years in chronological order. So let’s download the
dataset as a comma-separated values (CSV) file and save it to our drive. It will download as a compressed
(zipped) folder from which we will extract the CSV file.

After downloading it, open the compressed zipped folder and drag and drop the CSV file outside the folder.
This just makes it easier to access it from Pandas.

Then we will rename it so that it does not have such a long title. I renamed mine UNEnvData, where Env
stands for environmental.

Double click on the CSV file and it will open in Microsoft Excel. Scroll down through the data and observe
how many rows of data are in the file and what the headers are called in the header row (1st row in this
instance). While it is okay to expand the width of the column to see the headers or indeed the data, do not alter
or format the data in any way. It is essential to maintain the integrity of the original data. Once you have
finished reviewing the CSV file, close the file but do not save any changes.

We now need to load the CSV file into Jupyter using the upload button:

Once this is uploaded (it may take a while) we are ready to go.

Launch Jupyter Notebook
We are now ready to work with the data using Pandas and Matplotlib in a Jupyter Notebook. So open Jupyter
and start a new Notebook.

Save the new Notebook as UNEnvDataAnalysisNov2022

We are now ready to commence preparation of our Jupyter Notebook to analyse this particular dataset: Let’s
get started.

Create Markdown Headings and Text


Using a Heading 1 style, input the following title:

UN ENVIRONMENTAL DATA ANALYSIS NOVEMBER 2022

Using a Heading 2 style, input the following subtitle substituting my name/student number for yours:

Aidan Duane 1234567

Using a Heading 3 style, input today’s date as follows:

November 17th, 2022

It should look like the following:

Now, using a Heading 4 and a Paragraph style, write a description about the data, data source, and purpose of
this analysis. For example:

Introduction:

The following analysis is based on a dataset retrieved from the UN Datamart. The dataset is entitled,
Carbon dioxide (CO2) Emissions without Land Use, Land-Use Change and Forestry (LULUCF), in
kiloton CO2 equivalent.

The dataset provides individual country data with a single data point for each year. The dataset extends
from 1990 to 2019 and it is ordered in alphabetical and chronological order. The file contains 1291
rows: the first row holds the headers and the remaining 1290 rows hold the data. There are three headers
entitled, Country or Area, Year, and Value. The Value represents a single annual CO2 data point for each country.

Your Notebook should now look as follows:

Import Pandas and Matplotlib
We are now ready to import the data as a CSV into Jupyter using Pandas, with further analysis using Matplotlib.
However, first we must import both the Pandas and Matplotlib modules as follows. Type:

import pandas as pd

import matplotlib.pyplot as plt

# this configuration ensures that we have good quality scalable vector graphics (svg)

%config InlineBackend.figure_format = 'svg'

The Jupyter Notebook should look as follows:

Pandas Read CSV
A Pandas DataFrame is a 2-dimensional labelled data structure with columns of potentially different types. You
can think of it like a spreadsheet or SQL table, or a Python dict {dictionary} of Series objects. It is generally the
most commonly used Pandas object.

The way Pandas works is that it imports the data from a CSV file into a DataFrame (df). Thus we type:

df = pd.read_csv('unenvdata.csv')

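This command reads the data from the CSV file and creates a DataFrame object in our Notebook containing all
the data. The file name passed to read_csv must match the name of the uploaded file exactly, as file names are
case-sensitive on some systems.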

Analysis Using Pandas
Now that we have a DataFrame object (df) created in our Notebook, we can begin to analyse the data contained
in that DataFrame.

df.head()

This displays the first 5 rows of the df (note that it adds an index column to the output by default: 0, 1, 2, 3, etc.).
5 rows is the default, so if we want more we just specify the number, for example df.head(10).

df.tail()

This displays the last 5 rows of the df.
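
As a small sketch of how the row count can be changed, both head() and tail() accept the number of rows to return. The DataFrame below is a made-up stand-in (the name demo and its values are purely illustrative), not the UN data.

```python
import pandas as pd

# Hypothetical values, for illustration only.
demo = pd.DataFrame({
    "Year": list(range(1990, 2000)),
    "Value": [float(v) for v in range(10)],
})

first_three = demo.head(3)  # first 3 rows instead of the default 5
last_two = demo.tail(2)     # last 2 rows instead of the default 5

print(first_three)
print(last_two)
```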

df

Typing df on its own displays a snapshot of the df: the first and last rows, with the rows in between omitted.

df Series

Or we can call each data series (column) one by one. To select a single column, use square brackets [] with the
column name of the column of interest. Each column in a DataFrame is a Series. As a single column is selected,
the returned object is a Pandas Series. We can even confirm that each column is a Series by typing:

type(df["Value"])

So if we want to look at each Series (column) we just call them using square brackets as follows:

df["Year"]

If we want the Country or Area column, we type:

df["Country or Area"]

And if we want the actual Value column from the Dataframe, we type:

df["Value"]
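
The difference between selecting one column and selecting several can be sketched as follows. The sample DataFrame is made up; the point is only that single brackets with one name return a Series, while a list of names inside the brackets returns a DataFrame.

```python
import pandas as pd

# Made-up rows in the same layout as the UN file.
demo = pd.DataFrame({
    "Country or Area": ["Australia", "Australia"],
    "Year": [2019, 2018],
    "Value": [400000.5, 410000.0],
})

one_column = demo["Value"]             # one column name -> Series
two_columns = demo[["Year", "Value"]]  # list of column names -> DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
```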

max

Let’s say we want to know the highest CO2 value from the Value column (Series):

df["Value"].max()

min

or the lowest CO2 value from the Value column (Series):

df["Value"].min()

mean

or the average (arithmetic mean) CO2 value from the Value column (Series):

df["Value"].mean()

Standard Deviation

or the standard deviation from the Value column (Series):

df["Value"].std()

count

or count the number of values in the Value column (Series):

df["Value"].count()
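
The methods above return a single number, but not the row it came from. One possible way to find out which country and year hold the maximum is idxmax() combined with loc, sketched here on made-up values (the figures are not taken from the UN data).

```python
import pandas as pd

# Made-up values, for illustration only.
demo = pd.DataFrame({
    "Country or Area": ["Australia", "Austria", "Belgium"],
    "Year": [2019, 2019, 2019],
    "Value": [400000.0, 68000.0, 99000.0],
})

highest = demo["Value"].max()
row_label = demo["Value"].idxmax()  # index label of the row holding the maximum
top_row = demo.loc[row_label]       # the full row for that label

print(top_row["Country or Area"], top_row["Year"], highest)
```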

df.describe()

We can also use the df.describe() method which provides a quick overview of the numerical data in a
DataFrame.

df.describe()

However, the statistics reported for the Year column are of little use, as the mean, std, etc., of a year have no
meaningful interpretation.
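
One way to avoid the unhelpful Year statistics is to call describe() on the Value column only. This is a sketch on made-up numbers; with the real data the same call would be made on df.

```python
import pandas as pd

# Made-up values, for illustration only.
demo = pd.DataFrame({
    "Year": [1990, 1991, 1992, 1993],
    "Value": [10.0, 20.0, 30.0, 40.0],
})

# describe() on a single Series summarises just that column.
summary = demo["Value"].describe()
print(summary)
```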

df.dtypes

We can also check out what the Data Types are of each data series within the DataFrame.

df.dtypes

df.info()

We can also look at particular information about the dataset using

df.info()

The method info() provides technical information about a DataFrame, so let’s explain the output in more detail:

• It is indeed a DataFrame.
• There are 1290 entries, i.e. 1290 rows.
• Each row has a row label (aka the index) with values ranging from 0 to 1289.
• The table has 3 columns. All columns have a value for each of the rows (all 1290 values are non-null).
• None of the columns has missing values; a column with missing values would show fewer than 1290
non-null values.
• The column Country or Area consists of textual data (strings, aka object). The other columns are
numerical data with some of them whole numbers (aka int64) and others are real numbers (aka float64).
• The kinds of data in the different columns are summarized by listing the dtypes and the number of
columns of each type in the DataFrame.
• RAM usage of the DataFrame is approximately 30.4 KB.
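
The non-null counts reported by info() can also be checked directly. The sketch below uses a made-up DataFrame in which one value has been deliberately left missing, so that the effect is visible; isna().sum() counts the missing values per column.

```python
import pandas as pd

# Made-up rows; one Value is deliberately missing.
demo = pd.DataFrame({
    "Country or Area": ["Australia", "Austria", "Belgium"],
    "Year": [2019, 2019, 2019],
    "Value": [400000.0, None, 99000.0],
})

missing_per_column = demo.isna().sum()  # number of missing values in each column
print(missing_per_column)
```

On a DataFrame with no missing values, every count would be 0.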

Conditional Expressions – Filtering Rows

To select rows based on a conditional expression, use a condition inside the selection brackets []. The condition
df["Value"] > 5000000 checks for which rows the Value column has a value larger
than 5 million:

df["Value"] > 5000000

The output of the conditional expression (>, but also ==, !=, <, <=, etc. would work) is actually a pandas Series
of Boolean values (either True or False) with the same number of rows as the original DataFrame.

Such a Series of Boolean values can be used to filter the DataFrame by putting it in between the selection
brackets []. Only rows for which the value is True will be selected. So, if we want to keep only specific rows, we
can store the filtered result in a new DataFrame. We will call this DataFrame above_5000000:

above_5000000 = df[df["Value"] > 5000000]
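
Conditions can also be combined. As a sketch on made-up data (the country names and thresholds are hypothetical), two Boolean Series are joined with & for "and"; each condition needs its own parentheses.

```python
import pandas as pd

# Made-up rows, for illustration only.
demo = pd.DataFrame({
    "Country or Area": ["Country A", "Country A", "Country B", "Country B"],
    "Year": [1990, 2019, 1990, 2019],
    "Value": [100.0, 600.0, 700.0, 50.0],
})

# True only where BOTH conditions hold for the row.
mask = (demo["Value"] > 500) & (demo["Year"] >= 2000)
recent_high = demo[mask]

print(mask.tolist())
print(recent_high)
```

Using | in place of & would select rows where either condition holds.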

Now we can check what data has been extracted from the original DataFrame into our new DataFrame called
above_5000000 and we will look at the first 50 rows:

above_5000000.head(50)

Despite requesting the first 50 rows of data, we only get 30 rows of output. This is because only one
country has exceeded 5000000 kilotons of CO2 equivalent, the USA, and it has done so in every one of the 30
years covered! Remember that our UN dataset only extends from 1990 to 2019.

shape

We know from before that the original DataFrame (before we filtered it >5 million) consists of 1290 rows (0-
1289). Let’s have a look at the number of rows which satisfy the condition by checking the shape attribute of
the resulting DataFrame above_5000000:

above_5000000.shape

It tells us that the above_5000000 DataFrame contains 30 rows of data across 3 Series (columns).

So, if we were not convinced that only 30 rows meet the conditional expression, the shape (30, 3) confirms it.
This is consistent with the US being the only country above 5000000 kilotons of CO2 equivalent, in each of the
30 years from 1990 to 2019!
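
The shape gives the number of rows, but not which countries they belong to. One possible extra check is to list the distinct country names in the filtered DataFrame with unique() and count them with nunique(). The sketch uses made-up data and a made-up threshold.

```python
import pandas as pd

# Made-up rows, for illustration only.
demo = pd.DataFrame({
    "Country or Area": ["Country A", "Country A", "Country B"],
    "Year": [1990, 1991, 1990],
    "Value": [600.0, 650.0, 100.0],
})

above = demo[demo["Value"] > 500]
countries = above["Country or Area"].unique()     # distinct names in the filtered rows
n_countries = above["Country or Area"].nunique()  # how many distinct names there are

print(countries, n_countries)
```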

isin

We can use isin to extract data from the main DataFrame into a new DataFrame where it meets certain
conditions. For example, we can use it to extract all of the data for Australia.

ozdata = df[df["Country or Area"].isin(['Australia'])]

ozdata
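
For a single country, isin with a one-item list gives the same rows as a plain == comparison; isin becomes more useful when several values are allowed. A small sketch on made-up rows:

```python
import pandas as pd

# Made-up rows, for illustration only.
demo = pd.DataFrame({
    "Country or Area": ["Australia", "Austria", "Australia"],
    "Year": [1990, 1990, 1991],
    "Value": [10.0, 5.0, 11.0],
})

only_oz_isin = demo[demo["Country or Area"].isin(["Australia"])]
only_oz_eq = demo[demo["Country or Area"] == "Australia"]

# equals() compares the two results row by row and column by column.
same = only_oz_isin.equals(only_oz_eq)
print(same)
```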

Plotting with Matplotlib
We can, of course, also use Matplotlib to plot the ozdata DataFrame so that we have a graphical perspective of
our data. It is very easy to create a quick and dirty chart with the bare minimum of instructions to Matplotlib.

Plotting the ozdata DataFrame

Sometimes, depending on how the data is formatted in the DataFrame (more on this later), we can quickly plot
data as follows:

ozdata.plot()

plt.show()

However, you can see that although the chart is generated, it is not entirely correct as the x axis does not show
the year on the label, instead it is picking up on the Index column from the DataFrame. So what we really need
to do is to instruct Matplotlib what the x and y axes are by declaring them prior to plotting:

plt.figure()

x = ozdata['Year']

y = ozdata['Value']

plt.plot(x, y)
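
As an alternative sketch to declaring x and y separately, the plot method of a DataFrame accepts the column names directly through its x and y parameters. The data below is made up, and the variable names are only illustrative.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up values, for illustration only.
demo = pd.DataFrame({
    "Year": [1990, 1991, 1992],
    "Value": [100.0, 110.0, 105.0],
})

# Tell pandas which column goes on which axis; it returns a Matplotlib Axes object.
ax = demo.plot(x="Year", y="Value")
ax.set_xlabel("Year")
ax.set_ylabel("Value")

line = ax.get_lines()[0]  # the single line that was drawn
```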

As you can see this makes the figure much clearer. However, we still need to add some labels to the chart as,
without them, it is difficult to know what each axis represents. We will also add a few extra things to our chart:

• plt.title – this adds a title to the chart
• plt.xlabel – this adds an x axis label to the chart
• plt.ylabel – this adds a y axis label to the chart
• color – this changes the default color to whatever you choose
• fontsize – this changes the size of the font

So let’s replot the figure with these settings:

plt.figure()

x = ozdata['Year']

y = ozdata['Value']

plt.title('Australian CO2 Values 1990-2019', color='r', fontsize=20)

plt.xlabel('Year', color='b', fontsize=14)

plt.ylabel('CO2 Values', color='g', fontsize=14)

plt.plot(x, y)

Plotting the above_5000000 DataFrame

We can also go back to our US data in the above_5000000 DataFrame and plot that too. If the data doesn’t tell
you the story, perhaps a graph can.

plt.figure()

x = above_5000000['Year']

y = above_5000000['Value']

plt.title('USA CO2 Values 1990-2019', color='r', fontsize=20)

plt.xlabel('Year', color='b', fontsize=14)

plt.ylabel('CO2 Values', color='g', fontsize=14)

plt.plot(x, y)

The interesting thing in this chart is that, although it shows that US CO2 emissions are very high, above 5
million kilotons in every one of the 30 years, they have fallen from a peak of over 6 million kilotons. Still,
neither dataset is environmentally positive, is it!

Creating and Plotting the ozusdata DataFrame

We can also create a new DataFrame which combines both the Australian and US data – let’s create it first and
call the DataFrame ozusdata.

ozusdata = df[df["Country or Area"].isin(['Australia', 'United States of America'])]

and we can then call it:

ozusdata

Unfortunately, plotting this data exactly as it is presented in the DataFrame gives a misleading result, because
the rows for both countries are passed to plt.plot as one continuous series:

plt.figure()

x = ozusdata['Year']

y = ozusdata['Value']

plt.title('Australia vs USA CO2 Values 1990-2019', color='r', fontsize=20)

plt.xlabel('Year', color='b', fontsize=14)

plt.ylabel('CO2 Values', color='g', fontsize=14)

plt.plot(x, y)

As we can see, this chart is utterly meaningless! So let us rethink what we are doing here. We could of course
make complex changes to our code to sort all of this out and we could of course choose a different chart type
etc., and all of those are valid options. However, let us choose a different path to make life a bit easier for
ourselves by harnessing the power of the Pivot Table in Microsoft Excel to restructure our data in the CSV file!
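
For comparison only, and not the route this exercise takes, the same kind of restructuring can be sketched in pandas itself with pivot, which turns each country into its own column so that each one is drawn as a separate line. The values below are made up and the variable names are illustrative.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up values, for illustration only.
demo = pd.DataFrame({
    "Country or Area": ["Australia", "Australia",
                        "United States of America", "United States of America"],
    "Year": [1990, 1991, 1990, 1991],
    "Value": [10.0, 11.0, 50.0, 49.0],
})

# One row per Year, one column per country.
wide = demo.pivot(index="Year", columns="Country or Area", values="Value")

ax = wide.plot()  # one line per column, i.e. per country
ax.set_xlabel("Year")
ax.set_ylabel("CO2 Values")
```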

df.to_excel

Let’s say someone wants you to email them the DataFrame in a Microsoft Excel spreadsheet format
and let’s say they want the sheet named co2. To output a DataFrame as such, type:

df.to_excel("unco2data.xlsx", sheet_name="co2", index=False)

By setting index=False the row index labels are not saved in the spreadsheet. Now return to your folder
directory to see if the new Excel file has been created. We should see it alongside our Jupyter Notebook we
created earlier as follows:

Tick the box beside the xlsx file. We can now download it:

And open it:

You will see that the sheet name is co2 and you will see that the index column from our DataFrame has been
removed as it is not necessary in MS Excel.
pd.read_excel

The equivalent function for reading from an Excel file, and from a particular sheet in that file, is read_excel().
It reloads the data into a DataFrame, but let’s not do it just yet:

df = pd.read_excel("unco2data.xlsx", sheet_name="co2")
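
The write and read steps can be tried together without touching any file, as sketched below. This assumes an Excel engine such as openpyxl is installed (pandas relies on one for .xlsx files); the in-memory buffer and the made-up values are stand-ins for unco2data.xlsx and the real data.

```python
import io
import pandas as pd

# Made-up values, for illustration only.
demo = pd.DataFrame({"Year": [1990, 1991], "Value": [10.5, 11.5]})

buffer = io.BytesIO()  # in-memory stand-in for an .xlsx file

try:
    demo.to_excel(buffer, sheet_name="co2", index=False)
    buffer.seek(0)  # rewind so the data can be read back from the start
    restored = pd.read_excel(buffer, sheet_name="co2")
    columns_match = list(restored.columns) == list(demo.columns)
except ImportError:
    # Raised by pandas when no Excel engine (e.g. openpyxl) is available.
    columns_match = None

print(columns_match)
```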

For now, let’s just save, close and halt the file.
