JUST Data Science Chapter 5 Pandas and Matplotlib

Jamhuriya University of Science & Technology (JUST)
CA416 - Principles of Data Science

Chapter 5
Practical Data Exploration and Visualization with the Pandas and Matplotlib Packages
Lecturer: XYZ
Learning outcomes
By the end of this lecture, you will be able to:
• Describe core data analysis concepts, including dataframes and data exploration.
• Create and access the main Pandas data structures in Python: series and dataframes.
• Perform exploratory data analysis in Python using the Pandas library.
• Build data visualizations with the Matplotlib library.
Data science workflow – recap
Source: https://www.dataquest.io/blog/what-is-data-science/
Types of variables
A variable is any characteristic or attribute that can be measured quantitatively or qualitatively.
Variable
  Numeric
    Continuous: time, age, etc.
    Discrete: number of people, number of houses, etc.
  Categorical
    Ordinal: grades, ratings, etc.
    Nominal: nationality, race, etc.
Exploratory data analysis

• Exploratory Data Analysis (EDA) is an initial exploration of the data to understand its characteristics, patterns, and correlations, and to identify any anomalies in the data.
• Statistical data summarization and graphical visualization are the two primary forms of EDA.
• EDA is typically applied before formal data modeling and helps inform the development of appropriate statistical or machine learning models.
Pandas
• Data analysis is normally performed on data stored in a tabular format, e.g., an Excel spreadsheet.
• Each observation is recorded in a row and each of its attributes in a column (e.g., students and their grades in each assignment).
• Pandas is a Python library for manipulating tabular data and ships with the Anaconda Python distribution.
• Compared with a spreadsheet, data manipulation in Pandas can be much more varied, can be programmed more easily, and can be performed more efficiently, which is critical in large-scale projects.
Pandas
• Pandas is a high-level library built on NumPy, providing tools that make it easier to work with real-world data:
– load data from a variety of sources (e.g., CSV, JSON, SQL)
– update (add, modify, delete, etc.) data
– select subsets of the data
– group data by a certain criterion
– clean the data and handle missing values (NaNs)
– visualize the data using different plotting tools
– perform statistical analysis of the data
– export the data to other file formats or databases
• Pandas provides two main data structures: the DataFrame (equivalent to a spreadsheet) and the Series (equivalent to a column in a spreadsheet).
Pandas Series
• A Series is just a column in a dataframe or spreadsheet.
• A Pandas series with indices is created from a list.
• The series attributes can also be accessed separately.
• Notice that the values are just NumPy arrays.
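The original slide showed this step as a screenshot. A minimal sketch of the same idea follows; the numbers are placeholders chosen for illustration, and the variable name `population` is an assumption.

```python
import numpy as np
import pandas as pd

# Placeholder figures for illustration only, not official statistics.
population = pd.Series([670_000, 1_650_000, 720_000])

print(population)          # values shown next to an automatic integer index
print(population.index)    # RangeIndex(start=0, stop=3, step=1)
print(population.values)   # the underlying NumPy array

is_numpy = isinstance(population.values, np.ndarray)
print(is_numpy)            # True
```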
Operations on series data
• Please read the Pandas documentation, listed on the references slide, for a detailed list of the statistical functions applicable to series.
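As a rough illustration of such operations, the sketch below applies a few common statistical methods to a small series. The values are arbitrary and the selection of methods is an assumption about what the slide demonstrated.

```python
import pandas as pd

values = pd.Series([10, 20, 30, 40])   # arbitrary example values

total = values.sum()       # 100
average = values.mean()    # 25.0
largest = values.max()     # 40
middle = values.median()   # 25.0

# Arithmetic is applied element by element, as with NumPy arrays.
doubled = values * 2
print(total, average, largest, middle)
print(doubled.tolist())    # [20, 40, 60, 80]
```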
Pandas Dataframe
• A Pandas dataframe is a collection of Series: a 2D data structure with rows and columns, effectively a spreadsheet.
• Let's create an example dataframe with the population and area values for some regions in Somalia.
• head() returns the first few rows of the dataframe. You can use tail() to see the last few rows.
• The dataframe is just like a spreadsheet sheet with indices (as with the series).
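The slide's own data is not recoverable from the text, so the sketch below builds a comparable dataframe under stated assumptions: the region names are real, but the population and area figures are placeholders, and the column names are guesses at what the slide used.

```python
import pandas as pd

# Placeholder figures for illustration only, not official statistics.
df = pd.DataFrame({
    "region": ["Awdal", "Banaadir", "Bari", "Gedo"],
    "population": [670_000, 1_650_000, 720_000, 510_000],
    "area": [21_000, 370, 70_000, 85_000],   # assumed to be in square km
})

print(df.head(2))   # first two rows
print(df.tail(2))   # last two rows
```

Called without an argument, head() and tail() return five rows.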
Indexing the dataframe records

• Since we did not supply any particular values as the index, a range of integers was used as the index for the dataframe.
• However, it may be convenient to set the region names as the indices for our example dataframe.
• The argument inplace=True is an instruction to modify the dataframe itself rather than return a modified copy. As a result, the dataframe now has two columns.
• Region names are now used as the dataframe indices.
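A minimal sketch of this step, assuming the region names live in a column called `region` (a hypothetical name) and using placeholder figures:

```python
import pandas as pd

# Placeholder figures for illustration only, not official statistics.
df = pd.DataFrame({
    "region": ["Awdal", "Banaadir", "Bari", "Gedo"],
    "population": [670_000, 1_650_000, 720_000, 510_000],
    "area": [21_000, 370, 70_000, 85_000],
})

# Move the 'region' column into the index, modifying df itself.
df.set_index("region", inplace=True)

print(df.columns.tolist())   # ['population', 'area']
print(df.index.tolist())     # ['Awdal', 'Banaadir', 'Bari', 'Gedo']
```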
Hands-on exercise 1
• We can create dataframes by supplying dictionaries with identical sets of keys as arguments to DataFrame().
• Can you represent each column (population and area) as a dictionary and create a dataframe from them?
Dataframe attributes
• Using these attributes, one can access the df information separately.
• This shows that dataframe values are just 2D NumPy arrays, which is why NumPy underpins Pandas and other data science libraries.
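The attributes on the slide are not visible in the text; the sketch below assumes they were `index`, `columns`, and `values`, and uses placeholder figures.

```python
import numpy as np
import pandas as pd

# Placeholder figures for illustration only, not official statistics.
df = pd.DataFrame(
    {"population": [670_000, 1_650_000, 720_000, 510_000],
     "area": [21_000, 370, 70_000, 85_000]},
    index=["Awdal", "Banaadir", "Bari", "Gedo"],
)

print(df.index)     # row labels
print(df.columns)   # column labels
print(df.values)    # the data as a 2D NumPy array

values_are_2d_array = isinstance(df.values, np.ndarray) and df.values.ndim == 2
print(values_are_2d_array)   # True
```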
Descriptive statistics
• The average population and area over all regions are 6.931368e+05 and 41079.333333, respectively.
• These statistics include the mean, quartiles, median, total number of observations, etc.
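Statistics of this kind are what describe() produces. The sketch below assumes that is the method the slide used; because the figures are placeholders, the numbers it prints differ from those quoted on the slide.

```python
import pandas as pd

# Placeholder figures for illustration only, not official statistics.
df = pd.DataFrame(
    {"population": [670_000, 1_650_000, 720_000, 510_000],
     "area": [21_000, 370, 70_000, 85_000]},
    index=["Awdal", "Banaadir", "Bari", "Gedo"],
)

stats = df.describe()   # count, mean, std, min, quartiles, max per column
print(stats)

mean_population = df["population"].mean()
print(mean_population)  # 887500.0
```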
Descriptive statistics
• The info() method provides a concise description of the dataframe.
• The shape attribute (written without parentheses) gives the shape of the dataframe as the number of rows and columns.
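A short sketch of both calls on a dataframe with placeholder figures. Note that `shape` is an attribute, so it takes no parentheses.

```python
import pandas as pd

# Placeholder figures for illustration only, not official statistics.
df = pd.DataFrame(
    {"population": [670_000, 1_650_000, 720_000, 510_000],
     "area": [21_000, 370, 70_000, 85_000]},
    index=["Awdal", "Banaadir", "Bari", "Gedo"],
)

df.info()            # prints index type, column names, non-null counts, dtypes

n_rows, n_cols = df.shape
print(df.shape)      # (4, 2)
```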
Matplotlib for plotting a dataframe

• Matplotlib is a comprehensive package for data visualization and comes as part of Anaconda.
• Pandas has a convenient integration with Matplotlib, which means that data contained in a dataframe can be plotted with plot().
• You can select the plot type of your choice (e.g., scatter, bar, box, pie, hist, …) to suit your data.
• Please see the resources at the end for more information on the various plots and arguments of the plot() function.
Matplotlib for plotting a dataframe

• Let us now plot the data contained in our dataframe, df, by simply calling its plot method.
• The logy=True argument scales the y axis logarithmically. Without it, the population and area values would be hard to read on one plot because their scales are very different.
• Bar plots are useful tools for viewing categorical variables.
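A minimal sketch of such a bar plot, using placeholder figures. The `Agg` backend line is an assumption that lets the code run without a display; in a notebook it can be dropped and `plt.show()` called instead.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend; remove in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder figures for illustration only, not official statistics.
df = pd.DataFrame(
    {"population": [670_000, 1_650_000, 720_000, 510_000],
     "area": [21_000, 370, 70_000, 85_000]},
    index=["Awdal", "Banaadir", "Bari", "Gedo"],
)

# One group of bars per region; logy=True puts the y axis on a log scale.
ax = df.plot(kind="bar", logy=True)
ax.set_ylabel("value (log scale)")
plt.tight_layout()
```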
Selecting dataframe columns

• Let us extract the area column (variable) from our dataframe.
• The extracted columns are of the Pandas Series type and can be stored in a different variable or processed separately.
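A short sketch of column selection with placeholder figures; the variable name `area` is chosen for illustration.

```python
import pandas as pd

# Placeholder figures for illustration only, not official statistics.
df = pd.DataFrame(
    {"population": [670_000, 1_650_000, 720_000, 510_000],
     "area": [21_000, 370, 70_000, 85_000]},
    index=["Awdal", "Banaadir", "Bari", "Gedo"],
)

area = df["area"]                        # a single column is a Series
both = df[["population", "area"]]        # a list of names gives a DataFrame

print(type(area).__name__)   # Series
print(area.max())            # 85000
```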
Selecting dataframe cells

• We can also extract a single cell value from a dataframe.
• Notice that the cell is accessed by its column name and row index. The index can be a number; it is written in square brackets [].
• You can update a cell in a similar way, e.g., to change the number. For assignments, df.loc[row, column] = value is the reliable form, because chained indexing may modify a temporary copy.
• Again, the extracted cell values can be processed separately.
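A sketch of reading and updating one cell, with placeholder figures. The exact syntax on the slide is not recoverable; the read below uses column name then row label, and the update uses `loc`.

```python
import pandas as pd

# Placeholder figures for illustration only, not official statistics.
df = pd.DataFrame(
    {"population": [670_000, 1_650_000, 720_000, 510_000],
     "area": [21_000, 370, 70_000, 85_000]},
    index=["Awdal", "Banaadir", "Bari", "Gedo"],
)

area_banaadir = df["area"]["Banaadir"]   # column name, then row label
second_area = df["area"].iloc[1]         # the same cell, by integer position

# Update the cell in place with a single loc call.
df.loc["Banaadir", "area"] = 380
print(area_banaadir, second_area, df.loc["Banaadir", "area"])
```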
Slicing the dataframe
• The iloc attribute is used to access rows and columns by their integer positions.
• ‘:’ means extract all columns (or all rows).
• The loc attribute is used to access rows and columns by their labels (here, strings).
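The two attributes can be compared side by side in a small sketch with placeholder figures. One difference worth noticing: an `iloc` slice excludes its end position, whereas a `loc` slice includes its end label.

```python
import pandas as pd

# Placeholder figures for illustration only, not official statistics.
df = pd.DataFrame(
    {"population": [670_000, 1_650_000, 720_000, 510_000],
     "area": [21_000, 370, 70_000, 85_000]},
    index=["Awdal", "Banaadir", "Bari", "Gedo"],
)

first_two = df.iloc[0:2, :]                        # rows 0 and 1, all columns
pop_by_label = df.loc["Awdal":"Bari", "population"]  # labels Awdal..Bari inclusive

print(first_two.index.tolist())      # ['Awdal', 'Banaadir']
print(pop_by_label.index.tolist())   # ['Awdal', 'Banaadir', 'Bari']
```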
Adding columns to a dataframe
• Remember that this is a vectorised (element-wise) math operation, just as with NumPy arrays.
• A ‘density’ column is now created and added to the dataframe.
• Banaadir has the highest density.
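A sketch of the density calculation, assuming density is population divided by area. The figures are placeholders, chosen so that Banaadir again comes out highest.

```python
import pandas as pd

# Placeholder figures for illustration only, not official statistics.
df = pd.DataFrame(
    {"population": [670_000, 1_650_000, 720_000, 510_000],
     "area": [21_000, 370, 70_000, 85_000]},
    index=["Awdal", "Banaadir", "Bari", "Gedo"],
)

# Element-wise division of two columns creates the new column.
df["density"] = df["population"] / df["area"]

densest = df["density"].idxmax()   # row label with the largest density
print(densest)                     # Banaadir
```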
Adding columns to a dataframe
• The backslash ‘\’ continues the list definition on the next line (inside brackets it is optional, since Python continues such lines implicitly).
• A new column ‘capital’ is now created and added to the dataframe.
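A sketch of adding a column from a list, with the list split over two lines by a backslash as the slide describes. The population and area figures are placeholders; the list must have one entry per row, in row order.

```python
import pandas as pd

# Placeholder figures for illustration only, not official statistics.
df = pd.DataFrame(
    {"population": [670_000, 1_650_000, 720_000, 510_000],
     "area": [21_000, 370, 70_000, 85_000]},
    index=["Awdal", "Banaadir", "Bari", "Gedo"],
)

# The backslash continues the list on the next line.
df["capital"] = ["Borama", "Mogadishu", \
                 "Bosaso", "Garbahaarreey"]

print(df.loc["Banaadir", "capital"])   # Mogadishu
```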
Adding rows to a dataframe
• This now adds a new row with index ‘M_Shabelle’ to the dataframe.
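One way to add such a row is to assign to a new label through `loc`; whether the slide did exactly this is an assumption, and all figures below, including those for the new row, are placeholders.

```python
import pandas as pd

# Placeholder figures for illustration only, not official statistics.
df = pd.DataFrame(
    {"population": [670_000, 1_650_000, 720_000, 510_000],
     "area": [21_000, 370, 70_000, 85_000]},
    index=["Awdal", "Banaadir", "Bari", "Gedo"],
)

# Assigning to a label that does not exist yet appends a row.
df.loc["M_Shabelle"] = [520_000, 22_000]   # population, area

print(df.tail(2))
```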
Conditional data selection
• The selected data is itself a dataframe.
• Such conditional extraction can be applied to any other dataframe column.
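A sketch of conditional selection with a boolean mask. The threshold of 600,000 and the figures are placeholders chosen for illustration.

```python
import pandas as pd

# Placeholder figures for illustration only, not official statistics.
df = pd.DataFrame(
    {"population": [670_000, 1_650_000, 720_000, 510_000],
     "area": [21_000, 370, 70_000, 85_000]},
    index=["Awdal", "Banaadir", "Bari", "Gedo"],
)

mask = df["population"] > 600_000   # boolean Series, one entry per row
large = df[mask]                    # keeps only the rows where mask is True

print(large.index.tolist())         # ['Awdal', 'Banaadir', 'Bari']
```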
Conditional updating
• This populates the entire new column with the single value ‘low’.
• The first index in loc specifies the rows to which the change applies, and the second specifies the column.
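A sketch of both steps together. The column name `density_status` appears later in the slides; the second value ‘high’, the threshold of 100, and the figures are assumptions made for illustration.

```python
import pandas as pd

# Placeholder figures for illustration only, not official statistics.
df = pd.DataFrame(
    {"population": [670_000, 1_650_000, 720_000, 510_000],
     "area": [21_000, 370, 70_000, 85_000]},
    index=["Awdal", "Banaadir", "Bari", "Gedo"],
)
df["density"] = df["population"] / df["area"]

# Step 1: fill the whole new column with one value.
df["density_status"] = "low"

# Step 2: overwrite only the rows that meet the condition.
df.loc[df["density"] > 100, "density_status"] = "high"

print(df["density_status"].tolist())   # ['low', 'high', 'low', 'low']
```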
• We can also use the apply() method to apply some operation to every row or every column in a dataframe.
• apply() takes a custom function as an argument; the custom function receives one row or one column at a time and can return a modified row or column.
• The axis argument indicates how the dataframe is processed: axis=0 (the default) passes each column to the function, and axis=1 passes each row.
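A sketch of apply() along both axes. The helper functions `col_range` and `density` are hypothetical names, and the figures are placeholders.

```python
import pandas as pd

# Placeholder figures for illustration only, not official statistics.
df = pd.DataFrame(
    {"population": [670_000, 1_650_000, 720_000, 510_000],
     "area": [21_000, 370, 70_000, 85_000]},
    index=["Awdal", "Banaadir", "Bari", "Gedo"],
)

def col_range(column):
    """Receive one column (a Series) and return max minus min."""
    return column.max() - column.min()

def density(row):
    """Receive one row (a Series) and return population per unit area."""
    return row["population"] / row["area"]

ranges = df.apply(col_range, axis=0)        # one result per column
df["density"] = df.apply(density, axis=1)   # one result per row

print(ranges["population"])   # 1140000
```

For simple arithmetic like the density calculation, the vectorised form df["population"] / df["area"] is usually faster; apply() is most useful when the logic does not reduce to column arithmetic.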
Deleting dataframe columns
• The drop() method can be used to remove rows or columns, depending on the axis we specify and the column or row we name.
• From this output, we can see that the density_status column is now removed.
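A minimal sketch of dropping a column, with placeholder data. Here axis=1 tells drop() that the name refers to a column.

```python
import pandas as pd

# Placeholder data for illustration only, not official statistics.
df = pd.DataFrame(
    {"population": [670_000, 1_650_000, 720_000, 510_000],
     "area": [21_000, 370, 70_000, 85_000],
     "density_status": ["low", "high", "low", "low"]},
    index=["Awdal", "Banaadir", "Bari", "Gedo"],
)

# axis=1 means "look for this name among the columns".
df.drop("density_status", axis=1, inplace=True)

print(df.columns.tolist())   # ['population', 'area']
```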
Deleting dataframe rows
• As with columns, using the inplace=True argument means we are updating the dataframe itself; as a result, the dataframe now has fewer records (rows).
• From this output, we can see that the M_Shabelle row is now removed.
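The same method removes a row when axis=0 is given with a row label. A minimal sketch with placeholder figures:

```python
import pandas as pd

# Placeholder figures for illustration only, not official statistics.
df = pd.DataFrame(
    {"population": [670_000, 1_650_000, 520_000],
     "area": [21_000, 370, 22_000]},
    index=["Awdal", "Banaadir", "M_Shabelle"],
)

# axis=0 means "look for this label among the rows".
df.drop("M_Shabelle", axis=0, inplace=True)

print(df.index.tolist())   # ['Awdal', 'Banaadir']
```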
Exploring categorical variables
• The unique() method returns the unique values of the column.
• The value_counts() method returns the frequency of each unique value, as a series.
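A sketch of both methods on a small categorical column. The column name `density_status` comes from the slides; its values here are placeholders.

```python
import pandas as pd

# Placeholder categorical values for illustration only.
status = pd.Series(
    ["low", "high", "low", "low"],
    index=["Awdal", "Banaadir", "Bari", "Gedo"],
    name="density_status",
)

categories = status.unique()     # distinct values, in order of appearance
counts = status.value_counts()   # Series: value -> number of occurrences

print(list(categories))   # ['low', 'high']
print(counts["low"])      # 3
```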
Exploring numerical variables
• The unique() method returns the unique values of the column.
• The value_counts() method returns the frequency of each unique value, as a series.
Visualizing numerical variables
• This plot shows that 5 regions have a population ranging from about 375K to approximately 620K.
• Histograms are useful tools for viewing the distribution of variables.
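A sketch of a histogram drawn from a series. The figures are placeholders, so the resulting bins differ from those described on the slide; the choice of 5 bins and the `Agg` backend line are also assumptions.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend; remove in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder figures for illustration only, not official statistics.
population = pd.Series(
    [380_000, 450_000, 510_000, 560_000, 620_000, 1_650_000],
    name="population",
)

# Split the value range into 5 equal-width bins and count values per bin.
ax = population.plot(kind="hist", bins=5)
ax.set_xlabel("population")
plt.tight_layout()
```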
[Figure: annotated boxplot marking the maximum, Q3, median, Q1, and minimum, with the IQR spanning Q1 to Q3]
• Boxplots are used to represent summary statistics (the five-number summary) and to compare the summaries of different datasets.
• The two variables (columns) could be drawn on the same plot, but they have been plotted separately because their scales differ.
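A sketch that computes a five-number summary and draws the two boxplots on separate axes. The figures are placeholders, and the side-by-side layout and `Agg` backend line are assumptions.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend; remove in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder figures for illustration only, not official statistics.
df = pd.DataFrame(
    {"population": [670_000, 1_650_000, 720_000, 510_000],
     "area": [21_000, 370, 70_000, 85_000]},
    index=["Awdal", "Banaadir", "Bari", "Gedo"],
)

# Five-number summary: minimum, Q1, median, Q3, maximum.
summary = df["population"].quantile([0, 0.25, 0.5, 0.75, 1])
print(summary)

# One axis per variable, because the two scales differ widely.
fig, axes = plt.subplots(1, 2, figsize=(8, 4))
df["population"].plot(kind="box", ax=axes[0])
df["area"].plot(kind="box", ax=axes[1])
plt.tight_layout()
```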
• The rot and fontsize arguments are used here to rotate and size the x labels.
• Line plots are primarily used for viewing continuous variables.
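A sketch of a line plot using the two arguments named above. The figures, the rotation of 45 degrees, and the font size of 12 are placeholders.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend; remove in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder figures for illustration only, not official statistics.
population = pd.Series(
    [670_000, 1_650_000, 720_000, 510_000],
    index=["Awdal", "Banaadir", "Bari", "Gedo"],
    name="population",
)

# rot rotates the tick labels; fontsize sets their size.
ax = population.plot(kind="line", rot=45, fontsize=12)
ax.set_ylabel("population")
plt.tight_layout()
```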
Visualizing categorical variables
• plt is an alias for Matplotlib's pyplot module. The loc argument to the legend function sets the location of the legend.
• Pie charts are primarily used for visualizing proportions, mostly of categorical variables.
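A sketch of a pie chart built from category counts, with the legend placed through `loc`. The category values, the `autopct` format, and the legend position are placeholders chosen for illustration.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend; remove in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder categorical values for illustration only.
status = pd.Series(["low", "high", "low", "low"], name="density_status")

counts = status.value_counts()          # low: 3, high: 1

# Each wedge is one category; autopct prints its percentage share.
ax = counts.plot(kind="pie", autopct="%1.0f%%")
ax.set_ylabel("")                       # hide the default axis label
plt.legend(loc="upper right")           # loc sets the legend position
```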
Hands-on exercise 2
• Suppose you have this data in a dictionary:
exam_data = {
    'name': ['Ali', 'Ahmed', 'Jama', 'Omar', 'Fatima', 'Mohamed',
             'Mohamud', 'Malin', 'Farah', 'Samad'],
    'score': [62.5, 79, 16.5, 65, 53, 81, 58, 45, 72, 66.5],
    'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
    'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'yes', 'no', 'yes']
}
• Create a dataframe from this data and retrieve the following subsets of data:
1. The first three rows
2. The following three rows
3. The score for 'Mohamed'
4. The scores of all students who qualify and who made just one attempt.
References & reading resources

Data Analysis with Pandas
• https://pandas.pydata.org/docs/getting_started/
• https://pandas.pydata.org/pandas-docs/stable/index.html
• https://www.youtube.com/watch?v=5JnMutdy6Fw
• https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html
• Chapter 3, Python Data Science Handbook, Jake VanderPlas
• Chapters 5-6, Python for Data Analysis, Wes McKinney
Matplotlib
• https://matplotlib.org/tutorials/index.html