
Jamhuriya University of Science & Technology (JUST)

CA416 - Principles of Data Science


Chapter 5

Practical Data Exploration and Visualization with the Pandas and Matplotlib packages

Lecturer: XYZ


Learning outcomes
By the end of this lecture, you will be able to:
 Describe core data analysis concepts, including
dataframes and data exploration.
 Create and access the main data structures in Python,
such as series and dataframes.
 Perform exploratory data analysis in Python using
the Pandas library.
 Build data visualizations with the Matplotlib library.


Data science workflow – recap

Source: https://www.dataquest.io/blog/what-is-data-science/

Types of variables
A variable is any characteristic or attribute that can be
measured quantitatively or qualitatively.

Variable
  Numeric
    Continuous – time, age, etc.
    Discrete – people, houses, etc.
  Categorical
    Ordinal – grades, rating, etc.
    Nominal – nationality, race, etc.

Exploratory data analysis


 Exploratory Data Analysis (EDA) is an initial exploration of
the data to understand its characteristics, patterns,
correlations, and to identify any anomalies in the data.
 Statistical data summarization and graphical visualizations
are two primary forms of EDA.
 EDA is typically applied before formal data modeling and
helps inform the development of appropriate statistical models
or machine learning models.


Pandas
 Data analysis is normally performed over data stored in a
tabular format, e.g., Excel spreadsheet.
 Each observation is recorded in a row and each of its
attributes is recorded in a column (e.g., students and
their grades in each assignment).
 Pandas is a Python library for manipulating data in tabular
format and is included in the Anaconda Python distribution.
 In Pandas, data manipulation can be much more varied than in a
spreadsheet; it can be programmed more easily and performed more
efficiently, which is critical in large-scale projects.


Pandas
• Pandas is a high-level library built on NumPy, providing
tools that make it easier to work with real-world data:
– load data from a variety of sources (e.g. CSV, JSON, SQL).
– update (add, modify, delete etc) data
– select subsets of the data
– group data by a certain criterion
– clean the data and handle missing values (NaNs)
– visualize the data using different plotting tools
– perform statistical analysis of the data and
– export the data to other file formats or databases
• Pandas provides two main data structures, namely,
DataFrame (equivalent to a spreadsheet) and a Series
(equivalent to a column in a spreadsheet).

Pandas Series
 A Series is just a column in a dataframe or spreadsheet.

A Pandas series with indices is created from a list.

The series attributes can also be accessed separately.

Notice that the values are just NumPy arrays.
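The points above can be sketched as follows. The numbers and index labels are illustrative assumptions, not the values shown on the original slide.

```python
import numpy as np
import pandas as pd

# Illustrative values and labels only (an assumption, not the slide's data).
scores = pd.Series([62.5, 79.0, 16.5], index=["a", "b", "c"])

print(scores)         # values displayed next to their index labels
print(scores.index)   # the index labels on their own
print(scores.values)  # the underlying values on their own

# For a numeric series, the values are held in a NumPy array.
values_are_numpy = isinstance(scores.values, np.ndarray)
```

If no index is supplied, Pandas falls back to a range of integers starting at 0.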

Operations on series data

Please read the Pandas documentation for a detailed list of the
statistical functions applicable to series, as given in the
additional resources slide.
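A few common operations on a series might look like this. It is a minimal sketch on made-up numbers and covers only a handful of the available functions.

```python
import pandas as pd

# Made-up numbers for illustration.
s = pd.Series([10, 20, 30, 40])

total = s.sum()      # sum of all values
mean = s.mean()      # arithmetic mean
largest = s.max()    # maximum value
spread = s.std()     # sample standard deviation
doubled = s * 2      # arithmetic is applied element by element
```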


Pandas Dataframe
 A Pandas dataframe is a collection of Series (a 2D data
structure with rows and columns, effectively a spreadsheet).
 Let's create an example dataframe with the population and area
values for some regions in Somalia

The head() method returns the first few rows of the dataframe. You can
use tail() to see the last few rows.

The dataframe is just like a spreadsheet sheet with indices (as in
the series).
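A dataframe of this kind could be built as sketched below. The population and area figures are round placeholders chosen for illustration, not the lecture's data, and the region names other than Banaadir are also an assumption.

```python
import pandas as pd

# Placeholder figures for illustration only; not real statistics.
df = pd.DataFrame({
    "region": ["Banaadir", "Bari", "Gedo", "Mudug"],
    "population": [1_000_000, 500_000, 300_000, 400_000],
    "area": [400, 70_000, 60_000, 72_000],
})

first_two = df.head(2)   # first two rows
last_two = df.tail(2)    # last two rows
print(df.head())         # with no argument, head() returns up to five rows
```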

Indexing the dataframe records


 Since we did not supply any particular values as the index,
a range of integers was used as the index for the dataframe
 However, it may be convenient to set the region names as the
indices for our example dataframe.

The argument inplace=True is an instruction to modify the dataframe
itself; as a result, the dataframe will now have two columns.

Region names are now used as the dataframe indices.
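Setting the region names as the index might look like this. The column name "region" and all figures are assumptions made for the sketch.

```python
import pandas as pd

# Placeholder figures for illustration only; not real statistics.
df = pd.DataFrame({
    "region": ["Banaadir", "Bari", "Gedo"],
    "population": [1_000_000, 500_000, 300_000],
    "area": [400, 70_000, 60_000],
})

# Move the 'region' column into the index, modifying df itself.
df.set_index("region", inplace=True)

remaining_columns = list(df.columns)   # 'region' is no longer a column
print(df)
```

An equivalent form without inplace is `df = df.set_index("region")`, which returns a new dataframe instead of modifying the existing one.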

Hands-on exercise 1
 We can create dataframes by supplying dictionaries with
identical sets of keys as arguments to DataFrame()

 Can you represent each column (population and area) as a
dictionary and create a dataframe from them?


Dataframe attributes

Using these attributes, one can access the df information separately.

This shows that dataframe values are just 2D NumPy arrays, and this is
why NumPy underpins Pandas and other data science libraries.
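The attributes in question are presumably index, columns and values; the slide's code is not available, so this is an assumption. A minimal sketch on placeholder data:

```python
import numpy as np
import pandas as pd

# Placeholder figures for illustration only; not real statistics.
df = pd.DataFrame(
    {"population": [1_000_000, 500_000, 300_000],
     "area": [400, 70_000, 60_000]},
    index=["Banaadir", "Bari", "Gedo"],
)

print(df.index)     # row labels
print(df.columns)   # column labels
print(df.values)    # the data as a 2D array

# With purely numeric columns, values is a 2D NumPy array.
is_numpy = isinstance(df.values, np.ndarray)
array_shape = df.values.shape
```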


Descriptive statistics

The average population and area across all regions are
6.931368e+05 and 41079.333333, respectively.

These statistics include the mean, quartiles, median,
total number of observations, etc.
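Statistics like these are typically obtained with mean() and describe(). The sketch below uses placeholder figures, so its output will not match the averages quoted above.

```python
import pandas as pd

# Placeholder figures for illustration only; not the lecture's data.
df = pd.DataFrame(
    {"population": [1_000_000, 500_000, 300_000],
     "area": [400, 70_000, 60_000]},
    index=["Banaadir", "Bari", "Gedo"],
)

means = df.mean()         # one average per numeric column
summary = df.describe()   # count, mean, std, min, quartiles and max

print(means)
print(summary)
```

The "50%" row of describe() is the median.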


Descriptive statistics

The info() method provides a concise description of the dataframe.

The shape attribute (not a method, so it is written without
parentheses) gives the shape of the dataframe as the number of rows
and columns.
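As a short sketch on placeholder data, info() is called with parentheses while shape is read as a plain attribute:

```python
import pandas as pd

# Placeholder figures for illustration only; not real statistics.
df = pd.DataFrame(
    {"population": [1_000_000, 500_000, 300_000],
     "area": [400, 70_000, 60_000]},
    index=["Banaadir", "Bari", "Gedo"],
)

df.info()                   # prints index type, column dtypes, non-null counts
n_rows, n_cols = df.shape   # shape is a (rows, columns) tuple
```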


Matplotlib for plotting a dataframe


• Matplotlib is a comprehensive package for
data visualization, and comes as part of Anaconda.
 Pandas has a convenient integration with matplotlib, which
means that data contained in a dataframe can be plotted with
plot():
• You can select the plot type of your choice ( e.g., scatter, bar,
boxplot, pie, hist, …) corresponding to your data
• Please see the resource at the end for more information on
various plots and arguments of the plot() function


Matplotlib for plotting a dataframe


• Let us now plot the data contained in our dataframe, df, by
simply calling its plot method:

The logy=True argument scales the y values logarithmically. Without
it, the columns would be hard to compare because their scales are
very different.

Bar plots are useful tools for viewing categorical variables.
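A bar plot with a logarithmic y-axis might be produced as sketched here. The data are placeholders, and the "Agg" backend is selected only so the sketch runs without a display; in a notebook that line can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for this sketch
import pandas as pd

# Placeholder figures for illustration only; not real statistics.
df = pd.DataFrame(
    {"population": [1_000_000, 500_000, 300_000],
     "area": [400, 70_000, 60_000]},
    index=["Banaadir", "Bari", "Gedo"],
)

# One group of bars per region, one bar per column, log-scaled y-axis.
ax = df.plot(kind="bar", logy=True)
ax.set_ylabel("value (log scale)")
```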

Selecting dataframe columns


• Let us extract the area column/variable from our
dataframe.

The extracted columns are of the Pandas series type and can be stored
in a different variable or processed separately.
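Column extraction can be sketched as below, again on placeholder data. Single brackets give a series, while a list of names in double brackets gives a dataframe.

```python
import pandas as pd

# Placeholder figures for illustration only; not real statistics.
df = pd.DataFrame(
    {"population": [1_000_000, 500_000, 300_000],
     "area": [400, 70_000, 60_000]},
    index=["Banaadir", "Bari", "Gedo"],
)

area = df["area"]           # a single column comes back as a Series
area_frame = df[["area"]]   # a list of column names gives a DataFrame

total_area = area.sum()     # the series can be processed on its own
```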

Selecting dataframe cells


• We can also extract a single cell value of a dataframe.

Notice that the cell is accessed by its column name and row index.
The index can also be a number; in either case it is written in
square brackets [].

You can use the same syntax to update the cell, e.g. to change the
number.

Again, the extracted cell values can be processed separately.
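Reading and updating a cell might look like the following sketch on placeholder data. The update is written with loc here, which is an assumption about style rather than the slide's code: recent Pandas versions discourage assigning through two chained brackets.

```python
import pandas as pd

# Placeholder figures for illustration only; not real statistics.
df = pd.DataFrame(
    {"population": [1_000_000, 500_000, 300_000],
     "area": [400, 70_000, 60_000]},
    index=["Banaadir", "Bari", "Gedo"],
)

value = df["area"]["Banaadir"]           # column name first, then row label
same_value = df.loc["Banaadir", "area"]  # equivalent single-step lookup

# Update the cell in one step (row label, column name).
df.loc["Banaadir", "area"] = 410
updated = df.loc["Banaadir", "area"]
```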



Slicing the dataframe

The iloc attribute is used to access the rows and columns by their
integer positions. A ':' on its own means extract all columns (or all
rows).

The loc attribute is used to access the rows and columns by their
labels, here the string indices.
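The difference between iloc and loc can be sketched as follows, using placeholder data:

```python
import pandas as pd

# Placeholder figures for illustration only; not real statistics.
df = pd.DataFrame(
    {"population": [1_000_000, 500_000, 300_000],
     "area": [400, 70_000, 60_000]},
    index=["Banaadir", "Bari", "Gedo"],
)

first_two = df.iloc[0:2, :]    # rows at positions 0 and 1, all columns
first_col = df.iloc[:, 0]      # all rows, column at position 0

bari_pop = df.loc["Bari", "population"]           # one cell by labels
subset = df.loc[["Banaadir", "Gedo"], ["area"]]   # chosen rows and columns
```

Note that an iloc slice excludes its end position, whereas a loc slice written with labels includes both ends.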

Adding columns to a dataframe

Remember that this is a vectorised (element-wise) math operation,
just as with NumPy arrays.

The ‘density’ column is now created and added to the dataframe.

Banaadir clearly has the highest density.
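A derived column such as density can be computed in one vectorised step. This sketch assumes density means population divided by area and uses placeholder figures, so the resulting numbers are illustrative only.

```python
import pandas as pd

# Placeholder figures for illustration only; not real statistics.
df = pd.DataFrame(
    {"population": [1_000_000, 500_000, 300_000],
     "area": [400, 70_000, 60_000]},
    index=["Banaadir", "Bari", "Gedo"],
)

# Element-wise division: one density value per row, no explicit loop.
df["density"] = df["population"] / df["area"]

densest = df["density"].idxmax()   # label of the row with the largest density
```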

Adding columns to a dataframe

The backslash ‘\’ allows the list definition to continue on the
next line.

A new column ‘capital’ is now created and added to the dataframe.


Adding rows to a dataframe

This now adds a new row with index ‘M_Shabelle’ to the dataframe.
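One way to add such a row is to assign to a new label through loc, as sketched here. Whether the slide used this form is an assumption, and the figures for the new row are placeholders.

```python
import pandas as pd

# Placeholder figures for illustration only; not real statistics.
df = pd.DataFrame(
    {"population": [1_000_000, 500_000, 300_000],
     "area": [400, 70_000, 60_000]},
    index=["Banaadir", "Bari", "Gedo"],
)

# Assigning to a label that does not exist yet appends a new row.
# The values follow the column order: population, then area.
df.loc["M_Shabelle"] = [450_000, 22_000]
```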


Conditional data selection

The selected data is itself a dataframe.

Such conditional extraction can be applied to any other dataframe
column.
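Conditional selection works by passing a boolean series inside the square brackets. The threshold below is a hypothetical choice, and the data are placeholders.

```python
import pandas as pd

# Placeholder figures for illustration only; not real statistics.
df = pd.DataFrame(
    {"population": [1_000_000, 500_000, 300_000],
     "area": [400, 70_000, 60_000]},
    index=["Banaadir", "Bari", "Gedo"],
)

mask = df["population"] > 400_000   # boolean Series, one value per row
large = df[mask]                    # keeps only the rows where mask is True

# Conditions can be combined with & (and) or | (or), each in parentheses.
large_and_small_area = df[(df["population"] > 400_000) & (df["area"] < 1_000)]
```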

Conditional updating

This populates the entire new column with the single value ‘low’.


Conditional updating

The first index in loc specifies the rows to which the change applies,
and the second argument specifies the column.
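A conditional update with loc might look like the following sketch. The density threshold and the label 'high' are hypothetical choices; only the column name density_status and the default 'low' come from the lecture.

```python
import pandas as pd

# Placeholder figures for illustration only; not real statistics.
df = pd.DataFrame(
    {"population": [1_000_000, 500_000, 300_000],
     "area": [400, 70_000, 60_000]},
    index=["Banaadir", "Bari", "Gedo"],
)
df["density"] = df["population"] / df["area"]

# Step 1: fill the whole new column with a single default value.
df["density_status"] = "low"

# Step 2: overwrite only the rows that satisfy the condition.
# First argument: which rows; second argument: which column.
df.loc[df["density"] > 100, "density_status"] = "high"
```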


Conditional updating
 We can also use the apply() function to apply some
operation to every row or every column in a dataframe.
 The function takes a custom function as an argument, the
custom function takes either a row or a column at a time and
can return a modified row or column:


Conditional updating

The axis argument indicates whether the custom function receives one
column at a time (axis=0, the default) or one row at a time (axis=1).
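The two directions of apply() can be sketched as below. The function name density_label and its threshold are hypothetical, and the data are placeholders.

```python
import pandas as pd

# Placeholder figures for illustration only; not real statistics.
df = pd.DataFrame(
    {"population": [1_000_000, 500_000, 300_000],
     "area": [400, 70_000, 60_000]},
    index=["Banaadir", "Bari", "Gedo"],
)
df["density"] = df["population"] / df["area"]


def density_label(row):
    """Label one row as 'high' or 'low' (hypothetical threshold)."""
    return "high" if row["density"] > 100 else "low"


# axis=1: the function is called once per row.
df["density_status"] = df.apply(density_label, axis=1)

# axis=0: the function is called once per column.
column_max = df[["population", "area"]].apply(lambda col: col.max(), axis=0)
```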


Deleting dataframe columns

The drop() method can be used to remove rows or columns, depending on
the axis we specify and the column/row we name.

From this output, we can see that the density_status column is now
removed.
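Dropping a column could be written as in this sketch on placeholder data, where axis=1 tells drop() that the name refers to a column:

```python
import pandas as pd

# Placeholder data for illustration only; not real statistics.
df = pd.DataFrame(
    {"population": [1_000_000, 500_000, 300_000],
     "area": [400, 70_000, 60_000],
     "density_status": ["high", "low", "low"]},
    index=["Banaadir", "Bari", "Gedo"],
)

# axis=1 means the label names a column; inplace=True modifies df itself.
df.drop("density_status", axis=1, inplace=True)

columns_left = list(df.columns)
```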


Deleting dataframe rows

As with columns, using the inplace=True argument means we are updating
the dataframe itself; as a result, the dataframe will now have fewer
records (rows).

From this output, we can see that the M_Shabelle row is now removed.
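Dropping a row follows the same pattern with axis=0. The sketch below uses placeholder figures.

```python
import pandas as pd

# Placeholder figures for illustration only; not real statistics.
df = pd.DataFrame(
    {"population": [1_000_000, 500_000, 450_000],
     "area": [400, 70_000, 22_000]},
    index=["Banaadir", "Bari", "M_Shabelle"],
)

# axis=0 means the label names a row; inplace=True modifies df itself.
df.drop("M_Shabelle", axis=0, inplace=True)

rows_left = list(df.index)
```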


Exploring categorical variables

The unique() function returns the unique values of the column.

The value_counts() method returns the frequency of each unique value.

The value_counts() method returns a series.
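Both calls can be sketched on a small categorical column. The labels here are placeholders.

```python
import pandas as pd

# Placeholder categorical data for illustration.
status = pd.Series(["high", "low", "low", "low"], name="density_status")

distinct = status.unique()       # each distinct value once
counts = status.value_counts()   # Series: value -> number of occurrences

print(distinct)
print(counts)
```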

Exploring numerical variables

The unique() function returns the unique values of the column.

The value_counts() method returns the frequency of each unique value.

The value_counts() method returns a series.

Visualizing numerical variables

This plot shows that 5 regions have a population ranging from about
375K to approximately 620K.

Histograms are useful tools for viewing the distribution of variables.
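A histogram of a numeric column might be drawn as in this sketch. The population values and the number of bins are assumptions, and the "Agg" backend is chosen only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for this sketch
import pandas as pd

# Placeholder populations for illustration only; not real statistics.
population = pd.Series(
    [1_000_000, 500_000, 300_000, 400_000, 450_000, 600_000],
    name="population",
)

# Each bar counts how many values fall inside one bin.
ax = population.plot(kind="hist", bins=5)
ax.set_xlabel("population")
```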

Visualizing numerical variables

Boxplot elements, from top to bottom: maximum, Q3, median, Q1,
minimum. The IQR is the span from Q1 to Q3.

Boxplots are used to represent summary statistics (the five-number
summary) and to compare the summaries of different datasets.

The two variables (columns) could be drawn on the same plot, but they
have been plotted separately since their scales differ.
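Plotting the two columns as separate boxplots could be done as sketched below, on placeholder data. The side-by-side layout and the "Agg" backend are assumptions made for this example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for this sketch
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder figures for illustration only; not real statistics.
df = pd.DataFrame(
    {"population": [1_000_000, 500_000, 300_000],
     "area": [400, 70_000, 60_000]},
    index=["Banaadir", "Bari", "Gedo"],
)

# One axes per column so that each keeps its own y scale.
fig, axes = plt.subplots(1, 2)
df["population"].plot(kind="box", ax=axes[0])
df["area"].plot(kind="box", ax=axes[1])

# The five-number summary behind the population box.
five_numbers = df["population"].quantile([0, 0.25, 0.5, 0.75, 1])
```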

Visualizing numerical variables

The rot and fontsize arguments are used here to rotate and size the
x labels.

Line plots are primarily used for viewing continuous variables.

Visualizing categorical variables

plt is an alias for the matplotlib pyplot module. The loc argument
to the legend function sets the location of the legend.

Pie charts are primarily used for visualizing proportions, mostly of
categorical variables.
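A pie chart with a positioned legend might look like this sketch. The data are placeholders, the legend location is an arbitrary choice, and the "Agg" backend is used only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for this sketch
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder populations for illustration only; not real statistics.
population = pd.Series(
    [1_000_000, 500_000, 300_000],
    index=["Banaadir", "Bari", "Gedo"],
    name="population",
)

fig, ax = plt.subplots()
population.plot(kind="pie", ax=ax)   # one wedge per region
plt.legend(loc="upper left")         # loc sets where the legend is placed
```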

Hands-on exercise 2
 Suppose you have this data in a dictionary:
exam_data = {
    'name': ['Ali', 'Ahmed', 'Jama', 'Omar', 'Fatima', 'Mohamed',
             'Mohamud', 'Malin', 'Farah', 'Samad'],
    'score': [62.5, 79, 16.5, 65, 53, 81, 58, 45, 72, 66.5],
    'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
    'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'yes', 'no', 'yes']
}
 Create a dataframe from this data and retrieve the following subsets of data:
1. The first three rows
2. The following three rows
3. The score for 'Mohamed'
4. The scores of all students who qualify and who made just one
attempt.

References & reading resources


Data Analysis with Pandas
• https://pandas.pydata.org/docs/getting_started/
• https://pandas.pydata.org/pandas-docs/stable/index.html
• https://www.youtube.com/watch?v=5JnMutdy6Fw
• https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html
• Chapter 3, Python Data Science Handbook, Jake VanderPlas
• Chapters 5-6, Python for Data Analysis, Wes McKinney

Matplotlib
• https://matplotlib.org/tutorials/index.html