
Analyzing Data Using Python: Data Analytics Using Pandas

Built on the Python programming language, pandas provides a flexible and open
source tool for data manipulation. In this course, you'll develop the skills you need to
get started with this library. You'll begin by installing pandas from a Jupyter
notebook using pip.

Next, you'll instantiate pandas objects, including a Series and a DataFrame, and
practice several ways of instantiating DataFrames - for instance, from lists, from
dictionaries of lists, and from tuples created from lists using the zip function.

You round out this course by performing filter operations on DataFrames using the
loc and iloc operations - fundamental techniques used to access specific rows and
columns. You'll use loc to identify rows based on labels and iloc to access rows based
on the index offset position starting from 0.


Contents
Analyzing Data Using Python: Data Analytics Using Pandas
1. Course Overview
2. Creating a pandas Series
3. Performing Basic Series Operations
4. Performing Casting Operations Using Series
5. Performing Logical Operations Using Series
6. Creating and Editing pandas DataFrames
7. Performing Data Lookup Using DataFrames
8. Constructing DataFrames from Dictionaries and Tuples
9. Course Summary


1. Course Overview
Topic title: Course Overview

Hi, and welcome to this course on getting started with data analytics using pandas. My
name is Vitthal Srinivasan and I will be your instructor for this course.

Your host for this session is Vitthal Srinivasan. He is a software engineer and big
data expert.

A little bit about myself first. I did my master's at Stanford University and have
worked at various companies, including Google and Credit Suisse. I presently work
for Loonycorn, a studio for high-quality video content. Pandas is an extremely
popular and powerful Python library used for working with tabular and time series
data.

The key abstraction in Pandas is the data frame object, which encapsulates data
organized into named columns and uniquely identified rows. This, of course, is exactly
how spreadsheets as well as relational databases represent data, and it is also how
many data analysts and computer scientists are accustomed to modeling data
mentally. This universality in design, coupled with a natural syntax that combines the
best elements of Pythonic as well as R-style programming, and constantly expanding
APIs, all help explain the meteoric rise in popularity of Pandas over the last decade.

In this course we will begin by opening up a Jupyter notebook from a shell utility and
use the pip package installer to install Pandas. Once we confirm that our versions of
Pandas and Python are up to date, we move on and instantiate our first pandas
objects, Series as well as DataFrame objects. We then go deep into the use of Pandas
series objects, each of which represents a single column of data. We create such
objects in various ways, we perform logical operations on them, and then we aggregate
data in a pandas series object using the agg function.

We then move on to data frames and explore the many different methods of
instantiating data frame objects - from dictionaries of lists, and from tuples which in
turn have been created from lists using the zip method. Finally, we turn our attention to
basic data filter and access operations and round out the course by performing filter
operations using the loc and iloc methods. These are fundamental techniques to
access data in rows and columns.


As we shall see, loc is used to identify rows based on labels while iloc is used to
access rows based on the index offset position starting from 0. By the end of this
course, you will have a thorough grasp of the Series and DataFrame objects, and of the
methods used to instantiate these objects and perform basic operations on them.
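As a quick preview of those two access methods, here is a minimal sketch; the DataFrame below is illustrative and is not taken from the course demos:

import pandas as pd

df = pd.DataFrame({'name': ['Bruce', 'Tony', 'Steve']}, index=[10, 20, 30])
df.loc[20]   # label-based lookup: the row whose index label is 20 ('Tony')
df.iloc[0]   # position-based lookup: the first row, at offset 0 ('Bruce')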


2. Creating a pandas Series


Topic title: Creating a pandas Series. Your host for this session is Vitthal Srinivasan

Let's go ahead and get started with our exploration of Pandas from Jupyter and
IPython. This particular demo is mostly going to be devoted to preliminaries.

A terminal window is open on the screen. The prompt is:


~/Project/Skillsoft/SkillsoftPythonDataAnalytics>

We'll start by ensuring that we have the correct versions of Python and Pandas
installed. We will then begin with Pandas series objects. Now the basic construct in
Pandas is the data frame, which encapsulates data organized into rows and columns.

And you can think of a Pandas series object as encapsulating one of those columns.
So you can think of a Pandas series object as a building block out of which data
frames are composed. At this point we have before us a blank terminal window.
These demos are being executed on a Mac machine. But the equivalent of such a
terminal window will be available on Windows machines as well.

Let's begin by examining the version of Jupyter that we are running. The
command to do this is on screen now: jupyter --version. This displays a long list of
version numbers of the different binaries associated with Jupyter and IPython. The
single most important one of these is the version of ipython, and you can see that
that is 7.16.1. You can also see that the version of jupyter-notebook is 6.0.3. Let's run
the clear command. And then let's get started.

Let's open up Jupyter notebook. The command for this is on screen now: jupyter
notebook. Now if you've not encountered Jupyter notebooks before, these are a
handy way of executing Python code, along with associated comments and
markdown, from a browser-based interface. The way in which Jupyter notebooks are run
is by first starting up the Jupyter notebook server. That's exactly what we've done
on screen now. Once the server is up and running, we can then go ahead and run our
notebooks from a browser.

You can see in the informational comments displayed when we run Jupyter notebook
that there are different ways of accessing notebooks listed there. First of these is by
opening a file in a browser, and then the name of the file follows. Or we can copy


and paste one of these URLs and there are two URLs given below. The first of those is
http://localhost, followed by :8888. And that in turn is followed by a token.

The second URL looks pretty similar that also has the port number that's 8888 as well
as the token, but it's a little different because it has the IP address 127.0.0.1. That
refers to the IP address of the current machine. That's equivalent to localhost. Let's
go ahead and copy one of these URLs. Let's pick the second one at random. And then
let's switch to a browser and paste this URL into the browser's address bar.

When we do this, we are hitting the Jupyter notebook server which we just started
up from our terminal window a moment ago. Let's take a moment to orient
ourselves. This Jupyter server shows three tabs: Files, Running, and
Clusters. We are currently on the tab titled Files. And you can see that down below it
has a view of the file system. That view starts from the current directory, which
happens to be whatever directory we were in when we ran the jupyter notebook
command.

Here inside that current directory, there's just one subdirectory and that's called
Datasets. Let's go ahead and click on that. You can see that these are all clickable
links. When we click on the Datasets link, we effectively navigate into the Datasets
folder. There are several files visible inside this folder. We can quickly scroll
through and take a look at them; once we are satisfied, we can scroll back up and click
on the little folder icon that's towards the top of the screen.

And that takes us one level back up in the directory hierarchy. And this is back to
where we had just the one folder in there called Datasets. Let's begin our Python
coding. We do this by clicking on the New button over on the extreme right. This
gives us a choice of various different types of actions. We could choose to open a
new Python 3 notebook, or we might just be interested in creating a new text file,
creating a new folder, or launching a terminal window. Here, it is indeed a Python 3
notebook that we are looking to get started with.

So that's the option we click on. So this launches a new Jupyter notebook. You can
see up top, towards the left, that this is untitled. We can change that and give this
notebook a title, by clicking on the word Untitled. This brings up a little dialogue
which allows us to rename our notebook. So we click on the little blue Rename
button over on the bottom right, and then enter the new name. And we are going to
call this SeriesAndDataFrames. This takes us back to our blank Jupyter notebook.

/conversion/tmp/activity_task_scratch/549519173.docx
14-Oct-21 549519173.docx 7

You can see that up top are a list of menu items: File, Edit, View, Insert, Cell, Kernel,
Widgets and Help. Below that is a little ribbon with icons for common operations,
such as saving the notebook; then there's the plus icon that adds a new
input cell; there is a pair of scissors which allows us to cut cells; and various
other similar icons. For now, let's click on the View menu and choose the Toggle
Header option. Once we are done with that, we go back to the View menu and
choose the Toggle Toolbar option.

This is going to allow us to free up a little bit of screen real estate. We are now ready
to run our first command from inside our Jupyter notebook. And this is a pip install
command. Now, pip is a little utility that's used in order to install modules. These
modules will then be available for use within Python. Here, of course, the module
that we are looking to install is pandas. A couple of points are worth noting. First
up, note the exclamation point before the word pip. That exclamation point is used
whenever we'd like to execute a shell utility command from within Jupyter
notebooks.

So that exclamation point tells Jupyter that whatever's coming up next is a Unix
command and not a piece of pure Python. Then there's the -U flag over on the
extreme right of the statement. That -U flag is our way of requesting that the pandas
module be upgraded if it already exists but is out of date. It's a pretty standard
command; if you want to install a different library, you just specify the name of that
library in place of pandas. We're ready to execute this
command and for that we use the keyboard shortcut Shift+Enter.
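As narrated, the cell reads as follows, with the flag at the end of the statement:

!pip install pandas -U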

This is important: anytime you try to run code from within a Jupyter notebook,
you've got to hit Shift+Enter and not Enter. So Shift+Enter triggers the execution of
the statement. As you can see from the return values displayed down
below, this requirement is already up to date; we already have Pandas
installed and ready for use. Also note that as soon as we run this command, ipython
assigns a number to the input control. So you can see that there is the number 1
within square brackets next to the word In.

That number is only displayed after the command has finished running. So for long
running commands, this is a good way of testing whether our command is still being
executed or whether it's done. Now let's move to the next input box, this one
doesn't have a number next to it yet, and let's type out three statements: import
pandas as pd, import sys, and import numpy as np. NumPy is a matrix manipulation


library. This is a really powerful and commonly used library. Here we are requesting
Python to make this library available and we are renaming the module as np.

np is a pretty standard alias for the NumPy module. And similarly, pd is a pretty
standard alias for the pandas module. We are also importing a module named sys.
This is a module which allows us to access system information from within our
Python programs. Let's hit Shift+Enter. We can now see that this statement is
executed, and the number 2 is assigned to this statement. Now we can go ahead and
actually make use of the sys module. So in the next input, we type:
print("Python Version: " + sys.version)

This is a handy way of checking exactly what version of Python we're executing all of
these commands on. So yet another Shift+Enter, and that tells us that the Python
version is 3.8.3. Let's run a similar command to examine the version of Pandas.

The command is: print("Pandas Version: " + pd.__version__)

This makes use of pd.__version__. This is known as an attribute; here we are
examining the version attribute of the pandas module. Shift+Enter will execute this
command and we can see that we are running version 1.1.2 of Pandas.

We now have all of the required preliminaries out of the way. Let's scroll up and
instantiate our first Pandas series. On screen now in statement five, we have not one
but two Python commands. The first of these Python commands, creates a Pandas
series and stores it in a variable called random_series.

random_series = pd.Series(np.random.randn(50))

The second Python command invokes the head method on this object called
random_series.

random_series.head()

What is a random series? Well, let's look closely at the right hand side of the equal to
sign. This is a Pandas series.

We know this because of the pd.Series. And this is a series consisting of 50 random
numbers. Those 50 random numbers were generated by invoking NumPy's
random.randn method. This randn method is used to generate numbers from a
standard normal distribution. This is a normal distribution, which has mean 0 and


standard deviation 1. There are 50 such numbers that we wish to draw from the
standard normal distribution. And those 50 numbers have been placed into a Pandas
series and stored in a variable called random_series.

The .head method is a way of examining the first five elements of a Pandas series.
And those first five numbers have been displayed in the output of statement five.
That output has some interesting features. So let's come back to it in the demo
coming up next.


3. Performing Basic Series Operations


Topic title: Performing Basic Series Operations. Your host for this session is Vitthal
Srinivasan

In this demo, let's pick up right from where we left off at the end of the last one.
That demo was mostly about preliminaries: making sure that we had Jupyter
Notebook up and running, and checking on the versions of Python and Pandas that
we were running. Now we've turned our attention squarely to Pandas, and we are
examining the first pandas construct, that's the Pandas series. On screen now you
can see the first Pandas series that we created.

This series consists of a list of 50 randomly generated numbers. Those 50 randomly
generated numbers came from a NumPy method called random.randn. The output is
visible down below. The Pandas series has numbers. Each of those numbers is of a
particular type. That type is called the dtype. That's short for data type. And the
dtype here is float64.

Also note the numbers, 0, 1, 2, 3, and 4, which appear to the left of the actual data
contained within our random series. These numbers are the index values. We'll have a
lot more to say on index values in the upcoming demos. But you can think of these as
identifiers for each row. Here, each row in our Pandas series has an index value, the
first index value is 0, and the last is 49. We'll be able to examine that in just a
moment. Finally, also note that a Pandas series consists of a single column. And that
single column has no name.

This is going to be quite different from a Pandas DataFrame. As we shall see in
upcoming demos, Pandas dataframes consist of multiple columns, and each one of those
columns will have a unique name. So a Pandas series can be thought of as a basic
building block for a Pandas DataFrame. A Pandas series represents one column. A
Pandas DataFrame has many columns. A Pandas series does not have a column
name. However, in a Pandas DataFrame, there is a column name for each
constituent series.

A lot of this will become more clear in the examples coming up ahead. For now, let's
go ahead and invoke the tail method.

The command is: random_series.tail()


On screen now you can see the output of the tail method. It consists of the last five
elements. These have the index values, which end with 49. So the head method is a
quick way of examining the first five elements in a Pandas series. The tail method is a
good way of examining the last five elements. What if you'd like to sample values
drawn at random from within the series? Well, then you make use of the sample
method.

That sample method takes in the number of values to be sampled. We can see this
on screen now. We've invoked the sample method and passed in one input
argument. That's the value 5.

The command is: random_series.sample(5)

And the output consists of five values which have been drawn at random from within
our data series. We can see on screen now the outputs of the head, tail, and sample
methods. When we compare the index values returned by these three methods, we
can see that head gave the first five index values, tail gave the last
five index values, and sample gave us five index values picked at random.

In each of these three cases, however, the data type or the dtype of the data within
our Pandas series was the same. All of these data elements are of type float64. Let's
now turn to another example, this time where the data type is going to be something
a little different. On screen now you can see that we have instantiated a Python list.
This Python list is called employees_list. Like any Python list, it consists of a set of
values enclosed within square brackets and separated by commas. Each of these
values is a string.

We can see that that is the case because each of those values is contained between a
pair of single quotes. Remember that in Python, strings can be enclosed within either
single or double quotes. That's the first command contained within the present cell.
There is a second command and that creates a Pandas series.

employees_list = ['Bruce', 'Tony', 'Steve', 'Chris', 'Ruth', 'Paula']

It does so by using the pd.Series method which takes in the employees list.

employees_series = pd.Series(employees_list)

And what's returned is a Pandas series which is stored in a variable. That variable
which is on the left hand side of the equal to sign is called employees_series.


This too is a Pandas data series. This also has just the one column, and all of the
elements within this column are of type string. What happens if we now examine the
contents of this variable? Let's find out. So we type out employees_series and then
hit Shift+Enter. And we can see that we now have the same combination of index
values and data values. As before the index values start from 0. The data values
correspond to the strings from our employees list.

What's interesting though is the dtype. That dtype now is object, because of course,
in Python strings are objects. You can see that the index values by default begin from
0 and increase by 1. But what if we'd like to have more control over the index
values? What if we'd like to specify the index values for our cells using another
Pandas series? Let's go ahead and see how this is done. On screen now in this cell,
we first define a new Pandas series, this is called emp_id.

The values in this Pandas series are also drawn from a random number generator
that's np.random. Remember that np is short for NumPy. But this time we are not
making use of randn. Instead, we are making use of randint. As its name would
suggest the randint method is going to return integer random numbers. We specify
minimum and maximum values.

So here the minimum value is 0 and the maximum value is 100 (randint's upper
bound is exclusive, so the values actually range from 0 through 99). And note that we
then specify the number of random numbers that we'd like generated. This is done
using the input argument called size, and this size input argument is given the value
of the length of the employees series. In this way, we are going to get exactly the correct
number of employee IDs; this correct number is going to be the length of the
employees series.

In case you're wondering why we have a name for the third input argument there:
that name, size, is only used to disambiguate the third input
argument. It's not used for the first or the second. And the reason for this is that
when the order of input arguments is clear, we don't need to specify names. The first
two input arguments into the randint function are the minimum and the maximum
values. And that's why we did not need to specify names in there.

However, after those first two input arguments, randint takes in many other
possible input arguments. In order to specify exactly which one of them we
had in mind, we had to disambiguate it using the name size. So once we've created
this Pandas data series and stored it in a variable called emp_id, we can examine the
value of the emp_id by just typing out the variable name.


He highlights two command lines in a new code cell

emp_id = pd.Series(np.random.randint(0, 100, size=len(employees_series)))

emp_id

If we now hit Shift+Enter, we can see that this emp_id consists of the numbers 2, 83,
48, 22, 97, and 64.

These six integers are of course of type int64, and that's why the dtype is int64. And
the index values for this series, as expected, range from 0 through 5. Now let's
use this new Pandas data series as the index of the employees series, which we had
created a moment ago.

He highlights two command lines in a new code cell

employees_series.index = emp_id

employees_series

You can see how this is done in the new commands on screen now.
employees_series.index is equal to emp_id. We then type out the variable name and
hit Shift+Enter in order to examine it.

And we can now see that the index values of the employees_series data series have
changed. They are no longer the default values which range from 0 through 5.
Instead, they have the randomly generated values which we stored in the emp_id
series up above. These are the same numbers in the same order 2, 83, 48, 22, 97,
and 64. Note also that there has been no change in the dtype of employees_series;
that remains object. We had made the point a moment ago that a Pandas series
consists of a single column and that column does not have a name.

But it is possible for a Pandas series to have names for the index. Let's see how this
works. On screen now, you can see a simple bit of code, which assigns a name to the
index property of employees_series. And that name is emp_id.

He highlights two command lines in a new code cell

employees_series.index.name = 'emp_id'

employees_series


And this actually has a little hint for us. Remember that employees_series had the
index, which itself was a Pandas series. And this tells us that it is possible to assign a
name to a Pandas series. The name that we have gone with in this instance is
emp_id.

Let's examine the contents of employees_series now. And we can see that the
output looks a little different. There is a name for the index and that name is emp_id.
It's also possible to examine the dtype corresponding to a Pandas series. This is a
simple command which you can see on screen now. We simply access the .dtype
attribute.

He enters the following command in a new code cell: employees_series.dtype

Because all of the values in our variable are of type object, the return value of the
dtype attribute is dtype('O'), with the O enclosed within single quotes.

And this tells us that the dtype is object. Because the Pandas series has just one
column, this returns just one dtype, and that dtype is object. And in upcoming examples,
we'll see how to examine or verify the contents of that dtype object. For now, let's
move on and explore some other ways of instantiating Pandas series. We'll get to
those in the demo coming up ahead.


4. Performing Casting Operations Using Series


Topic title: Performing Casting Operations Using Series. Your host for this session is
Vitthal Srinivasan

In this demo, we'll pick up from where we left off and examine some new ways of
creating and working with Pandas series objects. We've already mentioned that the
index of a pandas series is a unique identifier for each row. And that gives us a
similarity between a pandas series and a dictionary. On screen now you can see a
dictionary which consists of keys and values. A dictionary can be thought of as a set
of key value pairs. The keys are unique and each key uniquely identifies the
corresponding value.

He highlights the command in a new code cell. It reads: emp_data = {'Ronald': 22,
'Russel': 25, 'Lila': 20, 'Anthony': 23, 'Wendy': 27}

In the dictionary defined on screen now, we have keys corresponding to the names
Ronald, Russel, Lila, Anthony, and Wendy. A dictionary is defined in Python by
placing the key-value pairs between a pair of curly braces. Each key-value pair has the
key, followed by a colon, followed by the value; here, the keys are strings. We
have five key-value pairs. These have been placed within a pair of curly braces
and stored in a variable called emp_data.

We hit Shift+Enter, and our dictionary emp_data comes into existence. That was cell
14. In the next cell, we go ahead and instantiate a pandas series directly from this
dictionary.

The first command line is: emp_data_series = pd.Series(emp_data). The second
command line is: emp_data_series

And when we view the contents of this variable emp_data_series, we can see that it
is a pandas series with a difference. Now the values of the index property, that is, all
of the unique row identifiers, correspond to the keys in our dictionary. That is, they
correspond to the strings Ronald, Russel, Lila, Anthony, and Wendy.

The values contained within our series are the integers, that is the values from the
dictionary, and they are all of dtype int64. Next, let's see how we can assign a name
to a pandas data series as a whole. On screen now you can see that we've made use
of the .name property of the emp_data_series.


The first command line is: emp_data_series.name = 'name_age_info'. The second
command line is: emp_data_series

And we've assigned this the string name_age_info. We then examine
emp_data_series as usual. When we hit Shift+Enter, we see that we still have the
index values corresponding to the names.

The values within the series itself correspond to those integers. And then down
below we have a new property: the name. And the name is exactly what we
specified up above; it's the string name_age_info. Let's now move on and
experiment with some other ways of working with pandas series. On screen now
we've created two series, series_A and series_B. Each one of these has been
instantiated from a list.

series_A = pd.Series([100, 200, 300, 400, 500])

series_B = pd.Series([200, 300, 500.6, 700.3, 900.8])

The first one, that's series_A, has been instantiated from a list in which every
element is an integer. Those are the values 100, 200, 300, 400, and 500. series_B
looks very similar, but some of the values that it has been instantiated from are not
integer values, they are floating point values. These include, for instance, 500.6,
700.3, and 900.8. Let's instantiate these two pandas data series objects by hitting
Shift+Enter. And then let's verify these two series. We start with series_A, just type
the name of the variable and hit Shift+Enter.

And you can see that this has all of the values we expect, along with the dtype of
int64. This is in cell 18. In the next cell, let's go ahead and type series_B and repeat
the process. We can see that the type is now float64. What's more, even the integer
values contained within this data series now appear in floating point format. For
instance, the element at index position 0 is stored as 200.0. And similarly, at index
position 1, we have 300.0.

What we can infer from this is that pandas is pretty smart about figuring out the right
dtype that makes sense for all elements in a pandas data series. With series_A, all of
the values in there were of type integer, and that's why the correct dtype, the most
restrictive dtype which worked for every element, was int64. In the case of series_B,
we had two integer values, 200 and 300, but then we had three floating point
values. And so the most restrictive dtype which still worked for every element was
float64.
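Here is a compact illustration of that inference, as a minimal sketch with illustrative values:

pd.Series([100, 200, 300]).dtype      # dtype('int64'): every element is an integer
pd.Series([100, 200, 300.5]).dtype    # dtype('float64'): one float promotes the whole series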


And that's why the dtype of series_B is float64 while that of series_A is int64. By this
point we have a good handle on the different ways of instantiating pandas data series
objects. Let's now dive into a series of examples which show how such pandas
data series objects can be worked with. To begin with, let's check the .empty property. This
is a way of testing whether a particular series is empty or not. On screen now you
can see series_A.empty returns the value of False.

He enters a new command in the next code cell. It reads: series_A.empty

Next, let's see how we can go from a data series to a NumPy array.

Notice on screen now in cell 17, that both series_A and series_B have been
instantiated from Python list objects. Let's go ahead and reverse that operation. So
on screen now you can see a new variable called series_A_array.

series_A_array = series_A.to_numpy()

series_A_array

This is the name of the variable which is on the left hand side of the equal to sign. On
the right hand side, we invoke the .to_numpy method on our series_A object. And
the return value is a NumPy array. This NumPy array encapsulates a list.

That list consists of the integers 100, 200, 300, 400, and 500, enclosed within a pair
of square brackets. This little example on screen now is an important one. It shows
the difference between three Python constructs. We have the simple Python list,
which consists of values enclosed within a pair of square brackets. Then we have a
NumPy array, which wraps around such a list. And then we have a pandas data series.
NumPy arrays are great for carrying out matrix multiplication or other matrix
operations.

Pandas data series are great for when you want a relational representation of your
data. That is, the data organized into a single column where each row in that column
can be identified using the index. Let's move on and try out yet another method
available on a series object, this is the .values method.

series_A_array = series_A.values

series_A_array


We can see here that we invoked the .values method on series_A. And this has the
same return value as the .to_numpy method up above. This too returns an array, and this
array encapsulates a Python list. Now the focus in these examples is on the pandas
series object, but you should be aware that a NumPy array is referred to as an ndarray.

ndarray is the standard abstraction in NumPy, just as a DataFrame or a series is the
standard abstraction in pandas. You should also be aware that if you'd like to get
hold of data in the form of an ndarray from a pandas data series, you are advised to
use the .to_numpy method rather than the .values method. to_numpy is just a lot
clearer: you know that you are asking for a NumPy ndarray. Let's invoke the
.dtype method on our series. And once again, because this is a series, there's just
one dtype in there and that is int64. It's also possible for us to cast a pandas series
from one type to another.

series_A_float = series_A.astype('float64')

series_A_float

This is done using the .astype method. So we are invoking the .astype method on
series_A and we are passing in a string representation of the new dtype. This is the
dtype float64. It's important to note that this does not change the dtype of series_A
in place. Rather, it returns another pandas series which has the required dtype. And
that's exactly what's stored in the variable series_A_float. That's on the left hand side
of the equal to sign.

And when we examine the contents of series_A_float, we see that this pandas data
series has the exact same data values, but it has a different dtype. The dtype is now
float64 rather than int64. Also note how each one of the data values appears with a
decimal point and the 0 after that decimal point. The .astype method is a great way
of casting a pandas series from one type to another. Let's now turn our attention
back to series_B.

series_B.round()

series_B contained various floating point values. You might recall that those floating-
point values were close to round numbers. Let's see how we can make them perfect
round numbers using the .round method. You can see this on screen now, we invoke
the .round method on series_B. And when we run this using the Shift+Enter
keystroke combination, we get five values, each of dtype float64. But what's
interesting now is that all of those values have a 0 after the decimal point. So the


.round method has had the effect of rounding away the fractional components of
our data values: 500.6 becomes 501.0, for instance. It has not had the effect of
changing the dtype to int64 from float64. So round gets rid of the fractional
portion, but it retains the floating point type.

Of course, you could always invoke the .astype method after applying the round
operation, as the sketch below shows. You could specify the int64 data type, and
that would convert all of these rounded values, and that would give you a pandas
data series of dtype int64.
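A minimal sketch of that chained call, assuming series_B as defined above:

series_B.round().astype('int64')    # round first, then cast: the dtype becomes int64

Let's move on from casting operations to logical operations.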

series_B.lt(400)

We'll do that in the demo coming up ahead.


5. Performing Logical Operations Using Series


Topic title: Performing Logical Operations Using Series. Your host for this session is
Vitthal Srinivasan

At the end of the last demo, we introduced on screen our first logical operation
involving a pandas data series. Here it is on screen now. We have invoked the .lt
method on our data series series_B. As the name lt would suggest, this is going to
take in a value and return True or False values corresponding to whether every value
in the series is less than the value we specified. In other words, lt is a predicate; it's a
condition which is used to filter values from our original pandas data series.

Let's hit Shift+Enter to execute this lt command. And you can see that the return
value is yet another pandas data series, this time consisting of only True and False
values. See that the predicate returned the Boolean value True for the first two
elements, those at index positions 0 and 1, and it returned the value False for the
subsequent index positions. Notice also how the dtype of the return value is bool.
Let's go ahead and try some other predicate operations. The mirror image of .lt is,
you guessed it, .gt: lt is short for less than, gt is short for greater than.

series_B.gt(300)

On screen now, we've invoked the .gt method on series_B and we've passed in a
value of 300. We hit Shift+Enter, and as before, we get back a series of True and
False values. You can see these are the mirror images of the values we had up above,
the first two values are False. And that's because the first two elements are not
greater than 300, the subsequent values return True because each of those is indeed
greater than 300.

We can also see that the dtype is bool. On screen now, we have three operations.
These are in cells 25, 26, and 27, the first of these was the .round operation, the
second was the .lt, and the third was the .gt. We can see that the return values of
each of these operations is a pandas data series. So many operations performed on a
pandas data series will return a transformed pandas data series. Let's move on to
another search operation, this one is the isin operation.

Here, we performed series_B.isin and we passed in series_A as the input argument.
The command is: series_B.isin(series_A)

We execute this using Shift+Enter, and we see that this too has a return type of
pandas data series, and the dtype of the returned series is bool. The .isin method


performs an element-wise check. This element-wise check is carried out for each
element in the first object, that is, in series_B. Each element in series_B is going to be
tested to see if it's contained in series_A.

Here, note that the first two elements return True, because those first two
numbers were present in both series_B and series_A. The .isin operation was also
smart enough to handle the difference in dtypes: series_B consists of float64 values
and series_A of int64 values, but the values were cast appropriately while performing
the comparison. So the .isin method is yet another transformation operation on a
pandas data series. Let's now turn our attention to
some reduce operations. On screen now is the .agg method. This takes in a string
which tells us what kind of aggregation we'd like to perform on a pandas data series.

series_A.agg('min')

A moment ago, I referred to this as a reduce operation. And that's the term that's
used for such operations which take in a series or a set of numbers and compress or
reduce those into a single value. Here, this is an aggregation operation of type min.
The min, that's the smallest number in series_A is 100, and that's why that's the
return value we see when we hit Shift+Enter.

Let's try out some other aggregation operations. Now on screen you can see that
we've re-invoked the .agg method, but this time with the value passed in of average.

series_A.agg('average')

And when we execute this, we can see that the average value in series_A is 300. Also
note that the average is going to sum all of the numbers in the series and then divide
by the count of numbers. And that's why the result is of floating point type,
and that's why the result has a decimal point and a 0 after the decimal point.

Under the hood, any reduce operation is going to need to touch or iterate over every
element in the pandas series. And so it's often very efficient to perform multiple
aggregation operations all at once.

series_A.agg(['min', 'max'])

This is exactly what we're doing on screen now: we've passed in a list of aggregation
operations. That list has two elements, min and max. Doing it this way is more


efficient than separately performing the min and the max operations. Because in
doing it this way, we only need to iterate over the elements in our data series once.

The return type of this is also a pandas data series. We can see that we have indexes,
which are the strings min and max, and we also have the dtype, which is int64.
Next, let's turn our attention to another interesting operation, this is the append
operation.

appended_series = series_A.append(series_B, ignore_index = True)

appended_series

Append is a way to concatenate two pandas data series. Here you can see that we've
invoked the append method on series_A. The first input argument into the append
method is series_B, the second input argument is a named input argument.

This is named ignore_index and we've specified the value True. Finally, note that the
append operation is not performed in place. In other words, series_A is not going to
be modified. Rather, the concatenated data series is going to be returned. We store
this in a variable called appended_series. When we run this code and examine the
contents of appended_series, we note a bunch of different interesting points. First
off, we can see that the values include all of the values in the two data series without
any attempt at eliminating duplicates. For instance, we have the value 200 contained
at index position 1, as well as at index position 5, and that's because 200 appeared in
both of the two data series.

The next interesting bit to note is the dtype of this appended_series, it's float64.
Once again, pandas has done a smart job of figuring out the right dtype. This is the
dtype which is as restrictive as possible, but which still works for every element in
the result. And that's why the return type has a dtype of float64. The index values
are also worth looking at. You can see that these start from 0 and go on up to n
minus 1, where n is equal to the total number of elements in the appended_series.

The reason that we have these new and re-computed index values is because we
specified ignore_index to be equal to True. Had we set ignore_index equal to False,
or had we just not specified the value of ignore_index, the returned data series would
have the same index values as the two input data series. So we would have had
duplicate index values. That in turn leads to yet another question, is it acceptable, is
it allowed for a data series to contain duplicate index values?


The answer is, well, it's actually allowed, but it's definitely not recommended. And
you should be aware that the append method has yet another input argument called
verify_integrity. By default, verify_integrity is set to False; we've not specified it here,
and so the integrity check is not performed. However, if we do have duplicate index
values and we set verify_integrity to True, then an error results, because the check
will not allow the use of duplicate index values. A minimal sketch of this appears below.
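Assuming series_A and series_B as defined above, appending without ignore_index carries the duplicate labels 0 through 4 into the result, so the check fails:

series_A.append(series_B, verify_integrity=True)    # raises ValueError: the indexes have overlapping values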

Next, let's turn our attention to the drop method, which is great for eliminating
specific index values
from a data series. On screen now, you can see that we've invoked the .drop method
on the appended_series object.

appended_series.drop(labels=[0, 9])

The .drop method is going to drop specific elements from within the data series.
Which elements? Well, the elements whose labels are specified in a
list; that list has two elements, 0 and 9.

When we run this code, we can see the return value no longer contains indexes, 0
and 9. Now, what if we'd like to count the number of elements in a data series? Well,
it's simple, we simply invoke the .count method. You can see this done on screen
now, and the return value is 10.

appended_series.count()

This also confirms that the operation in cell 33, that's the drop operation, did not
actually modify the variable appended_series. In cell 33, we invoked the drop method
and we dropped two index values, 0 and 9; the return from that operation had the
index values 1 through 8.
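To actually keep the trimmed result, a minimal sketch would assign the return value to a variable; the name here is illustrative:

trimmed_series = appended_series.drop(labels=[0, 9])
trimmed_series.count()    # 8, since two of the ten elements were dropped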

But then in cell 34, when we re-invoke the count method, we still have 10 values. In
cell 33, when we invoked the drop method, the returned value was a data series; that
data series was not stored in a variable. We simply executed that statement, and the
return value is what got displayed in the output of cell 33. In any case, we are now
done with various helpful, handy operations on pandas data series, and we will now turn
our attention to pandas data frames. So far, we've only worked with data series,
which consist of individual columns.

empty_df = pd.DataFrame()


empty_df

A data frame builds on the data series abstraction. A data frame, which is of course
the most important abstraction in pandas, consists of multiple data series, and each
constituent data series has a unique name as well as its own type. All of that will be
introduced in the demo coming up ahead.


6. Creating and Editing pandas DataFrames


Topic title: Creating and Editing pandas DataFrames. Your host for this session is
Vitthal Srinivasan

At this point we are very familiar with pandas data series objects, and it's time to
introduce pandas data frames. On screen now, you can see our first pandas
DataFrame, which is called empty_df. This is an empty data frame we've instantiated
by simply calling the pd.DataFrame method. No data values have actually been passed
in, and that's why this is an empty data frame.

Let's hit Shift+Enter and execute this code, we can see that the output of cell 35 is
simply an underscore. This is Python's way of telling us that this is a data frame
which has neither rows nor columns. Let's get a little more elaborate information,
and we can do this by using the print function.

print (empty_df)

This is a built-in Python function which can be invoked on pretty much any data
value, whether it's an object or not. When we print the empty data frame, we
get somewhat more useful information. We can see that pandas describes our data
frame as an Empty DataFrame.
What's more, our empty data frame has a list of columns and a list of index values.
Now, it so happens that the list of columns is empty and the list of index values is
empty as well. But even so, this is a hint that a data frame consists of rows and
columns.

The rows are going to have unique identifiers which are contained within the Index
list. And the columns are going to have unique names and types, which are
contained within the Columns list. This tells us why a data frame is really the key
abstraction in pandas. After all, this is exactly how data is laid out in relational
database tables. It's also how data is laid out in CSV files, in the form of named rows
and columns. Let's go ahead and continue exploring this idea. Let's start by creating a
new list. This list is called employees and it consists of a set of employee names.

employees = ['Peter', 'Adam', 'Douglas', 'Henry', 'Jose', 'Aaron']

Like all Python lists, this is delimited by a pair of square brackets, elements within the
list are separated by commas. Once we've created this list, let's make use of it while


instantiating a data frame. This data frame is called employees_df, and it's
instantiated in an interesting fashion.

employees_df = pd.DataFrame({'name' : employees})

employees_df

We invoke the pd.DataFrame method, we've invoked this before, but now we pass in
a dictionary. That dictionary is enclosed within a pair of curly braces, and that
dictionary has a single key value pair. The one key is the string name, and the one
value is the list, employees. That's the list we just created in the previous cell. Let's
go ahead and run this code using Shift+Enter. When we do this, we get to examine
the value of employees_df. This is our first fully formed and functional data frame.

This data frame has just the one column, and that column is called name. What's more,
each of the rows in this data frame is identified via an index. As before, this index
starts from 0 and goes up to n minus 1. Here our data frame has six rows, and
that's why the index values by default are 0 through 5. This tells us that a data frame
can be thought of as a dictionary. Every key in that dictionary corresponds to a
column name. And every value in that dictionary corresponds to a set of column
values. And that's why it's so easy to instantiate a pandas data frame from such a
dictionary, that's what's being done in cell 38.

We've understood how a data frame relates to a dictionary, let's now see how a data
frame is connected to a data series. On screen now, we've examined or displayed the
value of a series which we had previously created, this is called employees_series.
You can see that when we hit Shift+Enter in cell 39, we get the result,
employees_series consists of a set of employee names, each of these identified by a
corresponding unique index. We had given a name to the series and that was
emp_id. Let's use the series in order to create a pandas data frame. We'll do this
using the employees_series.to_frame method.

employees_df = employees_series.to_frame('emp_name')

employees_df

So again, we are invoking the .to_frame method on the employees_series object.
We've got to specify an input argument into this .to_frame method. That input
argument is the name of the column that we would like our data frame to contain.
And that name is emp_name.


We've saved this in a variable called employees_df. And when we hit Shift+Enter, we
can see the contents of employees_df. This is a data frame, you can see how the
output is displayed with slightly different formatting from that of a data series. This
data frame has one column and that column is called emp_name. The values in that
column of course correspond to the values in the series. And the index values are the
same as well. So we've now learned yet another way of creating a data frame, this
time from a data series. Once again, do please note the difference in the formatting.

Note how the output of cell 39 is quite differently formatted from the output of cell
40. And that's because the output of cell 40 is a data frame and it's formatted in clear
tabular format with rows and named columns. What's common between the outputs
of cells 39 and 40, however, is the values in the index field. These are the integers 2,
83, 48, 22, 97, and 64. You might recall these were randomly generated index values
which we had specified explicitly, these are not the default index values. Let's now
see how we could reset these. On screen now we have invoked the .to_frame
method on our employees_series.

employees_df = employees_series.to_frame('emp_name').reset_index()

employees_df

So the first part of this cell is the same as before. But what's different is way over on
the right, that's the reset_index method. In effect, we are recreating our
employees_df data frame from the employees_series variable. And as before, we are
doing this using the .to_frame method.

But this time we are requesting pandas to reset the indexes when our data frame is
created. If we hit Shift+Enter and examine the results, we can see that the index
values now take the default values, that is 0 through n minus 1. But wait, there's
something else interesting about the output of cell 41 as well. This output now has
two columns, not one. And those two columns are called emp_id and emp_name.
The interesting bit is that emp_id contains all of the index values from the original
series.

Pandas was smart enough to figure out that if we invoked .reset_index and did not
keep the old index values, we would be losing information. Pandas wanted to be
sure that we did not lose this information. And that's why the resulting data frame
this time around has two columns rather than one. And the first of those two
columns contains the index values from the underlying pandas data series. Let's now
come back to the idea of instantiating a data frame from a dictionary. On screen now


you can see that we have a dictionary called location_data. This dictionary has three
keys. These correspond to the column names, Name, City, and State.

location_data = {
    'Name' : ['Alice', 'Patty', 'Paula', 'Lily', 'Ryan', 'Larry'],
    'City' : ['Phoenix', 'Houston', 'Chicago', 'Los Angeles', 'New York', 'San Diego'],
    'State' : ['Arizona', 'Texas', 'Illinois', 'California', 'New York', 'California']
}

Each of these keys has a value, and that value in each case is a list. So for the key
Name, we have a list of names. For the key City, we have a list of cities. And of course
for the key State, we have a list of states. Now that we have this dictionary, we can
go ahead and instantiate a pandas data frame from it. The code to do this is visible
on screen now, we invoke pd.DataFrame.

df_location = pd.DataFrame(location_data, index=[10, 20, 30, 40, 50, 60])

df_location

The first input argument is the dictionary location_data. The second input argument
is an interesting one, this is a list of index values. So we are explicitly requesting this
data frame to have the index values 10, 20, 30, 40, 50, and 60. So we create this data
frame and store it in a variable called df_location. When we print that value out to
screen, we can see that we have a nice pandas data frame which has three columns.

These columns correspond to the names, the cities, and the states. And what's more,
the indexes, that is, the unique row identifiers, are the values we specified, that is, 10
through 60. And in this way we have instantiated a data frame from a dictionary.
That dictionary had a set of keys. Each of those keys became a column name in our
data frame. That dictionary had corresponding values. And what was important was
that all of those values were lists, each list had the same number of elements.
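As a hypothetical aside, if those lists did not all have the same number of elements, pandas could not form that rectangle, and the constructor would raise an error:

pd.DataFrame({'Name': ['Alice', 'Patty'], 'City': ['Phoenix']})    # raises ValueError: the arrays must all be the same length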

And this is why we were able to create a pandas data frame which had a nice
rectangular shape. Let's go ahead and perform some operations on this data frame.
Let's start by getting hold of the index values. >> This is simple enough, we use
the .index method, which we have encountered previously while working with
pandas data series.

df_location.index


The return type of df_location.index is a built-in pandas type called Int64Index. This
encapsulates a list, that list has all of those values, 10, 20, 30, 40, 50, and 60. It's also
got a dtype, and that dtype is int64. So we can use the .index property in order to get
hold of the unique row identifiers for each row in a pandas data frame. In similar
fashion, we can get handles to the columns in the data frame using the .columns
property.

df_location.columns

This is what we've done in cell 45. The return value is also a built-in pandas type
named Index. This also encapsulates a list. That list has the names of our columns,
Name, City, and State. And it has the dtype object. We can see the symmetry in how
pandas deals with rows and columns. And once again, a data frame, which is the core
abstraction of pandas, is meant for data organized into rows and columns. Each row
is uniquely identified using its index. Each column is uniquely identified by the
column name. And we can get hold of the individual row as well as the column
identifiers, as you can see on screen now.


7. Performing Data Lookup Using DataFrames


Topic title: Performing Data Lookup Using DataFrames. Your host for this session is
Vitthal Srinivasan

In this demo we will continue with our exploration of methods available on
DataFrame objects. At the end of the last demo we had made use of the .index and
.columns properties in order to get access to the index, that is, the set of unique
row identifiers, and the columns contained within a DataFrame. Now on screen you
can see that we've invoked the .size property on our DataFrame. df_location.size
returns the total number of data values, that is, the number of rows multiplied by the
number of columns. So that 18 corresponds to the number of data values in the frame.

As we can see up above in the output of cell 44, we have six rows. Then in the output
of cell 45, we can see that we have three columns. So three multiplied by six is 18.
And that's why df_location.size returns 18. Another important property is the shape property. df_location.shape returns a tuple. This tuple has two elements, and the values are (6, 3). 6 corresponds to the number of rows, and 3 corresponds to the number of columns.

If you work with NumPy arrays, that is, ndarrays, you would know that the shape of a NumPy ndarray is a very important property. We've often got to reshape arrays in order to perform operations such as matrix multiplication. For DataFrames, the shape is relatively simple, because the data is always laid out into rows and columns. And that's why the shape is always going to consist of two values, as you see on screen now.
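As a minimal sketch of these two properties, assuming the df_location frame built above:

df_location.size   # 18, the total number of values: rows multiplied by columns

df_location.shape  # (6, 3), six rows and three columns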

We've already encountered the head and tail methods while working with data series. These work with dataframes just as well. If we invoke the .head method on our dataframe, we get the first five rows.

df_location.head()

Here, our dataframe has a total of six rows. And so when we invoke the head method, we get the first five of those six rows. Note that head and tail select rows by their position in the dataframe rather than by sorting; here the first five rows also happen to carry the five smallest index values because our index is in ascending order. Both methods also accept an optional row count, so head(2), for instance, would return just the first two rows.

In similar fashion we can invoke the .tail method, this returns the last five rows.


df_location.tail()

We can see by comparing the outputs of cells 48 and 49 that the first five rows (here, the five smallest index values) are returned by head, and the last five rows (the five largest index values) are returned by tail. Let's now turn our attention to
lookup operations. These are really powerful and important. We begin by looking up
a specific column.

We specify the column name, Name, and we do this lookup by specifying the value of a particular index, that is 30. Here we've uniquely identified both the row and the column that we care about; note the format which we've used.

df_location['Name'][30]

We've got square brackets. The first pair of square brackets contains within it the column name, which is Name. The second pair of square brackets contains within it the index value, which is 30. Please remember this syntax; it is important in order to get the most out of Pandas DataFrames.
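One caveat worth adding, which the demo itself does not mention: this double-bracket form is known as chained indexing, and pandas generally recommends the equivalent single lookup through the loc accessor, which we will meet shortly. A minimal sketch:

df_location.loc[30, 'Name']   # same result as df_location['Name'][30], in a single lookup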

When we hit Shift + Enter, we get the return value of Paula, because that is the value contained in the Name column for the row where the index is 30. And if we'd like to figure out the city for the same index value, we just change the name of the column from Name to City. You can see that down below:

df_location['City'][30]

When we hit Shift + Enter, this time we get Chicago. What if we'd like to get multiple columns?

Well, that's easy enough. We've again used df_location, but this time we've specified a list, and that list has two elements, Name and State:

df_location[['Name', 'State']]

This is also the first time we are looking up a dataframe using two pairs of square brackets. Notice how we have two opening square brackets and two closing square brackets. The reason for that is that we've specified the input argument to be a list, and the inner pair of square brackets is actually just delimiting the list of column names.

Folks are often confused by when to use one pair of square brackets and when to use
two pairs. Well, the answer is easy enough. If you'd like to specify an input argument,
which is a list, then clearly you've got to make use of two pairs of square brackets.
Let's run this by hitting Shift + Enter and we now get the Names and States for every
row in our data. Why did we get this for every row? Well, because we simply did not
specify any filter on the index value.


If we compare the forms of cells 50, 51 and 52, we can see that in cells 50 and 51, we
have specified a set of column names, as well as a set of row indexes. In both of
those cases, we had just one column and one row index. In contrast, in cell 52, we
specified multiple column names, Name and State, but we did not specify any row
index at all. And that's why our result returned those multiple columns, but for all
rows in our dataframe.

Let's now introduce one little bit of new terminology. The value of the index for a
particular row is called its label. So for instance, the label for the first row is 10, the
label for the second row is 20, and so on. Now that we've understood what labels
are, let's turn our attention to a really important function. This is the .loc function.

df_location.loc[10]

Loc is one of the most commonly used functions in Pandas and it's used in order to
access a group of rows and columns using labels.

It can also be used along with Boolean arrays. We'll get to that in a bit. But for now
let's focus on using loc along with a label value. On screen now you can see that
we've invoked the .loc method and we have used square brackets in order to specify
a single label, that label is 10. If we now hit Shift + Enter, we are only going to get that one row. And that row is displayed in a nice form where we have the names
of the columns, that's over on the left, and then the values for those columns over
on the right.

So the Name is Alice, the City is Phoenix, and the State is Arizona. Note that because we selected a single row, the result comes back as a Series rather than a dataframe. It has a dtype, which is object, and it has a name, and that name is the label 10. What if you'd like to specify multiple label values? Well, it's easy enough. We include those in a list.

df_location.loc[[20, 50]]

We've now got a list which has two values, 20 and 50. That list is delimited using
one pair of square brackets. This is then passed into the .loc method, which has its
own pair of square brackets, and that's why we have two pairs of square brackets.

We have two opening square brackets and two closing square brackets. The label values that we are interested in are 20 and 50. We've not specified any column restrictions, and that's why, when we hit Shift + Enter, we are going to get all columns for these two rows. So we have the Name, City, and State corresponding to the labels 20 and 50. Let's now move on and try yet another important function called .iloc.

df_location.iloc[1]

Where the loc function takes in a specific label value, the iloc function takes in a specific integer position. So here, when we invoke df_location.iloc[1], we are asking for the row which is at index position 1. And remember that index positions start from zero, which is why this is going to be the second row in our data. The second row has a label of 20, and that's why we can see that the name of the returned Series is 20. As before the dtype is object, and as before, we have the Name, City, and State. The Name is Patty, the City is Houston, and the State is Texas. iloc differs from loc because it takes in integers rather than specific label values.

Also remember that those integers are indexed starting from zero, and they go up to length minus 1 of whatever axis you happen to be querying. Let's now try an iloc which has multiple index positions.

df_location.iloc[[3, 4]]

And the moment we want multiple index positions, we need to pass them in as a list, which gives us two pairs of opening and closing square brackets. So here we are invoking the iloc method. We are passing in two index positions, [[3, 4]]. And when we execute this, we get two rows from our underlying data. One other interesting bit here.

Contrast the output of cell 55 with that of cell 56. Cell 55 had iloc with just one input position specified, and that's why the output appeared in the form of key-value pairs. Cell 56, on the other hand, was also an iloc, but it took in two input positions, and that's why the return value was a dataframe. This shows that pandas is smart enough to figure out what is being returned: a single row comes back as a Series, which is a different kind of object than a dataframe, while multiple rows come back as a dataframe.
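Before moving on, here is a short sketch, not shown in the demo, of a few related forms which follow the same rules; each line assumes the df_location frame from above:

df_location.loc[20, 'City']                       # a label plus a column name returns 'Houston'
df_location.loc[20:40]                            # label slices include both endpoints
df_location.loc[df_location['State'] == 'Texas']  # loc with a Boolean array, as promised earlier
df_location.iloc[1, 2]                            # row position 1, column position 2 returns 'Texas'
df_location.iloc[0:2]                             # position slices exclude the stop value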


8. Constructing DataFrames from Dictionaries and Tuples


Topic title: Constructing DataFrames from Dictionaries and Tuples. Your host for
this session is Vitthal Srinivasan

In the previous demo, we understood how the loc and iloc methods work on pandas
DataFrames. In this demo, we'll change track just a little bit and we will focus on
different ways of instantiating a DataFrame. We'll come back to the loc and iloc
methods on data frame objects, but for now let's segue into something slightly
different. Let's begin by instantiating a list of dictionaries. emp_info, which you can
see defined here, is a list.

He enters commands in the next code cell

emp_info = [{'ID' : 1, 'Name' : 'Cathy', 'City' : 'Austin'},

{'ID' : 2, 'Name' : 'Simon', 'City' : 'San Jose'},

{'ID' : 3, 'Name' : 'Ian', 'City' : 'Chicago'}]

We know it's a list because it has opening and closing square brackets. The individual
elements within this list are dictionaries. Each of those individual elements is
delimited by a pair of curly brackets. Pay close attention to the keys contained in
each of these dictionaries. Each of these dictionaries has as its first element the key ID, as its second element the key Name, and as its third element the key City. So emp_info
is a list of dictionaries, and all of those dictionaries have the same set of keys. Keep
this in mind as we move on to the next command. We use this emp_info in order to
instantiate a DataFrame. This is simple enough, we simply invoke pd.DataFrame and
pass in this list of dictionaries.

df = pd.DataFrame(emp_info)

df

We can see when we examine the contents of this DataFrame that it looks exactly
like any other DataFrame. It has the named rows and the named columns. What's
interesting is that the keys from the original dictionaries have become the column
names of our DataFrame. In addition, pandas has also generated labels for each of
these rows. As expected, these begin with a default value of 0, so we have the labels 0, 1, and 2. And the columns are ID, Name, and City. Let's go ahead and perform a few other similar operations.

We begin first by examining emp_info. Once again, we can see that this is a list of
dictionaries, and all of those dictionaries have the same keys. Then we again create a
DataFrame from this list of dictionaries. But this time we specify a list of index values,
that is, a list of labels.

df = pd.DataFrame(emp_info, index=['a', 'b', 'c'])

df

We'll do this using the named input argument index, and the label values which we
specify are a, b, and c. Let's run this code and see what we get. Well, it's a data frame
where the index values are indeed a, b, and c. We've overridden the default index
values of 0, 1, and 2. So remember while instantiating a DataFrame using
pd.DataFrame, you simply specify a named parameter called index, which has a list
of the index values. Let's try another little experiment. This time we will override the
default column names. Once again we begin by verifying that emp_info is a list of
dictionaries. Then we instantiate a DataFrame, we call it df_one.

What's different is that this time around we have another named input argument,
that's the columns input argument. So columns is a list, and that list has the three
elements, ID, Name, and City.

df_one = pd.DataFrame(emp_info, index=['a', 'b', 'c'], columns=['ID', 'Name', 'City'])

df_one

We also have the index named input argument with the labels a, b, and c. We hit
Shift+Enter and this code executes. And we can see that in this way we can override
or control the column names just as easily as we can control the labels. Let's see
what happens if we tweak the column names.

We now have df_two. We have a different set of column names. We've now decided
to call these ID, Name, and State instead of ID, Name, and City. But wait a minute,
you might say at this point, those values of Austin, San Jose, and Chicago are
legitimate city names but they are not legitimate state names. What will pandas do
in this case? Let's find out.


df_two = pd.DataFrame(emp_info, index=['a', 'b', 'c'], columns=['ID', 'Name', 'State'])

df_two

We hit Shift + Enter, and we find something really interesting. We find that the
column titled State does appear in our pandas DataFrame. But all of the values in
there are populated with NaNs. How is pandas smart enough to figure out that San
Jose is not a legit state name? Well, the answer is simple enough. Pandas was
looking at the keys in the dictionaries, and those keys were ID, Name, and City.
There was no key with the string State.

So even though we insisted that we wanted a column called State, there were no
keys called State in the underlying dictionaries inside emp_info. And notice that
emp_info is the first input argument into pd.DataFrame. Also notice that this
DataFrame simply does not contain the city information. And that's because we did
not include City in the list of columns while instantiating our DataFrame. This is a
really important example because it shows us how pandas keeps track of rows and
columns in the underlying data. And this is done using the row and the column
names.
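Incidentally, the same NaN-filling behavior appears if the dictionaries in the list themselves have mismatched keys. A minimal sketch, not from the demo:

pd.DataFrame([{'ID': 1, 'Name': 'Cathy'},
              {'ID': 2, 'Name': 'Simon', 'City': 'San Jose'}])

# The first dictionary has no 'City' key, so that cell is filled with NaN.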

Let's move on to yet another example, this one making use of an interesting function called the zip function. Let's check this out in action. We begin by defining two lists. The first of these is emp_name, the second of these is state.

emp_name = ['Barbara', 'Frank', 'Bhadrak', 'Gregory', 'Nikki']

state = ['California', 'Illinois', 'Texas', 'Arizona', 'Pennsylvania']

Now that we have these two lists, we'd like to create a data frame from these two
lists.

data = list(zip(emp_name, state))

data

The easiest way to do this is by first combining those two lists into one list. Each element in the combined list is going to be a tuple. And this is where the zip
function comes into its own.


You can see it on screen now. zip has been invoked with the input lists emp_name and state. The output of zip, once converted using list, is a list of tuples. Each tuple has its first element taken from the first list, that's emp_name, and its second element taken from the second list, that's state. So we have Barbara who's from California, Frank who's from Illinois, and so on. Also note how we make use of the list function after invoking zip. This bit of syntax, which you see in cell 65, is a really common Python idiom, so you should be aware of how it works.

You have multiple lists, you combine them using zip, and then you apply the list function to get a list of tuples. And once we have a list of tuples, we can construct a
pandas DataFrame from it.

df = pd.DataFrame(data, columns=['EmpName', 'State'])

df

This is done here in cell 66. We've made use of pd.DataFrame. The first input argument is data, that is, the list of tuples which we created in the cell up above. Notice that this time all of the data consisted of lists; there were no dictionaries, and hence no keys to serve as column names.

And that's why we've got to specify a second input argument, that's the named input
argument columns. Columns is also a list with the two values EmpName and State.
We can see the output of cell 66 down below, and that is a perfectly formatted
pandas DataFrame. The first column is called EmpName, the second column is called
State. These correspond to the columns specified in the named input argument
columns. What's more, each of the rows in this DataFrame corresponds to one tuple
from our zip operation.

So the first row, which has label 0, corresponds to Barbara from California. In similar
fashion, we can see that the second row corresponds to Frank from Illinois, and so
on. So all of the rows which we obtained via the list zip operation have been
converted into rows of our pandas DataFrame. We've now seen how to construct
pandas DataFrames in many different ways. We did this using just lists, lists which
had been combined using zip, a list of dictionaries. And now let's also see how to
instantiate a DataFrame from underlying pandas series.

employee_details = {'Name' : pd.Series(['Becky', 'Lily', 'Laura', 'Roy'], index=[1, 2, 3, 4]),
                    'City' : pd.Series(['San Francisco', 'Indianapolis', 'Seattle', 'Denver'], index=[1, 2, 3, 4])}

On screen now we have a variable called employee_details. That's on the left hand side of the equals sign. On the right hand side we have a dictionary. That
dictionary has two keys. These correspond to the column names, Name and City.
Each of those keys is then associated with a corresponding value and that value is a
pd.Series object. We've constructed that pd.Series object in turn using values as well
as indexes. So the first pd.Series has the names Becky, Lily, Laura, and Roy, and the
index values 1, 2, 3, 4.

In similar fashion, the second value in our dictionary is also a pd.Series. This pd.Series
consists of a list of cities, that's San Francisco, Indianapolis, Seattle, and Denver. And
importantly, the same index values are repeated. So once again we have the index
values 1, 2, 3, 4. So we've now got a dictionary. The keys in the dictionary correspond
to column names. The values in that dictionary correspond to pandas series objects.
And we can now construct a pandas DataFrame from this. No additional information
is required, we simply invoke pd.DataFrame with employee_details.

df = pd.DataFrame(employee_details)

df

And the return value, we can see that in the output of cell 68, is a nice DataFrame
with the two columns Name and City. These are the keys from our dictionary in cell
67. And then each row has the labels that we specified in that dictionary. So the
labels are 1, 2, 3, and 4. These same labels appear in both of the Series values in cell 67. All of those values needed to be in sync. And that's what made it so easy for pandas to
interpret this dictionary of column names and pandas series as a pandas DataFrame.
Let's try an experiment. What would happen if the index values in these dictionaries
were out of sync?

Let's find out. On screen now, we have a dictionary. We again have the column
names, Name and City. But this time the corresponding pandas series have different
index labels. The Name column has the index labels 1, 2, 3, 4, while the City series
has the index values 0, 1, 2, and 3.

employee_details = {'Name' : pd.Series(['Molly', 'Joseph', 'Daniel', 'Nick'], index=[1, 2, 3, 4]),
                    'City' : pd.Series(['San Francisco', 'Indianapolis', 'Seattle', 'Denver'], index=[0, 1, 2, 3])}

What happens if we now construct a DataFrame from this employee_details dictionary? Well, let's find out. And we can now see that we have all of the index values.

df = pd.DataFrame(employee_details)

df

So we've got the union of the two sets of index values. We have 0, 1, 2, 3, and 4. The
missing values, however, are represented by NaNs. And that's why we have one NaN
in row with label 0, and that's in the Name column. And we have a second NaN in the
row with label 4, and that's in the City column. So we've now mastered yet another
way of creating pandas DataFrames, this time from a dictionary in which each of the
values was a pandas series. This little example drives home yet again how important it is for the index values, the column names, and the keys in the underlying dictionaries to all be in sync. Otherwise, the DataFrame which we create from them is probably going to have missing data in the form of NaNs.
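As a practical follow-up, not covered in the demo, pandas can report exactly where those gaps are. A minimal sketch, assuming the df we just created:

df.isna()        # a Boolean DataFrame, True wherever a value is NaN

df.isna().sum()  # the count of missing values in each column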


9. Course Summary
Topic title: Course Summary

We've now come to the end of this course Getting Started with Data Analytics using
Pandas. The most important Python library for data analysts when they want to read,
manipulate, and transform tabular or time series data is Pandas. In this course, we've
used Jupyter Notebooks to analyze and work with data, and saw how we could use
shell commands from within Jupyter Notebooks to install the Python packages that
we need to get started. We then moved on to the use of Pandas series objects. A
series object represents a single column of data in a tabular format.

We explored a variety of ways to create series objects from Python lists and
dictionaries and using the Numpy library. We saw that every value in a series is
associated with a unique label. All of those labels together constitute the index. We
saw how we could assign our own custom indexes to series objects, and then
perform a number of analytical operations using built-in methods. For instance, we
computed basic statistics on series objects such as minimum and maximum and also
saw how general purpose aggregations could be applied using the agg method. We
then saw how we could instantiate and read data into Pandas DataFrames. These are
the core tabular structure used to store data in Pandas.

We rounded this course off by performing filter operations on DataFrames using the
loc and iloc operations which are fundamental techniques used to access only those
rows and columns which we are interested in. We learned the basics of loc and iloc, and saw how loc is used to identify rows based on labels, while iloc is used to access rows based on the index offset position, starting from zero. Once you're done with
this course, you should have a strong foundation in manipulating data using Pandas
series and data frame objects. This will help you as you move forward to more
advanced operations with Pandas objects in the course, importing, exporting, and
analyzing data using Pandas coming up next.

