
www.justintodata.com/learn-python-pandas-for-data-science-tutorial/

Learn Python Pandas for Data Science: Quick Tutorial
Examples for all primary operations of DataFrames
20/6/2020

This is a guide to help you quickly learn and use Python pandas for data science and machine learning.

Pandas is the most popular Python package for data manipulation and analysis. Data analysts or
scientists often spend 80% of their time exploring and transforming data, which makes pandas an
essential skill for data science.

In this tutorial, you’ll learn the most useful data science-related pandas techniques:

What is pandas in Python.
How to install and import pandas.
What are the pandas data structures.
How to do the key DataFrame operations, such as reading/saving data, exploring data, and manipulating data.
Much more.

If you've just learned the basics of Python, this practical tutorial with examples will help you start doing real data analysis.
Note: if you are new to Python, take our FREE Python crash course for data science.

Let’s begin!

What is Pandas Python for Data Science?


Or why pandas?

Because pandas:

is easy to use, fast, and open-source.
offers powerful data structures and operations.
can be used to import, write, manipulate, clean, and analyze data. Many people consider it the Excel of Python.
is the most popular Python data analysis/manipulation package, and it is built upon NumPy.

In summary, pandas is a must-know package for data science or machine learning.

Note before the tutorial: it’s not necessary (or possible) to memorize all the functions or methods. Plus,
there might be other functions or methods you need in the future.
We suggest you follow the tutorial, which covers the essentials, to learn systematically. When you are practicing data science, a simple Google search will often solve the problem.

Install and Import Pandas


So how do you get started with pandas?

You can install the Anaconda distribution to start using Python for data science, which is FREE and
includes popular packages such as pandas.

If you want to get pandas, check out How to Install/Setup Python and Prep for Data Science NOW.
If you want to install pandas and learn the basics of Python, take our FREE Python crash course.
This guide assumes you have all the basic Python knowledge.

If you already have Python but not pandas, please use the command below to install it.

pip install pandas

After installing the package, we need to import it in each new Python session.

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
We also import the matplotlib package, which allows us to show some simple data visualizations.

Learn Python Pandas Data Structures


pandas has two main data structures: Series and DataFrame.

Series: a 1-dimensional labeled array that can hold any data type, such as integers, strings, floating-point numbers, or Python objects. It has row/axis labels as its index.
DataFrame: a 2-dimensional labeled data structure with columns of potentially different types. It also has row labels as its index.

A DataFrame can be considered a collection of Series; it has a structure like a spreadsheet.
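For a quick feel of the two structures, here is a minimal sketch (the values are just placeholders) that builds each one from scratch:

s = pd.Series([305000, 650000, 267000], index=['a', 'b', 'c'])  # a 1-dimensional labeled array
df = pd.DataFrame({'price': [305000, 650000, 267000], 'sold': [True, False, True]})  # a 2-dimensional table; each column is a Series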

In this tutorial, we will focus on DataFrames since they need a lot more data manipulation than Series.

Next, let’s see how to read in data as pandas DataFrames.

Import Data into pandas DataFrame


To work in pandas, we can either create DataFrames or read data from external sources.

We’ll cover three common cases in practice.

Dictionary to DataFrame

There are different ways to construct DataFrames from scratch. Let’s use Python dictionaries as an
example.

After a raw dataset is preprocessed in Python, it is often convenient to compile the new dataset as a
dictionary. But dictionaries are not great objects for further analysis.

This is when we need to convert this dictionary object into a pandas DataFrame.

This is especially common when analyzing log files or text.

Let’s see an example below.

We have a dataset of housing sales records. First, we create two lists with the houses' addresses and prices. Then we combine these lists into a dictionary d with key 'address' mapped to l1 and key 'price' mapped to l2. We can simply transform this dictionary into a pandas DataFrame df2 as below.

l1 = ['123 first st.', '234 second st.', '345 third st.', '456 fourth st.', '567 fifth st.', '678 sixth st.']
l2 = [305000, 650000, 267000, 500000, 125000, 875000]
d = {'address': l1, 'price': l2}
df2 = pd.DataFrame(data=d)
df2
This is great!

We’ve created our first pandas DataFrame. As you can see, it has six rows and two columns with an
index from 0 to 5.

Yet, more often we import data from outside sources. Let's look at the most common source: csv files.

read_csv

Within this tutorial, we’ll mainly use two sample datasets. Both can be downloaded following the links.

One is a YouTube dataset sydney_videos.csv; the other is a transactional dataset trans.csv. They are
both datasets we’ve done analyses on, which makes this guide more practical. Please make sure the two
files are stored in the same working directory as your Python code. This saves us from specifying the
path in read_csv.

Since both datasets are saved as csv files, let's use the pandas read_csv function to import them into Python.

For the YouTube dataset, we can use the default settings of read_csv to load it as DataFrame df0.

We then call df0 for a look.

df0 = pd.read_csv('sydney_videos.csv')
df0
We can see that df0 is a DataFrame with 837 rows and four columns. Each of the rows is a video with
columns:

title
posted_ago (time since posting)
length
views (number of views)

We can also change the settings of the input parameters to read_csv. For example, to read the
transactional dataset, we set:

sep=';' to change the delimiter from commas (,) to semicolons (;).
dtype={'k_symbol': 'string', 'bank': 'string'} to set the data types of columns k_symbol and bank to strings.
parse_dates=['date'] to parse the column date as a datetime format.

df1 = pd.read_csv('trans.csv', sep=';', dtype={'k_symbol': 'string', 'bank': 'string'}, parse_dates=['date'])
df1
Further Reading: Pandas read_csv to DataFrames: Python Pandas Tutorial
Read this complete tutorial to learn more about Python pandas read_csv. Or take the last section as a
cheat sheet.

read_sql

Another common case is when we want to import data into Python from SQL servers. We can use read_sql to read a SQL query or database table into a DataFrame.

For example, we have a sample database in Microsoft SQL Server. We first set up a connection with the
server by using the SQLAlchemy function create_engine. Then we can query within Python to get the
data as DataFrame.

from sqlalchemy import create_engine

engine = create_engine("mssql+pyodbc://JustintoData")
conn = engine.connect()
df3 = pd.read_sql('SELECT TOP 100 AddressLine1, AddressLine2, City FROM [AdventureWorks].[Person].[Address]', conn)
df3
Further Readings:
How to Download and Install SQL Server and Sample Database: Step-by-Step
SQL Tutorial for Beginners: Learn SQL for Data Analysis
SQL is another essential data science skill. Read these articles to master it.

Besides CSV files and SQL databases, we can read data from a variety of sources with pandas. Learn
more with pandas’ official IO tools documentation.
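For instance, here is a minimal sketch with two other common readers (the file names are hypothetical; read_excel also needs an Excel engine such as openpyxl installed):

df_excel = pd.read_excel('housing.xlsx')  # read an Excel worksheet into a DataFrame
df_json = pd.read_json('videos.json')     # read a JSON file into a DataFrame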

At the end of the tutorial, we’ll show how to save the DataFrames on your computer.

But let’s see what we can do with these DataFrames first.

Explore the DataFrames


After creating the pandas DataFrames, let’s learn more about their basic characteristics.

We’ll cover a few popular methods to do this.

info: summary

The info method shows a helpful summary of our DataFrames. It is one of the first methods you should
apply after loading in the dataset.

Let’s try it out on the YouTube DataFrame df0.

df0.info()
We can see that it prints information including:

the index information
the column names, non-null counts, and dtypes
memory usage

head or tail: first or last n rows

The head method is also used in every data analysis project. With its help, we can see exactly what the first n rows look like within the DataFrame.

Let’s print the first five rows of df0 using the default settings of the method.

df0.head() # prints out first 5 rows.



We can also specify the number of rows to display.

df0.head(20) # prints out first 20 rows.



The tail method returns the last n rows of the DataFrame. It can be used as a quick verification after we
sort or append rows.

df0.tail() # prints out last 5 rows.



shape: dimensionality

To get the number of rows and columns of the DataFrame, we can use shape.

df0.shape

(837, 4)

describe: descriptive statistics

The method describe generates descriptive statistics for both numeric and object columns.

For example, for the numeric column views, we can print out its summary statistics such as:

count
mean
standard deviation
percentiles.

df0['views'].describe()

For the object column posted_ago, we can see its corresponding statistics:

count
unique counts

the most common value and its frequency (top, freq).

df0['posted_ago'].describe()

boxplot and histogram: explore numerical variables

When the columns are numeric, it’s useful to explore their distributions by boxplot and histogram.

We can do a boxplot on the numerical column views.

df0.boxplot('views')

We can also look at its histogram.

df0.hist('views', bins=50)
Note: we prefer the seaborn library for data science since it has more features. But these methods in
pandas are good for quick and simple plots.
Further Readings: How to use Python Seaborn for Exploratory Data Analysis
The YouTube dataset was also used in this complete guide of using seaborn Python for Exploratory Data
Analysis. Discover how to use histograms, barplots, scatter plots, boxplots, heatmaps, etc.
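As a quick sketch of the seaborn equivalents of the two plots above (assuming seaborn is installed and recent enough to provide histplot, and running each plot in its own cell or figure):

import seaborn as sns
sns.histplot(data=df0, x='views', bins=50)  # histogram of views
sns.boxplot(data=df0, x='views')            # boxplot of views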

value_counts and bar plot: unique values of categorical variables

The value_counts method is handy for categorical columns. It returns a pandas Series with counts of
unique values, in descending order. And it excludes NA values by default.

For example, we can use it to see frequency tables of column posted_ago. It provides more details than
the previous method describe().

df0['posted_ago'].value_counts()

We can also display the results in percentages by using the normalize parameter.

df0['posted_ago'].value_counts(normalize=True)
It’s also helpful to visualize this result. We can do this by using the plot method.

df0['posted_ago'].value_counts().plot(kind='bar')

We just plotted a bar chart.

Note: again, we prefer the seaborn library for data science since it has more features. But these methods
in pandas are good for quick and simple plots.
Further Readings: How to use Python Seaborn for Exploratory Data Analysis

Now with a better understanding of the DataFrames, let’s see what we can do with its columns and rows.

Select and Manipulate Columns and Rows


Columns and rows are the basic structures of DataFrames. It’s common practice to select a subset or
modify the columns/rows. Let’s learn how to do it with Python pandas. We’ll also cover related topics
such as index settings and filtering.

Columns of DataFrames

The columns of a DataFrame often represent the variables/features in the dataset. Each column has a name we can use to refer to it.

Select Columns

The most basic way of selecting a column is by calling its name within square brackets [].

For example, we can print out the columns length and views separately below.

df0['length']
df0['views']
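We can also select multiple columns at once by passing a list of column names inside the brackets; a minimal sketch:

df0[['title', 'views']]  # returns a smaller DataFrame with just these two columns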
Column Operations

We can also perform simple numerical or categorical operations on columns.

For example, we can perform a multiplication of 100 on the numeric column views.

df0['views'] * 100

Add New Columns

We can also create new columns in the DataFrame by assigning transformations of existing columns.

The column length stores the number of seconds of the videos. Sometimes it’s more convenient to look
at it in minutes. Let’s create a new column below.

df0['length_in_minutes'] = df0['length'] / 60
df0.head(10)

By printing out the head of df0, we can see the new column length_in_minutes being added to it.

Similarly, we can perform categorical operations and create new columns.

df0['title_lowercase'] = df0['title'].str.lower()
df0.head(10)

The new column title_lowercase, which contains the lowercase version of title, is added to the DataFrame.

Rows of DataFrames

Rows represent observations in the datasets. In this section, we’ll cover the basic methods for getting
subsets of the DataFrame and indexing.

The two main methods to locate rows are either by labels (loc) or integer-location based indexing (iloc).

Before that, let’s change the index of our dataset to contain special labels.

set_index: set index to columns

Like strings or lists, every pandas DataFrame has an index. By default, it is an integer ranging from 0 to
(number of observations – 1). But we can also set the index to be one of the columns.

The default index of the transaction dataset df1 is integers 0 to 1056319. Since it is a time series dataset,
the column date could be a better index.

Let’s set the index to the column date.

df1 = df1.set_index('date')
df1.head(10)

We can see that the date column replaced the leftmost index column.

Now we are ready to explore the row locating methods.

loc: locate by labels

We’ll start from the loc method, which helps us to locate rows by their labels/names.

For example, if we want to see all the transactions that happened on 1996-12-25, we can use loc to find
them.

df1.loc['1996-12-25']

We can also get a range of dates using slicing. Note the difference from the usual Python slices. Both the
start and the stop are included.

df1.loc['1997-01-01':'1997-01-31']
The loc method also allows us to subset both rows and columns by the labels.

For example, instead of printing all the columns for date index ‘1996-12-25’, we can specify the columns.

df1.loc['1996-12-25', ['account_id', 'type', 'amount']]



Besides labels, we can also use boolean arrays (True/False) to subset rows, which will be shown in the
Filtering section later.

reset_index

One of the disadvantages of an index is that the data in it is not as accessible as the other columns. To
do better analysis on the dates in later sections, let’s reset the index of df1 to return the date as a column.

df1 = df1.reset_index()
df1.head()

iloc: locate by integer position

The iloc method allows us to locate rows based on integer-location based indexing.

For example, to select the rows at positions 10 through 19, we can use slicing with iloc.

df1.iloc[10:20]

Note that the stop index 20 is not included since iloc slicing is similar to the regular Python slices.

When selecting columns using iloc, we need to specify the column number (starting from 0) rather than
the names. For example, the column date is the first column, so it has the index 0.

df1.iloc[10:20, [2,0,4]]
Besides integers, we can also use boolean arrays (True/False) to subset rows with iloc, similar to loc.
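As a minimal sketch of that (the 100000 threshold is just an illustrative value; iloc expects a plain boolean array rather than a labeled Series, so we convert first):

msk = (df1['amount'] > 100000).to_numpy()  # plain boolean array of the same length as df1
df1.iloc[msk]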

Random Sampling

Random sampling is a common practice in data science. We can use the sample method in pandas to do
this.

For example, we can take a random sample of size ten from df1 and assign it to a new DataFrame
df1_samp1.

df1_samp1 = df1.sample(n=10)  # specifying the number of observations
df1_samp1

Or, we can take a certain proportion from the original dataset using the parameter frac.

df1_samp2 = df1.sample(frac=0.0001)  # specifying the proportion of rows to sample
df1_samp2
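One extra note beyond the original examples: passing random_state makes the sample reproducible (the seed value 1 and the name df1_samp3 are just for illustration):

df1_samp3 = df1.sample(n=10, random_state=1)  # the same 10 rows every time this runs
df1_samp3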

Filter: Conditional Selection

Filtering is another popular technique in Python pandas to learn for data science. It’s useful when we
want to subset the data based on some condition(s).

For example, we can grab the rows with both views over 50,000 and a length of over 50 minutes. The variable msk is a boolean array whose length equals the number of rows in the DataFrame. Each element is True when the condition is satisfied, and False otherwise.

msk = (df0['views'] > 50000) & (df0['length_in_minutes'] > 50)
df0[msk]

As you may have noticed, we can use logical operators such as & (AND) and | (OR) in filtering. It is good practice to put parentheses around the conditions to make the order of evaluation explicit. Otherwise, the expression might not work as expected.

Moreover, we can combine filtering with loc to select rows and columns as below. This is when loc is
used to locate rows based on boolean arrays.

df0.loc[msk, ['title_lowercase', 'posted_ago']]



Besides displaying the subset, we can also change the values in the DataFrame based on the filters.

We have non-English terms like ‘PRIJEM’ in the column type within the transactions dataset.

We can transform them into English as below.

msk = (df1['type'] == 'PRIJEM')
df1.loc[msk, ['type']] = 'DEPOSIT'      # relabel 'PRIJEM' as 'DEPOSIT'
msk = (df1['type'] == 'VYDAJ')
df1.loc[msk, ['type']] = 'WITHDRAWAL'   # relabel 'VYDAJ' as 'WITHDRAWAL'
df1.head()

Now with the English categories in column type, we can more easily perform our analysis.

That's a lot of row and column manipulation! Let's move on to another critical pandas functionality: GroupBy.

Groupby
Before learning pandas GroupBy, let’s look at some aggregating functions.

Below are some of the popular ones. We are not showing the results of the code since they are
straightforward.

df1['amount'].count()
df1['amount'].sum()
df1['amount'].min()
df1['amount'].median()
df1['amount'].max()
df1['amount'].quantile([0.25, 0.5, 0.75, 0.9, 0.95, 0.99])

Instead of applying over the whole dataset, aggregating functions are more powerful on groups.

pandas GroupBy is a popular manipulation tool for data science. It can be used to group large amounts of data and compute operations on each group. The idea should be familiar to those who have used SQL's SELECT and GROUP BY statements.

For example, we can get the sum and mean of the column amount, grouped by column type.

df1.groupby('type')['amount'].sum()
df1.groupby('type')['amount'].mean()

We can also group by multiple columns. For example, we can get the sum and mean of the column amount, grouped by type and operation.

df1.groupby(['type', 'operation'])['amount'].sum()
df1.groupby(['type', 'operation'])['amount'].mean()
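As a small additional sketch beyond the original examples, the agg method can compute several statistics in one call:

df1.groupby(['type', 'operation'])['amount'].agg(['sum', 'mean', 'count'])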

There is a lot more to learn about pandas Groupby. That’s why we wrote another tutorial below. Take a
look to dig into it deeper!

Further Reading: How to GroupBy with Python Pandas Like a Boss


To learn more about Python Pandas GroupBy, read this complete tutorial with practical examples. This
article covers transform and apply functions as well.

Data Cleaning/Manipulation
pandas is also a powerful tool for cleaning and transforming the dataset. Let’s see some useful functions.

Apply customized Functions

With the pandas apply method, we can apply a function on the DataFrame.

Let’s see an example.

The DataFrame df1’s type column has three main categories, as shown below.

df1['type'].value_counts()

What if we want to perform different operations on different types of transactions:

when it's a deposit, calculate amount/balance.
when it's a withdrawal, calculate amount/(balance + amount).

We can first define a function specifying the logic described above and then apply it to the DataFrame. The parameter axis is set to 1 to apply the function to each row.

def pct_of_balance_trans(row):
    if row['type'] == 'DEPOSIT':
        return row['amount'] / row['balance']
    elif row['type'] == 'WITHDRAWAL':
        return row['amount'] / (row['balance'] + row['amount'])
    else:
        return -999  # flag value for any other transaction type

df1['pct_of_balance_trans'] = df1.apply(pct_of_balance_trans, axis=1)
df1

As you can see, the new column is added to the DataFrame.

Missing Data (NA/Null)

Missing data is a common issue for most datasets.

For example, by calling the tail of df1, we can see it has some missing values in the operation, bank, and
account columns.

df1.tail()

We can fill in the missing values of column operation with ‘UNKNOWN’ using the fillna method.

df1['operation'] = df1['operation'].fillna('UNKNOWN')
df1.tail()


We can also combine filtering with the isnull and notnull methods to subset rows with or without missing values.

The code below subsets all the rows with the missing values in the bank column.

msk = df1['bank'].isnull()
df1[msk]

The code below subsets all the rows with the bank column NOT missing.

msk = df1['bank'].notnull()
df1[msk]

Further Reading: Data Cleaning Techniques in Python: the Ultimate Guide
There is a lot more to learn about missing data. Take a look at this guide for how to deal with missing
data.

Remove Columns/Rows

When we want to drop rows or columns from a DataFrame, we use the drop method.

When we drop columns, we set the parameter axis = 1. The default is 0, which drops by index/rows.

For example, we can drop the column title as below.

df0.drop(['title'], axis=1)
# df0 = df0.drop(['title'], axis=1)  # assign the result to keep the dropped version

Or we can drop a few rows.

df0.drop([832, 833, 834, 835], axis=0)
# df0 = df0.drop([832, 833, 834, 835], axis=0)  # assign the result to keep the dropped version

Note that the drop method is not permanent; we often assign the result to a new DataFrame for further analysis. For example, df0_new = df0.drop(['title'], axis=1). This makes sure we always keep the original dataset.

For details on drop_duplicates, please check out the duplicates section in Data Cleaning Guide.

Date functions

pandas has a lot of functionality for dates, which are common variables in data science.

For example, we can get the year, month, day, and weekday from the date column.

df1['year'] = df1['date'].dt.year
df1['month'] = df1['date'].dt.month
df1['day'] = df1['date'].dt.day
df1['weekday'] = df1['date'].dt.weekday
df1[['date', 'year', 'month', 'day', 'weekday']]

We can then use methods such as Groupby and plot to analyze the data based on the date.

For example, we plot the bar chart of the average amount by month.

df1.groupby('month')['amount'].mean().plot(kind='bar')

Further Reading: Data Cleaning in Python: the Ultimate Guide (2020)


Again, we strongly recommend reading this guide if you want to clean the data better.

Merge and Concatenate DataFrames
Similar to SQL tables, we can also merge or concatenate DataFrames. Let’s see how it works.

For the demonstration, we create two new DataFrames left and right below.

left = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': ['a', 'b', 'c', 'd', 'e']})
right = pd.DataFrame({'col1': [3, 4, 5, 6, 7], 'col3': ['f', 'g', 'h', 'i', 'j']})
print(left)
print(right)

Merge

The merge join is done on columns or indexes.

For example, we can merge the left and right DataFrames on their common column col1. The how
parameter specifies the type of merge to be performed. For example, when how = ‘left’, the new
DataFrame keeps all the rows from the left DataFrame and the corresponding values from the right
DataFrame.

pd.merge(left, right, on=['col1'], how='inner')
pd.merge(left, right, on=['col1'], how='left')
pd.merge(left, right, on=['col1'], how='right')
pd.merge(left, right, on=['col1'], how='outer')

The four calls above produce the inner, left, right, and outer merge results, respectively.

Concatenate

When we have two DataFrames with the same row or column information, we can glue them together
using concat.

For example, if we create two new DataFrames with the same columns, we may want to combine these
two DataFrames.

d = {'account_id': [1, 2, 3, 11382], 'gender': ['female', 'male', 'female', 'male'], 'age': [55, 25, 29, 39]}
df4 = pd.DataFrame(d)
df4
d = {'account_id': [4, 5, 6, 7], 'gender': ['female', 'male', 'female', 'male'], 'age': [12, 13, 14, 15]}
df5 = pd.DataFrame(d)
df5

df6 = pd.concat([df4, df5])
df6
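One note beyond the original example: since df4 and df5 each carry the default index 0 to 3, df6 repeats those row labels. Passing ignore_index=True builds a fresh 0 to 7 index instead:

df6 = pd.concat([df4, df5], ignore_index=True)  # re-number the rows 0 to 7
df6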

Sort DataFrames
What if we want to display the DataFrame in order of a column?

34/37
For example, we can use sort_values to order the DataFrame by columns account_id, and trans_id, in
ascending order.

df1 = df1.sort_values(by=['account_id', 'trans_id'])
df1
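As an extra sketch beyond the original, the ascending parameter controls the sort direction; for example, to order by amount from largest to smallest:

df1.sort_values(by='amount', ascending=False)  # largest transactions first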

Save or Output DataFrames


After the many operations above, we may want to output and save these pandas DataFrames for further analysis.

We’ll introduce the two common methods – pickles and csv files.

pickles

If we need to save the DataFrame and resume our analysis in Python later, it's better to output it in the pickle format. As a pickle, the DataFrame is saved exactly as it is in Python, i.e., with the same index, data types, etc.

Let’s see how it works.

When we want to pause the analysis, we use to_pickle to save the DataFrame df0 as videos.pkl.

df0.to_pickle('videos.pkl')

Next time, we can access the same DataFrame by using read_pickle. Let’s also print out its head and
info.

df0_from_pickle = pd.read_pickle('videos.pkl')
df0_from_pickle.head()
df0_from_pickle.info()


The DataFrame looks the same as before. We can keep analyzing from where we left off.

csv files

csv files are more common than pickles and can be opened by other software/programs. So if we want to
share the data with others, it’s better to save our DataFrames as csv files.

Note that some information might be lost compared to pickles. For example, we might have to specify the
column data types when loading the csv files into Python again.

df0.to_csv('videos.csv', index=False)
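When loading the csv back in, a minimal sketch would look like the following (the dtype choice here is only an illustrative assumption; specify whichever columns need explicit types):

df0_from_csv = pd.read_csv('videos.csv', dtype={'posted_ago': 'string'})
df0_from_csv.info()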

That’s it! You’ve learned a lot about Python pandas for data science.

Did you get a better idea of how to analyze and manipulate data?

Leave a comment for any questions you may have or anything else!

Related “Break into Data Science” resources:

How to Learn Data Science Online: ALL You Need to Know


A detailed review of resources online, including courses, books, free tutorials, portfolios building, and
more.

Python crash course: breaking into Data Science
A FREE Python online course, beginner-friendly tutorial. Start your successful data science career
journey.

What are the In-Demand Skills for Data Scientists


Why Python, SQL, Machine Learning are the most in-demand skills for data science.

SQL Tutorial for Beginners: Learn SQL for Data Analysis


An ultimate tutorial to learn SQL for data analysis (from beginner to advanced). Learn & master SQL
queries with this practical guide.
