
Contents

1. Preface
   1. About the Book
      1. About the Authors
   2. Learning Objectives
   3. Approach
   4. Audience
   5. Minimum Hardware Requirements
   6. Software Requirements
   7. Conventions
   8. Installation and Setup
   9. Installing the Code Bundle
   10. Additional Resources
2. Chapter 1
3. Introduction to Data Wrangling with Python
   1. Introduction
      1. Importance of Data Wrangling
   2. Python for Data Wrangling
   3. Lists, Sets, Strings, Tuples, and Dictionaries
      1. Lists
      2. Exercise 1: Accessing the List Members
      3. Exercise 2: Generating a List
      4. Exercise 3: Iterating over a List and Checking Membership
      5. Exercise 4: Sorting a List
      6. Exercise 5: Generating a Random List
      7. Activity 1: Handling Lists
      8. Sets
      9. Introduction to Sets
      10. Union and Intersection of Sets
      11. Creating Null Sets
      12. Dictionary
      13. Exercise 6: Accessing and Setting Values in a Dictionary
      14. Exercise 7: Iterating Over a Dictionary
      15. Exercise 8: Revisiting the Unique Valued List Problem
      16. Exercise 9: Deleting Value from Dict
      17. Exercise 10: Dictionary Comprehension
      18. Tuples
      19. Creating a Tuple with Different Cardinalities
      20. Unpacking a Tuple
      21. Exercise 11: Handling Tuples
      22. Strings
      23. Exercise 12: Accessing Strings
      24. Exercise 13: String Slices
      25. String Functions
      26. Exercise 14: Split and Join
      27. Activity 2: Analyze a Multiline String and Generate the Unique Word Count
   4. Summary
4. Chapter 2
5. Advanced Data Structures and File Handling
   1. Introduction
   2. Advanced Data Structures
      1. Iterator
      2. Exercise 15: Introduction to the Iterator
      3. Stacks
      4. Exercise 16: Implementing a Stack in Python
      5. Exercise 17: Implementing a Stack Using User-Defined Methods
      6. Exercise 18: Lambda Expression
      7. Exercise 19: Lambda Expression for Sorting
      8. Exercise 20: Multi-Element Membership Checking
      9. Queue
      10. Exercise 21: Implementing a Queue in Python
      11. Activity 3: Permutation, Iterator, Lambda, List
   3. Basic File Operations in Python
      1. Exercise 22: File Operations
      2. File Handling
      3. Exercise 23: Opening and Closing a File
      4. The with Statement
      5. Opening a File Using the with Statement
      6. Exercise 24: Reading a File Line by Line
      7. Exercise 25: Write to a File
      8. Activity 4: Design Your Own CSV Parser
   4. Summary
6. Chapter 3
7. Introduction to NumPy, Pandas, and Matplotlib
   1. Introduction
   2. NumPy Arrays
      1. NumPy Array and Features
      2. Exercise 26: Creating a NumPy Array (from a List)
      3. Exercise 27: Adding Two NumPy Arrays
      4. Exercise 28: Mathematical Operations on NumPy Arrays
      5. Exercise 29: Advanced Mathematical Operations on NumPy Arrays
      6. Exercise 30: Generating Arrays Using arange and linspace
      7. Exercise 31: Creating Multi-Dimensional Arrays
      8. Exercise 32: The Dimension, Shape, Size, and Data Type of the Two-dimensional Array
      9. Exercise 33: Zeros, Ones, Random, Identity Matrices, and Vectors
      10. Exercise 34: Reshaping, Ravel, Min, Max, and Sorting
      11. Exercise 35: Indexing and Slicing
      12. Conditional Subsetting
      13. Exercise 36: Array Operations (array-array, array-scalar, and universal functions)
      14. Stacking Arrays
   3. Pandas DataFrames
      1. Exercise 37: Creating a Pandas Series
      2. Exercise 38: Pandas Series and Data Handling
      3. Exercise 39: Creating Pandas DataFrames
      4. Exercise 40: Viewing a DataFrame Partially
      5. Indexing and Slicing Columns
      6. Indexing and Slicing Rows
      7. Exercise 41: Creating and Deleting a New Column or Row
   4. Statistics and Visualization with NumPy and Pandas
      1. Refresher of Basic Descriptive Statistics (and the Matplotlib Library for Visualization)
      2. Exercise 42: Introduction to Matplotlib Through a Scatter Plot
      3. Definition of Statistical Measures – Central Tendency and Spread
      4. Random Variables and Probability Distribution
      5. What Is a Probability Distribution?
      6. Discrete Distributions
      7. Continuous Distributions
      8. Data Wrangling in Statistics and Visualization
      9. Using NumPy and Pandas to Calculate Basic Descriptive Statistics on the DataFrame
      10. Random Number Generation Using NumPy
      11. Exercise 43: Generating Random Numbers from a Uniform Distribution
      12. Exercise 44: Generating Random Numbers from a Binomial Distribution and Bar Plot
      13. Exercise 45: Generating Random Numbers from Normal Distribution and Histograms
      14. Exercise 46: Calculation of Descriptive Statistics from a DataFrame
      15. Exercise 47: Built-in Plotting Utilities
      16. Activity 5: Generating Statistics from a CSV File
   5. Summary
8. Chapter 4
9. A Deep Dive into Data Wrangling with Python
   1. Introduction
   2. Subsetting, Filtering, and Grouping
      1. Exercise 48: Loading and Examining a Superstore's Sales Data from an Excel File
      2. Subsetting the DataFrame
      3. An Example Use Case: Determining Statistics on Sales and Profit
      4. Exercise 49: The unique Function
      5. Conditional Selection and Boolean Filtering
      6. Exercise 50: Setting and Resetting the Index
      7. Exercise 51: The GroupBy Method
   3. Detecting Outliers and Handling Missing Values
      1. Missing Values in Pandas
      2. Exercise 52: Filling in the Missing Values with fillna
      3. Exercise 53: Dropping Missing Values with dropna
      4. Outlier Detection Using a Simple Statistical Test
   4. Concatenating, Merging, and Joining
      1. Exercise 54: Concatenation
      2. Exercise 55: Merging by a Common Key
      3. Exercise 56: The join Method
   5. Useful Methods of Pandas
      1. Exercise 57: Randomized Sampling
      2. The value_counts Method
      3. Pivot Table Functionality
      4. Exercise 58: Sorting by Column Values – the sort_values Method
      5. Exercise 59: Flexibility for User-Defined Functions with the apply Method
      6. Activity 6: Working with the Adult Income Dataset (UCI)
   6. Summary
10. Chapter 5
11. Getting Comfortable with Different Kinds of Data Sources
   1. Introduction
   2. Reading Data from Different Text-Based (and Non-Text-Based) Sources
      1. Data Files Provided with This Chapter
      2. Libraries to Install for This Chapter
      3. Exercise 60: Reading Data from a CSV File Where Headers Are Missing
      4. Exercise 61: Reading from a CSV File where Delimiters are not Commas
      5. Exercise 62: Bypassing the Headers of a CSV File
      6. Exercise 63: Skipping Initial Rows and Footers when Reading a CSV File
      7. Reading Only the First N Rows (Especially Useful for Large Files)
      8. Exercise 64: Combining Skiprows and Nrows to Read Data in Small Chunks
      9. Setting the skip_blank_lines Option
      10. Read CSV from a Zip file
      11. Reading from an Excel File Using sheet_name and Handling a Distinct sheet_name
      12. Exercise 65: Reading a General Delimited Text File
      13. Reading HTML Tables Directly from a URL
      14. Exercise 66: Further Wrangling to Get the Desired Data
      15. Exercise 67: Reading from a JSON File
      16. Reading a Stata File
      17. Exercise 68: Reading Tabular Data from a PDF File
   3. Introduction to Beautiful Soup 4 and Web Page Parsing
      1. Structure of HTML
      2. Exercise 69: Reading an HTML file and Extracting its Contents Using BeautifulSoup
      3. Exercise 70: DataFrames and BeautifulSoup
      4. Exercise 71: Exporting a DataFrame as an Excel File
      5. Exercise 72: Stacking URLs from a Document using bs4
      6. Activity 7: Reading Tabular Data from a Web Page and Creating DataFrames
   4. Summary
12. Chapter 6
13. Learning the Hidden Secrets of Data Wrangling
   1. Introduction
      1. Additional Software Required for This Section
   2. Advanced List Comprehension and the zip Function
      1. Introduction to Generator Expressions
      2. Exercise 73: Generator Expressions
      3. Exercise 74: One-Liner Generator Expression
      4. Exercise 75: Extracting a List with Single Words
      5. Exercise 76: The zip Function
      6. Exercise 77: Handling Messy Data
   3. Data Formatting
      1. The % operator
      2. Using the format Function
      3. Exercise 78: Data Representation Using {}
   4. Identify and Clean Outliers
      1. Exercise 79: Outliers in Numerical Data
      2. Z-score
      3. Exercise 80: The Z-Score Value to Remove Outliers
      4. Exercise 81: Fuzzy Matching of Strings
   5. Activity 8: Handling Outliers and Missing Data
   6. Summary
14. Chapter 7
15. Advanced Web Scraping and Data Gathering
   1. Introduction
   2. The Basics of Web Scraping and the Beautiful Soup Library
      1. Libraries in Python
      2. Exercise 81: Using the Requests Library to Get a Response from the Wikipedia Home Page
      3. Exercise 82: Checking the Status of the Web Request
      4. Checking the Encoding of the Web Page
      5. Exercise 83: Creating a Function to Decode the Contents of the Response and Check its Length
      6. Exercise 84: Extracting Human-Readable Text From a BeautifulSoup Object
      7. Extracting Text from a Section
      8. Extracting Important Historical Events that Happened on Today's Date
      9. Exercise 85: Using Advanced BS4 Techniques to Extract Relevant Text
      10. Exercise 86: Creating a Compact Function to Extract the "On this Day" Text from the Wikipedia Home Page
   3. Reading Data from XML
      1. Exercise 87: Creating an XML File and Reading XML Element Objects
      2. Exercise 88: Finding Various Elements of Data within a Tree (Element)
      3. Reading from a Local XML File into an ElementTree Object
      4. Exercise 89: Traversing the Tree, Finding the Root, and Exploring all Child Nodes and their Tags and Attributes
      5. Exercise 90: Using the text Method to Extract Meaningful Data
      6. Extracting and Printing the GDP/Per Capita Information Using a Loop
      7. Exercise 91: Finding All the Neighboring Countries for each Country and Printing Them
      8. Exercise 92: A Simple Demo of Using XML Data Obtained by Web Scraping
   4. Reading Data from an API
      1. Defining the Base URL (or API Endpoint)
      2. Exercise 93: Defining and Testing a Function to Pull Country Data from an API
      3. Using the Built-In JSON Library to Read and Examine Data
      4. Printing All the Data Elements
      5. Using a Function that Extracts a DataFrame Containing Key Information
      6. Exercise 94: Testing the Function by Building a Small Database of Countries' Information
   5. Fundamentals of Regular Expressions (RegEx)
      1. Regex in the Context of Web Scraping
      2. Exercise 95: Using the match Method to Check Whether a Pattern matches a String/Sequence
      3. Using the Compile Method to Create a Regex Program
      4. Exercise 96: Compiling Programs to Match Objects
      5. Exercise 97: Using Additional Parameters in Match to Check for Positional Matching
      6. Finding the Number of Words in a List That End with "ing"
      7. Exercise 98: The search Method in Regex
      8. Exercise 99: Using the span Method of the Match Object to Locate the Position of the Matched Pattern
      9. Exercise 100: Examples of Single Character Pattern Matching with search
      10. Exercise 101: Examples of Pattern Matching at the Start or End of a String
      11. Exercise 102: Examples of Pattern Matching with Multiple Characters
      12. Exercise 103: Greedy versus Non-Greedy Matching
      13. Exercise 104: Controlling Repetitions to Match
      14. Exercise 105: Sets of Matching Characters
      15. Exercise 106: The use of OR in Regex using the OR Operator
      16. The findall Method
      17. Activity 9: Extracting the Top 100 eBooks from Gutenberg
      18. Activity 10: Building Your Own Movie Database by Reading an API
   6. Summary
16. Chapter 8
17. RDBMS and SQL
   1. Introduction
   2. Refresher of RDBMS and SQL
      1. How is an RDBMS Structured?
      2. SQL
   3. Using an RDBMS (MySQL/PostgreSQL/SQLite)
      1. Exercise 107: Connecting to Database in SQLite
      2. Exercise 108: DDL and DML Commands in SQLite
      3. Reading Data from a Database in SQLite
      4. Exercise 109: Sorting Values that are Present in the Database
      5. Exercise 110: Altering the Structure of a Table and Updating the New Fields
      6. Exercise 111: Grouping Values in Tables
      7. Relation Mapping in Databases
      8. Adding Rows in the comments Table
      9. Joins
      10. Retrieving Specific Columns from a JOIN query
      11. Exercise 112: Deleting Rows
      12. Updating Specific Values in a Table
      13. Exercise 113: RDBMS and DataFrames
      14. Activity 11: Retrieving Data Correctly From Databases
   4. Summary
18. Chapter 9
19. Application of Data Wrangling in Real Life
   1. Introduction
   2. Applying Your Knowledge to a Real-life Data Wrangling Task
   3. Activity 12: Data Wrangling Task – Fixing UN Data
   4. Activity 13: Data Wrangling Task – Cleaning GDP Data
   5. Activity 14: Data Wrangling Task – Merging UN Data and GDP Data
   6. Activity 15: Data Wrangling Task – Connecting the New Data to the Database
   7. An Extension to Data Wrangling
      1. Additional Skills Required to Become a Data Scientist
      2. Basic Familiarity with Big Data and Cloud Technologies
      3. What Goes with Data Wrangling?
      4. Tips and Tricks for Mastering Machine Learning
   8. Summary
20. Appendix
   1. Solution of Activity 1: Handling Lists
      1. Solution of Activity 2: Analyze a Multiline String and Generate the Unique Word Count
      2. Solution of Activity 3: Permutation, Iterator, Lambda, List
      3. Solution of Activity 4: Design Your Own CSV Parser
      4. Solution of Activity 5: Generating Statistics from a CSV File
      5. Solution of Activity 6: Working with the Adult Income Dataset (UCI)
      6. Solution of Activity 7: Reading Tabular Data from a Web Page and Creating DataFrames
      7. Solution of Activity 8: Handling Outliers and Missing Data
      8. Solution of Activity 9: Extracting the Top 100 eBooks from Gutenberg
      9. Solution of Activity 10: Extracting the top 100 eBooks from Gutenberg.org
      10. Solution of Activity 11: Retrieving Data Correctly from Databases
      11. Solution of Activity 12: Data Wrangling Task – Fixing UN Data
      12. Activity 13: Data Wrangling Task – Cleaning GDP Data
      13. Solution of Activity 14: Data Wrangling Task – Merging UN Data and GDP Data
      14. Activity 15: Data Wrangling Task – Connecting the New Data to a Database
DATA WRANGLING WITH PYTHON

Copyright © 2019 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Authors: Dr. Tirthajyoti Sarkar and Shubhadeep Roychowdhury

Managing Editor: Steffi Monteiro

Acquisitions Editor: Kunal Sawant

Production Editor: Nitesh Thakur

Editorial Board: David Barnes, Ewan Buckingham, Shivangi Chatterji, Simon Cox, Manasa Kumar, Alex Mazonowicz, Douglas Paterson, Dominic Pereira, Shiny Poojary, Saman Siddiqui, Erol Staveley, Ankita Thakur, and Mohita Vyas.

First Published: February 2019

Production Reference: 1280219

ISBN: 978-1-78980-011-1

Published by Packt Publishing Ltd.

Livery Place, 35 Livery Street

Birmingham B3 2PB, UK

Preface
About
This section briefly introduces the authors, the coverage of this book, the technical skills you'll need to get started, and the hardware and software requirements needed to complete all of the included activities and exercises.

About the Book


For data to be useful and meaningful, it must be curated and refined. Data Wrangling with Python teaches you all the core ideas behind these processes and equips you with knowledge about the most popular tools and techniques in the domain.

The book starts with the absolute basics of Python, focusing mainly on data structures, and then quickly jumps into the NumPy and pandas libraries as the fundamental tools for data wrangling. We emphasize why you should stay away from the traditional way of data cleaning, as done in other languages, and take advantage of the specialized pre-built routines in Python. Thereafter, you will learn how, using the same Python backend, you can extract and transform data from a diverse array of sources, such as the internet, large database vaults, or Excel financial tables. You will also learn how to handle missing or incorrect data, and reformat it based on the requirements of the downstream analytics tool. You will learn about these concepts through real-world examples and datasets.

By the end of this book, you will be confident enough to handle a myriad of sources to extract, clean, transform, and format your data efficiently.

ABOUT THE AUTHORS


Dr. Tirthajyoti Sarkar works as a senior principal engineer in the semiconductor technology domain, where he applies cutting-edge data science/machine learning techniques to design automation and predictive analytics. He writes regularly about Python programming and data science topics. He holds a Ph.D. from the University of Illinois, and certifications in artificial intelligence and machine learning from Stanford and MIT.

Shubhadeep Roychowdhury works as a senior software engineer at a Paris-based cybersecurity start-up, where he is applying state-of-the-art computer vision and data engineering algorithms and tools to develop cutting-edge products. He often writes about algorithm implementation in Python and similar topics. He holds a master's degree in computer science from West Bengal University of Technology and certifications in machine learning from Stanford.

LEARNING OBJECTIVES
Use and manipulate complex and simple data structures

Harness the full potential of DataFrames and numpy.array at run time

Perform web scraping with BeautifulSoup4 and html5lib

Execute advanced string search and manipulation with RegEx

Handle outliers and perform data imputation with Pandas

Use descriptive statistics and plotting techniques

Practice data wrangling and modeling using data generation techniques

APPROACH
Data Wrangling with Python takes a practical approach to equip beginners with the most essential data analysis tools in the shortest possible time. It contains multiple activities that use real-life business scenarios for you to practice and apply your new skills in a highly relevant context.

AUDIENCE
Data Wrangling with Python is designed for developers, data analysts, and business analysts who are keen to pursue a career as a full-fledged data scientist or analytics expert. Although this book is for beginners, prior working knowledge of Python is necessary to easily grasp the concepts covered here. It will also help to have rudimentary knowledge of relational databases and SQL.

MINIMUM HARDWARE REQUIREMENTS
For the optimal student experience, we recommend the following hardware configuration:

Processor: Intel Core i5 or equivalent

Memory: 8 GB RAM

Storage: 35 GB available space

SOFTWARE REQUIREMENTS
You'll also need the following software installed in advance:

OS: Windows 7 SP1 64-bit, Windows 8.1 64-bit or Windows 10 64-bit, Ubuntu Linux, or the latest version of macOS

Processor: Intel Core i5 or equivalent

Memory: 4 GB RAM (8 GB preferred)

Storage: 35 GB available space

CONVENTIONS
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "This will return the value associated with it - ["list_element1", 34]"

A block of code is set as follows:

list_1 = []
for x in range(0, 10):
    list_1.append(x)
list_1

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "Click on New and choose Python 3."

INSTALLATION AND SETUP


Each great journey begins with a humble step. Our upcoming adventure in the land of data wrangling is no exception. Before we can do awesome things with data, we need to be prepared with the most productive environment. In this short section, we shall see how to do that.

The only prerequisite regarding the environment for this book is to have Docker installed. If you have never heard of Docker or you have only a very faint idea what it is, then fear not. All you need to know about Docker for the purpose of this book is this: Docker is a lightweight containerization engine that runs on all three major platforms (Linux, Windows, and macOS). The main idea behind Docker is to give you safe, easy, and lightweight virtualization on top of your native OS.

Install Docker

1. To install Docker on a Mac or Windows machine, create an account on Docker and download the latest version. It's easy to install and set up.

2. Once you have set up Docker, open a shell (or Terminal if you are a Mac user) and type the following command to verify that the installation has been successful:

docker version

If the output shows you the server and client version of Docker, then you are all set up.

Pull the image

1. Pull the image and you will have all the necessary packages (including Python 3.6.6) installed and ready for you to start working. Type the following command in a shell:

docker pull rcshubhadeep/packt-data-wrangling-base

2. If you want to know the full list of all the packages and their versions included in this image, you can check out the requirements.txt file in the setup folder of the source code repository of this book. Once the image is there, you are ready to roll. Downloading it may take time, depending on your connection speed.

Run the environment

1. Run the image using the following command:

docker run -p 8888:8888 -v 'pwd':/notebooks -it rcshubhadeep/packt-data-wrangling-base

This will give you a ready-to-use environment.

2. Open a browser tab in Chrome or Firefox and go to http://localhost:8888. You will be prompted to enter a token. The token is dw_4_all.

3. Before you run the image, create a new folder and navigate there from the shell using the cd command.

Once you have created a notebook, save it as an .ipynb file. You can use Ctrl + C to stop running the image.

Introduction to Jupyter notebook

Project Jupyter is open source, free software that gives you the ability to run code, written in Python and some other languages, interactively from a special notebook, similar to a browser interface. It was born in 2014 from the IPython project and has since become the default choice for the entire data science workforce.

1. Once you are running the Jupyter server, click on New and choose Python 3. A new browser tab will open with a new and empty notebook. Rename the Jupyter file:

Figure 0.1: Jupyter server interface

The main building blocks of Jupyter notebooks are cells. There are two types of cells: In (short for input) and Out (short for output). You can write code, normal text, and Markdown in In cells, press Shift + Enter (or Shift + Return), and the code written in that particular In cell will be executed. The result will be shown in an Out cell, and you will land in a new In cell, ready for the next block of code. Once you get used to this interface, you will slowly discover the power and flexibility it offers.

2. One final thing you should know about Jupyter cells is that when you start a new cell, by default, it is assumed that you will write code in it. However, if you want to write text, then you have to change the type. You can do that using the following sequence of keys: Escape -> m -> Enter:

Figure 0.2: Jupyter notebook

3. And when you are done with writing the text, execute it using Shift + Enter. Unlike the code cells, the result of the compiled Markdown will be shown in the same place as the "In" cell.

Note

To have a "cheat sheet" of all the handy key shortcuts in Jupyter, you can bookmark this Gist: https://gist.github.com/kidpixo/f4318f8c8143adee5b40. With this basic introduction and the image ready to be used, we are ready to embark on the exciting and enlightening journey that awaits us!

INSTALLING THE CODE BUNDLE
Copy the code bundle for the class to the C:/Code folder.

ADDITIONAL RESOURCES
The code bundle for this book is also hosted on GitHub at https://github.com/TrainingByPackt/Data-Wrangling-with-Python.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Chapter 1
Introduction to Data Wrangling with Python
Learning Objectives
By the end of this chapter, you will be able to do the following:

Define the importance of data wrangling in data science

Manipulate the data structures that are available in Python

Compare the different implementations of the inbuilt Python data structures

This chapter describes the importance of data wrangling, identifies the important tasks to be performed in data wrangling, and introduces basic Python data structures.

Introduction
Data science and analytics are taking over the whole world and the job of a data scientist is routinely being called the coolest job of the 21st century. But for all the emphasis on data, it is the science that makes you – the practitioner – truly valuable.

To practice high-quality science with data, you need to make sure it is properly sourced, cleaned, formatted, and pre-processed. This book teaches you the most essential basics of this invaluable component of the data science pipeline: data wrangling. In short, data wrangling is the process that ensures that the data is in a format that is clean, accurate, formatted, and ready to be used for data analysis.

A prominent example of data wrangling with a large amount of data is the one conducted at the Supercomputer Center of the University of California San Diego (UCSD). The problem in California is that wildfires are very common, mainly because of the dry weather and extreme heat, especially during the summers. Data scientists at the UCSD Supercomputer Center gather data to predict the nature and spread direction of the fire. The data that comes from diverse sources such as weather stations, sensors in the forest, fire stations, satellite imagery, and Twitter feeds might still be incomplete or missing. This data needs to be cleaned and formatted so that it can be used to predict future occurrences of wildfires.

This is an example of how data wrangling and data science can prove to be helpful and relevant.

IMPORTANCE OF DATA WRANGLING
Oil does not come in its final form from the rig; it has to be refined. Similarly, data must be curated, massaged, and refined to be used in intelligent algorithms and consumer products. This is known as wrangling. Most data scientists spend the majority of their time data wrangling.

Data wrangling is generally done at the very first stage of a data science/analytics pipeline. After the data scientists identify useful data sources for solving the business problem (for instance, in-house database storage, the internet, or streaming sensor data), they then proceed to extract, clean, and format the necessary data from those sources.

Generally, the task of data wrangling involves the following steps:

Scraping raw data from multiple sources (including web and database tables)

Imputing, formatting, and transforming – basically making it ready to be used in the modeling process (such as advanced machine learning)

Handling read/write errors

Detecting outliers

Performing quick visualizations (plotting) and basic statistical analysis to judge the quality of your formatted data

This is an illustrative representation of the positioning and essential functional role of data wrangling in a typical data science pipeline:

Figure 1.1: Process of data wrangling

The process of data wrangling includes first finding the appropriate data that's necessary for the analysis. This data can be from one or multiple sources, such as tweets, bank transaction statements in a relational database, sensor data, and so on. This data needs to be cleaned. If there is missing data, we will either delete or substitute it, with the help of several techniques. If there are outliers, we need to first detect them and then handle them appropriately. If data is from multiple sources, we will have to perform join operations to combine it.

In an extremely rare situation, data wrangling may not be needed. For example, if the data that's necessary for a machine learning task is already stored in an acceptable format in an in-house database, then a simple SQL query may be enough to extract the data into a table, ready to be passed on to the modeling stage.

Python for Data Wrangling
There is always a debate on whether to perform the wrangling process using an enterprise tool or by using a programming language and associated frameworks. There are many commercial, enterprise-level tools for data formatting and pre-processing that do not involve much coding on the part of the user. Examples include the following:

General-purpose data analysis platforms such as Microsoft Excel (with add-ins)

Statistical discovery packages such as JMP (from SAS)

Modeling platforms such as RapidMiner

Analytics platforms from niche players focusing on data wrangling, such as Trifacta, Paxata, and Alteryx

However, programming languages such as Python provide more flexibility, control, and power compared to these off-the-shelf tools.

As the volume, velocity, and variety (the three Vs of big data) of data undergo rapid changes, it is always a good idea to develop and nurture a significant amount of in-house expertise in data wrangling using fundamental programming frameworks, so that an organization is not beholden to the whims and fancies of any enterprise platform for as basic a task as data wrangling:

Figure 1.2: Google trend worldwide over the last five years

A few of the obvious advantages of using an open source, free programming paradigm such as Python for data wrangling are the following:

A general-purpose open source paradigm that puts no restriction on any of the methods you can develop for the specific problem at hand

A great ecosystem of fast, optimized, open source libraries focused on data analytics

Growing support to connect Python to every conceivable data source type

An easy interface to basic statistical testing and quick visualization libraries to check data quality

A seamless interface of the data wrangling output with advanced machine learning models

Python is the most popular language of choice for machine learning and artificial intelligence these days.
Lists, Sets, Strings, Tuples, and Dictionaries
Now that we have learned about the importance of Python, we will start by exploring various basic data structures in Python. We will learn techniques to handle data. This is invaluable for a data practitioner.

We can start a new Jupyter server by typing the following command into the Command Prompt window:

docker run -p 8888:8888 -v 'pwd':/notebooks -it rcshubhadeep/packt-data-wrangling-base:latest ipython

This will start a Jupyter server; you can visit it at http://localhost:8888 and use the passcode dw_4_all to access the main interface.

LISTS
Lists are fundamental Python data structures that have continuous memory locations, can host different data types, and can be accessed by index.

We will start with a list and list comprehension. We will generate a list of numbers, and then examine which ones among them are even. We will sort, reverse, and check for duplicates. We will also see how many different ways we can access the list elements, iterate over them, and check the membership of an element.

The following is an example of a simple list:

list_example = [51, 27, 34, 46, 90, 45, -19]

The following is also an example of a list:

list_example2 = [15, "Yellow car", True, 9.456, [12, "Hello"]]

As you can see, a list can contain any number of the allowed data types, such as int, float, string, and Boolean, and a list can also be a mix of different data types (including nested lists).

If you are coming from a strongly typed language, such as C, C++, or Java, then this will probably seem strange, as you are not allowed to mix different kinds of data types in a single array in those languages. Lists are somewhat like arrays, in the sense that they are both based on continuous memory locations and can be accessed using indexes. But the power of Python lists comes from the fact that they can host different data types and you are allowed to manipulate the data.

Note
Be careful, though, as the very power of lists, and the fact that you can mix different data types in a single list, can actually create subtle bugs that can be very difficult to track.

EXERCISE 1: ACCESSING THE LIST MEMBERS
In the following exercise, we will be creating a list and then observing the different ways of accessing its elements:

1. Define a list called list_1 with four integer members, using the following command:

list_1 = [34, 12, 89, 1]

The indices will be automatically assigned, as follows:

Figure 1.3: List showing the forward and backward indices
2. Access the first element from list_1 using its forward index:

list_1[0] #34

3. Access the last element from list_1 using its forward index:

list_1[3] #1

4. Access the last element from list_1 using the len function:

list_1[len(list_1) - 1] #1

The len function in Python returns the length of the specified list.

5. Access the last element from list_1 using its backward index:

list_1[-1] #1

6. Access the second and third elements from list_1 by slicing with forward indices:

list_1[1:3] # [12, 89]

This is also called list slicing, as it returns a smaller list from the original list by extracting only a part of it. To slice a list, we need two integers. The first integer denotes the start of the slice and the second integer denotes the end; the last element included in the slice is the one at index end-1.

Note

Notice that slicing did not include the third index or the end element. This is how list slicing works.

7. Access the last two elements from list_1 by slicing:

list_1[-2:] # [89, 1]

8. Access the first two elements using backward indices:

list_1[:-2] # [34, 12]

When we leave one side of the colon (:) blank, we are basically telling Python either to go until the end or to start from the beginning of the list. It will automatically apply the rule of list slices that we just learned.

9. Reverse the elements in the list:

list_1[-1::-1] # [1, 89, 12, 34]

Note

The last bit of code is not very readable, meaning it is not obvious just by looking at it what it is doing. It is against Python's philosophy. So, although this kind of code may look clever, we should resist the temptation to write code like this. A couple of more readable alternatives are sketched below.
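For reference, here is a minimal sketch (an addition to the book's exercise, not part of it) of two more readable ways to produce the same reversed list:

list_1 = [34, 12, 89, 1]

# Slicing with a step of -1 is the usual idiom for a reversed copy
list_1[::-1] # [1, 89, 12, 34]

# The built-in reversed() returns an iterator; wrap it in list() to materialize it
list(reversed(list_1)) # [1, 89, 12, 34]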

EXERCISE 2: GENERATING A LIST
We are going to examine various ways of generating a list:

1. Create a list using the append method:

list_1 = []
for x in range(0, 10):
    list_1.append(x)
list_1

The output will be as follows:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Here, we started by declaring an empty list and then we used a for loop to append values to it. The append method is a method that's given to us by the Python list data type.

2. Generate a list using the following command:

list_2 = [x for x in range(0, 100)]
list_2

The partial output is as follows:

Figure 1.4: List comprehension

This is list comprehension, which is a very powerful tool that we need to master. The power of list comprehension comes from the fact that we can use conditionals inside the comprehension itself.

3. Use a while loop to iterate over a list, to understand the difference between a while loop and a for loop:

i = 0
while i < len(list_1):
    print(list_1[i])
    i += 1

The partial output will be as follows:

Figure 1.5: Output showing the contents of list_1 using a while loop

4. Create list_3 with numbers that are divisible by 5:

list_3 = [x for x in range(0, 100) if x % 5 == 0]
list_3

The output will be a list of numbers up to 100 in increments of 5:

[0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95]

5. Generate a list by adding two lists:

list_1 = [1, 4, 56, -1]
list_2 = [1, 39, 245, -23, 0, 45]
list_3 = list_1 + list_2
list_3

The output is as follows:

[1, 4, 56, -1, 1, 39, 245, -23, 0, 45]

6. Extend a list using the extend method:

list_1.extend(list_2)
list_1

The partial output is as follows:

Figure 1.6: Contents of list_1

The second operation changes the original list (list_1) and appends all the elements of list_2 to it. So, be careful when using it.
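To make this difference concrete, here is a minimal sketch (using small throwaway lists, not part of the original exercise) contrasting the two approaches:

list_a = [1, 2]
list_b = [3, 4]

list_a + list_b # creates a brand-new list: [1, 2, 3, 4]
print(list_a) # [1, 2] - the original list is untouched

list_a.extend(list_b) # modifies list_a in place and returns None
print(list_a) # [1, 2, 3, 4]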

EXERCISE 3: ITERATING OVER A LIST AND CHECKING MEMBERSHIP
We are going to iterate over a list and test whether a certain value exists in it:

1. Iterate over a list:

list_1 = [x for x in range(0, 100)]
for i in range(0, len(list_1)):
    print(list_1[i])

The output is as follows:

Figure 1.7: Section of list_1

2. However, this is not very Pythonic. Being Pythonic means following and conforming to a set of best practices and conventions that have been created over the years by thousands of very able developers, which in this case means using the in keyword, because Python does not have index initialization, bounds checking, or index incrementing, unlike traditional languages. The Pythonic way of iterating over a list is as follows:

for i in list_1:
    print(i)

The output is as follows:

Figure 1.8: A section of list_1

Notice that, in the second method, we no longer need a counter to access the list index; instead, the for ... in construct gives us each element directly. (If you also need the index while iterating, see the short enumerate sketch after this exercise.)

3. Check whether the integers 25 and -45 are in the list using the in operator:

25 in list_1

The output is True.

-45 in list_1

The output is False.
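As mentioned in step 2, if you ever need both the element and its position while looping, a Pythonic way is the built-in enumerate function. This short sketch is an addition to the book's exercise, not part of it:

list_1 = [x for x in range(0, 100)]

# enumerate yields (index, element) pairs, so no manual counter is needed
for index, value in enumerate(list_1):
    print(index, value)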

EXERCISE 4: SORTING A LIST
We generated a list called list_1 in the previous exercise. We are going to sort it now:

1. As the list was originally a list of numbers from 0 to 99, we will sort it in the reverse direction. To do that, we will use the sort method with reverse=True:

list_1.sort(reverse=True)
list_1

The partial output is as follows:

Figure 1.9: Section of output showing the reversed list

2. We can use the reverse method directly to achieve this result:

list_1.reverse()
list_1

The output is as follows:

Figure 1.10: Section of output after reversing the list

Note
The difference between the sort function and the reverse function is that we can use sort with custom sorting functions to do custom sorting, whereas we can only use reverse to reverse a list (a small sketch of custom sorting follows this note). Here also, both functions work in-place, so be aware of this while using them.
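As a small illustration of the custom sorting mentioned in the note (an added sketch, not from the book), the sort method accepts a key function that decides how elements are compared:

words = ["banana", "fig", "apple", "kiwi"]

# Sort in place by word length instead of alphabetically
words.sort(key=len)
print(words) # ['fig', 'kiwi', 'apple', 'banana']

# A lambda works too, for example sorting by the last character
words.sort(key=lambda w: w[-1])
print(words) # ['banana', 'apple', 'fig', 'kiwi']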

EXERCISE 5: GENERATING A RANDOM LIST
In this exercise, we will be generating a list with random numbers:

1. Import the random library:

import random

2. Use the randint function to generate random integers and add them to a list:

list_1 = [random.randint(0, 30) for x in range(0, 100)]

3. Print the list using print(list_1). Note that there will be duplicate values in list_1:

list_1

The sample output is as follows:

Figure 1.11: Section of the sample output for list_1

There are many ways to get a list of unique numbers, and while you may be able to write a few lines of code using a for loop and another list (you should actually try doing it; one possible version is sketched below), let's see how we can do this without a for loop and with a single line of code. This will bring us to the next data structure, sets.
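For reference, a minimal sketch of the for-loop approach mentioned above (an illustration, not the book's official solution) could look like this; the set-based one-liner in the next section achieves the same result far more concisely:

import random

list_1 = [random.randint(0, 30) for x in range(0, 100)]

# Build a second list, adding each value only if we have not seen it before
unique_values = []
for value in list_1:
    if value not in unique_values:
        unique_values.append(value)

print(unique_values)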

ACTIVITY 1: HANDLING LISTS
In this activity, we will generate a list of random numbers and then generate another list from the first one, which only contains numbers that are divisible by three. We will repeat the experiment three times. Then, we will calculate the average difference in length between the two lists.

Th ese ar e t h e st ep s f or c omp l et i ng t h i s ac t i v i t y :
1 . Cr eate a list of 1 00 r andom
nu m ber s.

2 . Cr eate a new list fr om th is


r andom list, w ith nu m ber s
th at ar e div isible by 3.

3 . Calcu late th e length of th ese


tw o lists and stor e th e
differ ence in a new v ar iable.

4 . Using a loop, per for m steps 2


and 3 and find th e differ ence
v ar iable th r ee tim es.

5. Find th e ar ith m etic m ean of


th ese th r ee differ ence v alu es.

Note

The solution for this activity


can be found on page 282.

SETS
A set, mathematically speaking, is just a collection of well-defined, distinct objects. Python gives us a straightforward way to deal with them using its set datatype.

INTRODUCTION TO SETS
With the last list that we generated, we are going to revisit the problem of getting rid of its duplicates. We can achieve that with the following line of code:

list_12 = list(set(list_1))

If we print this, we will see that it only contains unique numbers. We used the set data type to turn the first list into a set, thus getting rid of all the duplicate elements, and then we used the list function on it to turn it into a list from a set once more:

list_12

The output will be as follows:

Figure 1.12: Section of output for list_12

UNION AND INTERSECTION OF SETS
This is what a union between two sets looks like:

Figure 1.13: Venn diagram showing the union of two sets

This simply means take everything from both sets, but take the common elements only once.

We can create these two sets using the following code:

set1 = {"Apple", "Orange", "Banana"}

set2 = {"Pear", "Peach", "Mango", "Banana"}

To find the union of the two sets, the following instruction should be used:

set1 | set2

The output would be as follows:

{'Apple', 'Banana', 'Mango', 'Orange', 'Peach', 'Pear'}

Notice that the common element, Banana, appears only once in the resulting set. The common elements between two sets can be identified by obtaining the intersection of the two sets, as follows:

Figure 1.14: Venn diagram showing the intersection of two sets

We get the intersection of two sets in Python as follows:

set1 & set2

This will give us a set with only one element. The output is as follows:

{'Banana'}

Note
You can also calculate the difference between sets (also known as complements). To find out more, refer to this link: https://docs.python.org/3/tutorial/datastructures.html#sets.
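As a quick illustration of the difference operation mentioned in the preceding note (reusing set1 and set2 from above), the - operator returns the elements of one set that are not in the other:

set1 - set2

The output is as follows (the ordering of elements in a set display may vary):

{'Apple', 'Orange'}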

CREATING NULL SETS
You can create a null (empty) set by creating a set containing no elements. You can do this by using the following code:

null_set_1 = set({})

null_set_1

The output is as follows:

set()

Note, however, that {} on its own creates a dictionary, not a set:

null_set_2 = {}

null_set_2

The output is as follows:

{}

We are going to learn about dictionaries in detail in the next topic.

DICTIONARY
A dictionary is like a list in that it is a collection of several elements. However, a dictionary is a collection of key-value pairs, where the key can be anything that can be hashed. Generally, we use numbers or strings as keys.

To create a dictionary, use the following code:

dict_1 = {"key1": "value1", "key2": "value2"}

dict_1

The output is as follows:

{'key1': 'value1', 'key2': 'value2'}

This is also a valid dictionary:

dict_2 = {"key1": 1, "key2": ["list_element1", 34], "key3": "value3", "key4": {"subkey1": "v1"}, "key5": 4.5}

dict_2

The output is as follows:

{'key1': 1,
 'key2': ['list_element1', 34],
 'key3': 'value3',
 'key4': {'subkey1': 'v1'},
 'key5': 4.5}

The keys must be unique in a dictionary.

EXERCISE 6: ACCESSING AND SETTING VALUES IN A DICTIONARY
In this exercise, we are going to access and set values in a dictionary:

1. Access a particular key in a dictionary:

dict_2["key2"]

This will return the value associated with it, as follows:

['list_element1', 34]

2. Assign a new value to the key:

dict_2["key2"] = "My new value"

3. Define a blank dictionary and then use the key notation to assign values to it:

dict_3 = {}  # Not a null set. It is a dict

dict_3["key1"] = "Value1"

dict_3

The output is as follows:

{'key1': 'Value1'}

EXERCISE 7: ITERATING OVER A DICTIONARY
In this exercise, we are going to iterate over a dictionary:

1. Create dict_1:

dict_1 = {"key1": 1, "key2": ["list_element1", 34], "key3": "value3", "key4": {"subkey1": "v1"}, "key5": 4.5}

2. Use the looping variables k and v:

for k, v in dict_1.items():
    print("{} - {}".format(k, v))

The output is as follows:

key1 - 1

key2 - ['list_element1', 34]

key3 - value3

key4 - {'subkey1': 'v1'}

key5 - 4.5

Note

Notice the difference between how we did the iteration on the list and how we are doing it here.

EXERCISE 8: REVISITING THE UNIQUE VALUED LIST PROBLEM
We will use the fact that dictionary keys cannot be duplicated to generate the unique valued list:

1. First, generate a random list with duplicate values:

list_1 = [random.randint(0, 30) for x in range(0, 100)]

2. Create a unique valued list from list_1:

list(dict.fromkeys(list_1).keys())

The sample output is as follows:

Figure 1.15: Output showing the unique valued list

Here, we have used two useful functions on the dict data type in Python, fromkeys and keys. fromkeys creates a dict where the keys come from the iterable (in this case, a list) and the values default to None, and keys gives us the keys of a dict.

EXERCISE 9: DELETING VALUE FROM DICT
In this exercise, we are going to delete a value from a dict:

1. Create dict_1 with five key-value pairs:

dict_1 = {"key1": 1, "key2": ["list_element1", 34], "key3": "value3", "key4": {"subkey1": "v1"}, "key5": 4.5}

dict_1

The output is as follows:

{'key1': 1,
 'key2': ['list_element1', 34],
 'key3': 'value3',
 'key4': {'subkey1': 'v1'},
 'key5': 4.5}

2. We will use the del function and specify the element:

del dict_1["key2"]

dict_1 now contains the following:

{'key1': 1,
 'key3': 'value3',
 'key4': {'subkey1': 'v1'},
 'key5': 4.5}

Note

The del operator can be used to delete a specific index from a list as well.

EXERCISE 10: DICTIONARY COMPREHENSION
In this final exercise on dicts, we will go over a less commonly used comprehension than the list one: dictionary comprehension. We will also examine two other ways to create a dict, which will be useful in the future.

A dictionary comprehension works exactly the same way as the list one, but we need to specify both the keys and the values:

1. Generate a dict that has 0 to 9 as the keys and the square of each key as the values:

list_1 = [x for x in range(0, 10)]

dict_1 = {x: x**2 for x in list_1}

dict_1

The output is as follows:

{0: 0, 1: 1, 2: 4, 3: 9, 4: 16, 5: 25, 6: 36, 7: 49, 8: 64, 9: 81}

Can you generate a dict using dict comprehension where the keys are from 0 to 9 and the values are the square roots of the keys? This time, we won't use a list. (One possible answer is sketched at the end of this exercise.)

2. Generate a dictionary using the dict function:

dict_2 = dict([('Tom', 100), ('Dick', 200), ('Harry', 300)])

dict_2

The output is as follows:

{'Tom': 100, 'Dick': 200, 'Harry': 300}

You can also generate a dictionary using the dict function with keyword arguments, as follows:

dict_3 = dict(Tom=100, Dick=200, Harry=300)

dict_3

The output is as follows:

{'Tom': 100, 'Dick': 200, 'Harry': 300}

The dict function is pretty versatile, so both of the preceding commands will generate valid dictionaries.
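For the square-root question posed in step 1, one possible sketch (it uses the math module, which this exercise does not otherwise import) is:

import math

dict_sqrt = {x: math.sqrt(x) for x in range(0, 10)}  # keys 0-9, values are their square roots

dict_sqrt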

The strange-looking pair of values that we just noticed, ('Harry', 300), is called a tuple. This is another important fundamental data type in Python. We will learn about tuples in the next topic.

TUPLES
A tuple is another data type in Python. It is sequential in nature and similar to lists.

A tuple consists of values separated by commas, as follows:

tuple_1 = 24, 42, 2.3456, "Hello"

Notice that, unlike lists, we did not open and close square brackets here.

CREATING A TUPLE WITH DIFFERENT CARDINALITIES
This is how we create an empty tuple:

tuple_1 = ()

And this is how we create a tuple with only one value:

tuple_1 = "Hello",

Notice the trailing comma here.

We can nest tuples, similar to lists and dicts, as follows:

tuple_1 = "hello", "there"

tuple_12 = tuple_1, 45, "Sam"

One special thing about tuples is the fact that they are an immutable data type. So, once created, we cannot change their values. We can just access them, as follows:

tuple_1 = "Hello", "World!"

tuple_1[1] = "Universe!"

The last line of code will result in a TypeError, as a tuple does not allow modification.

This makes the use case for tuples a bit different than that of lists, although they look and behave very similarly in a few aspects.

UNPACKING A TUPLE
The term "unpacking a tuple" simply means getting the values contained in the tuple into different variables:

tuple_1 = "Hello", "World"

hello, world = tuple_1

print(hello)

print(world)

The output is as follows:

Hello

World

Of course, as soon as we do that, we can modify the values contained in those variables.

EXERCISE 11: HANDLING TUPLES
1. Create a tuple to demonstrate how tuples are immutable. Unpack it to read all the elements, as follows:

tupleE = "1", "3", "5"

tupleE

The output is as follows:

('1', '3', '5')

2. Try to override a value in the tupleE tuple:

tupleE[1] = "5"

This step will result in a TypeError, as the tuple does not allow modification.

3. Try to unpack the values of tupleE into separate variables. Note that you cannot assign to literals (1, 3, 5 = tupleE would be a SyntaxError); use variable names instead:

one, three, five = tupleE

4. Print the unpacked variables:

print(one)

print(three)

The output is as follows:

1

3

We have mainly seen two different types of data so far. One is represented by numbers; the other is represented by textual data. Whereas numbers have their own tricks, which we will see later, it is time to look into textual data in a bit more detail.
STRINGS
In the final section of this topic, we will learn about strings. Strings in Python are similar to those in any other programming language.

This is a string:

string1 = 'Hello World!'

A string can also be declared in this manner:

string2 = "Hello World 2!"

You can use either single quotes or double quotes to define a string.

EXERCISE 12: ACCESSING STRINGS
Strings in Python behave similarly to lists, apart from one big caveat: strings are immutable, whereas lists are mutable data structures:

1. Create a string called str_1:

str_1 = "Hello World!"

Access the elements of the string by specifying the location of the element, like we did with lists.

2. Access the first member of the string:

str_1[0]

The output is as follows:

'H'

3. Access the character at index 4 of the string:

str_1[4]

The output is as follows:

'o'

4. Access the last member of the string:

str_1[len(str_1) - 1]

The output is as follows:

'!'

5. Access the last member of the string using negative indexing:

str_1[-1]

The output is as follows:

'!'

Each of the preceding operations will give you the character at the specific index.

Note

The method for accessing the elements of a string is like accessing a list.

EXERCISE 13: STRING SLICES
Just like lists, we can slice strings:

1. Create a string, str_1:

str_1 = "Hello World! I am learning data wrangling"

2. Specify the slicing values and slice the string:

str_1[2:10]

The output is this:

'llo Worl'

3. Slice a string by skipping a slice value:

str_1[-31:]

The output is as follows:

'd! I am learning data wrangling'

4. Use negative numbers to slice the string:

str_1[-10:-5]

The output is as follows:

' wran'

STRING FUNCTIONS
To find out the length of a string, we simply use the len function:

str_1 = "Hello World! I am learning data wrangling"

len(str_1)

The length of the string is 41. To convert a string's case, we can use the lower and upper methods:

str_1 = "A COMPLETE UPPER CASE STRING"

str_1.lower()

str_1.upper()

The output (of the last expression) is as follows:

'A COMPLETE UPPER CASE STRING'

To search for a string within a string, we can use the find method:

str_1 = "A complicated string looks like this"

str_1.find("complicated")

str_1.find("hello")  # This will return -1

The output of the last line is -1. Can you figure out whether the find method is case-sensitive or not? Also, what do you think the find method returns when it actually finds the string?
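To answer those questions empirically, here is a small sketch (the search terms are made up): find is case-sensitive, and when it succeeds it returns the index of the first match:

str_1.find("complicated")  # returns 2, the index where the match starts

str_1.find("Complicated")  # returns -1, because find is case-sensitive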

To replace one string with another, we have the replace method. Since we know that a string is an immutable data structure, replace actually returns a new string instead of replacing and returning the actual one:

str_1 = "A complicated string looks like this"

str_1.replace("complicated", "simple")

The output is as follows:

'A simple string looks like this'

You should look up string methods in the standard documentation of Python 3 to discover more about these methods.

EXERCISE 14: SPLIT AND JOIN
These two string methods need separate introductions, as they enable you to convert a string into a list and vice versa:

1. Create a string and convert it to a list using the split method:

str_1 = "Name, Age, Sex, Address"

list_1 = str_1.split(",")

list_1

The preceding code will give you a list similar to the following:

['Name', ' Age', ' Sex', ' Address']

2. Combine this list into another string using the join method:

" | ".join(list_1)

This code will give you a string like this:

'Name | Age | Sex | Address'

With these, we are at the end of the second topic of this chapter. We now have the motivation to learn data wrangling and a solid introduction to the fundamentals of data structures using Python. There is more to this topic, which will be covered in future chapters.

We have designed an activity for you so that you can practice all the skills you have just learned. This small activity should take around 30 to 45 minutes to finish.

ACTIVITY 2: ANALYZE A MULTILINE STRING AND GENERATE THE UNIQUE WORD COUNT
This section will ensure that you have understood the various basic data structures and their manipulation. We will do that by going through an activity that has been designed specifically for this purpose.

In this activity, we will do the following:

Get multiline text and save it in a Python variable

Get rid of all new lines in it using string methods

Get all the unique words and their occurrences from the string

Repeat the step to find all unique words and occurrences, without considering case sensitivity

Note

For the sake of simplicity for this activity, the original text (which can be found at https://www.gutenberg.org/files/1342/1342-h/1342-h.htm) has been pre-processed a bit.

These are the steps to guide you through solving this activity:

1. Create a multiline_text variable by copying the text from the first chapter of Pride and Prejudice.

Note

The first chapter of Pride and Prejudice by Jane Austen has been made available on the GitHub repository at https://github.com/TrainingByPackt/Data-Wrangling-with-Python/blob/master/Chapter01/Activity02/.

2. Find the type and length of the multiline_text string using the type and len commands.

3. Remove all new lines and symbols using the replace function.

4. Find all of the words in multiline_text using the split function.

5. Create a list from this list that will contain only the unique words.

6. Count the number of times each unique word has appeared in the list using the key and value in a dict.

7. Find the top 25 words from the unique words that you have found using the slice function.

You just created, step by step, a unique word counter using all the neat tricks that you learned about in this chapter.

Note

The solution for this activity can be found on page 285.

Summary
In this chapter, we learned what the term data wrangling means. We also got examples from various real-life data science situations where data wrangling is very useful and is used in industry. We moved on to learn about the different built-in data structures that Python has to offer. We got our hands dirty by exploring lists, sets, dictionaries, tuples, and strings. They are the fundamental building blocks of Python data structures, and we need them all the time while working with and manipulating data in Python. We did several small hands-on exercises to learn more about them. We finished this chapter with a carefully designed activity, which let us combine a lot of different tricks from all the different data structures in a real-life situation and observe the interplay between all of them.

In the next chapter, we will learn about advanced data structures in Python and utilize them to solve real-world problems.
Chapter 2
Advanced Data Structures and File Handling
Learning Objectives
By the end of this chapter, you will be able to:

Compare Python's advanced data structures

Utilize data structures to solve real-world problems

Make use of OS file-handling operations

This chapter emphasizes the data structures in Python and the operating system functions that are the foundation of this book.

Introduction
We were introduced to the basic concepts of different fundamental data structures in the last chapter. We learned about the list, set, dict, tuple, and string. They are the building blocks of future chapters and are essential for data science.

However, what we have covered so far were only basic operations on them. They have much more to offer once you learn how to utilize them effectively. In this chapter, we will venture further into the land of data structures. We will learn about advanced operations and manipulations and use these fundamental data structures to represent more complex and higher-level data structures; this is often handy while wrangling data in real life.

In real life, we deal with data that comes from different sources and generally read data from a file or a database. We will be introduced to operations related to files. We will see how to open a file and how many ways there are to do it, how to read data from it, how to write data to it, and how to safely close it once we are done. The last part, which many people tend to ignore, is super important. We often run into very strange and hard-to-track-down bugs in a real-world system just because a process opened a file and did not close it properly. Without further ado, let's begin our journey.
Advanced Data Structures
We will start this chapter by discussing advanced data structures. We will do that by revisiting lists. We will construct a stack and a queue, explore multiple-element membership checking, and throw in a bit of functional programming for good measure. If all of this sounds intimidating, then do not worry. We will get to things step by step, like in the previous chapter, and you will feel confident once you have finished this chapter.

To start this chapter, you have to open an empty notebook. To do that, you can simply input the following command in a shell. It is advised that you first navigate to an empty directory using cd before you enter the command:

docker run -p 8888:8888 -v 'pwd':/notebooks -it rcshubhadeep/packt-data-wrangling-base:latest

Once the Docker container is running, point your browser to http://localhost:8888 and use dw_4_all as the passcode to access the notebook interface.

ITERATOR
We will start off this topic with lists. However, before we get into lists, we will introduce the concept of an iterator. An iterator is an object that implements the next method, meaning an iterator is an object that can iterate over a collection (lists, tuples, dicts, and so on). It is stateful, which means that each time we call the next method, it gives us the next element from the collection. And if there is no further element, then it raises a StopIteration exception.

Note
A StopIteration exception occurs with the iterator's next method when there are no further values to iterate.

If you are familiar with a programming language like C, C++, Java, JavaScript, or PHP, you may have noticed the difference between the for loop implementation in those languages, which consists of three distinct parts (the initiation, the increment, and the termination condition), and the for loop in Python. In Python, we do not use that kind of for loop. What we use in Python is more like a foreach loop: for i in list_1. This is because, under the hood, the for loop uses an iterator, and thus we do not need to do all the extra steps. The iterator does this for us.
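As a minimal sketch of what happens under the hood (this is not part of the original exercise), we can drive an iterator by hand using the built-in iter and next functions:

small_list = [10, 20, 30]

it = iter(small_list)  # obtain an iterator from the list

next(it)  # 10
next(it)  # 20
next(it)  # 30
next(it)  # raises StopIteration, as there are no further elements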
EXERCISE 15: INTRODUCTION TO THE ITERATOR
To generate lists of numbers, we can use different methods:

1. Generate a list that will contain 10,000,000 ones:

big_list_of_numbers = [1 for x in range(0, 10000000)]

2. Check the size of this variable:

from sys import getsizeof

getsizeof(big_list_of_numbers)

The value it will show you will be something around 81528056 (it is in bytes). This is a lot of memory! And the big_list_of_numbers variable is only available once the list comprehension is over. It can also overflow the available system memory if you try too big a number.

3. Use an iterator to reduce memory utilization:

from itertools import repeat

small_list_of_numbers = repeat(1, times=10000000)

getsizeof(small_list_of_numbers)

The last line shows that our small_list_of_numbers is only 56 bytes in size. Also, it is a lazy method, as it did not generate all the elements. It will generate them one by one when asked, thus saving us time. In fact, if you omit the times keyword argument, then you can practically generate an infinite number of 1s.

4. Loop over the newly generated iterator:

for i, x in enumerate(small_list_of_numbers):
    print(x)
    if i > 10:
        break

We use the enumerate function so that we get the loop counter along with the values. This will help us break once we reach a certain value of the counter (10, for example).

The output will be a short run of ones, printed one per line.

5. To look up the definition of any function, type the function name, followed by a ?, and press Shift + Enter in a Jupyter notebook. Run the following code to understand how we can use permutations and combinations with itertools:

from itertools import (permutations, combinations, dropwhile, repeat, zip_longest)

permutations?

combinations?

dropwhile?

repeat?

zip_longest?

STACKS
A stack is a very useful data structure. If you know a bit about CPU internals and how a program gets executed, then you have an idea that a stack is present in many such cases. It is simply a list with one restriction, Last In First Out (LIFO), meaning an element that comes in last goes out first when a value is read from a stack. The following illustration will make this a bit clearer:

Figure 2.1: A stack with two insert elements and one pop operation

As you can see, we have a LIFO strategy to read values from a stack. We will implement a stack using a Python list. Python's lists have a method called pop, which does the exact same pop operation that you can see in the preceding illustration. We will use that to implement a stack.

EXERCISE 16: IMPLEMENTING A STACK IN PYTHON
1. First, define an empty stack:

stack = []

2. Use the append method to add an element to the stack. Thanks to append, the element will always be appended at the end of the list:

stack.append(25)

stack

The output is as follows:

[25]

3. Append another value to the stack:

stack.append(-12)

stack

The output is as follows:

[25, -12]

4. Read a value from our stack using the pop method. This method reads the current last index of the list and returns it to us. It also deletes the index once the read is done:

tos = stack.pop()

tos

The output is as follows:

-12

After we execute the preceding code, we will have -12 in tos and the stack will have only one element in it, 25.

5. Append "Hello" to the stack:

stack.append("Hello")

stack

The output is as follows:

[25, 'Hello']

Imagine you are scraping a web page and you want to follow each URL that is present there. If you insert (append) them one by one into a stack while you read the web page, and then pop them one by one and follow the link, then you have a clean and extendable solution to the problem. We will examine part of this task in the next exercise.

EXERCISE 17: IMPLEMENTING A STACK USING USER-DEFINED METHODS
We will continue the topic of the stack from the last exercise. But this time, we will implement the append and pop functions ourselves. The aim of this exercise is twofold. On the one hand, we will implement the stack, this time with a real-life example that also involves knowledge of string methods, and thus serves as a reminder of the last chapter and activity. On the other hand, it will show us a subtle feature of Python and how it handles passing list variables to functions, and will bring us to the next exercise, on functional programming:

1. First, we will define two functions, stack_push and stack_pop. We renamed them so that we do not have a namespace conflict. Also, create a stack called url_stack for later use:

def stack_push(s, value):
    return s + [value]

def stack_pop(s):
    tos = s[-1]
    del s[-1]
    return tos

url_stack = []

2. The first function takes the already existing stack and adds the value at the end of it.

Note

Notice the square brackets around the value, which convert it into a one-element list for the sake of the + operation.

3. The second one reads the value that's currently at the -1 index of the stack, then uses the del operator to delete that index, and finally returns the value it read earlier.

4. Now, we are going to have a string with a few URLs in it. Our job is to analyze the string so that we push the URLs onto the stack one by one as we encounter them, and then finally use a for loop to pop them one by one. Let's take the first line from the Wikipedia article about data science:

wikipedia_datascience = "Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge [https://en.wikipedia.org/wiki/Knowledge] and insights from data [https://en.wikipedia.org/wiki/Data] in various forms, both structured and unstructured,similar to data mining [https://en.wikipedia.org/wiki/Data_mining]"

5. For the sake of the simplicity of this exercise, we have kept the links in square brackets beside the target words.

6. Find the length of the string:

len(wikipedia_datascience)

The output is as follows:

347

7. Convert this string into a list by using the split method of the string, and then calculate its length:

wd_list = wikipedia_datascience.split()

len(wd_list)

The output is as follows:

34

8. Use a for loop to go over each word and check whether it is a URL. To do that, we will use the startswith method of the string, and if it is a URL, then we push it onto the stack:

for word in wd_list:
    if word.startswith("[https://"):
        url_stack = stack_push(url_stack, word[1:-1])
        # Notice the clever use of string slicing

9. Print the values in url_stack:

url_stack

The output is as follows:

['https://en.wikipedia.org/wiki/Knowledge',
 'https://en.wikipedia.org/wiki/Data',
 'https://en.wikipedia.org/wiki/Data_mining']

10. Iterate over the list and print the URLs one by one by using the stack_pop function:

for i in range(0, len(url_stack)):
    print(stack_pop(url_stack))

The output is as follows:

Figure 2.2: Output of the URLs that are printed using a stack

11. Print it again to make sure that the stack is empty after the final for loop:

print(url_stack)

The output is as follows:

[]
We have noticed a strange phenomenon with the stack_pop method. We passed the list variable to it and used the del operator inside the function, but it changed the original variable by deleting the last index each time we call the function. If you are coming from a language like C, C++, or Java, then this is completely unexpected behavior, as in those languages this can only happen if we pass the variable by reference, and it can lead to subtle bugs in Python code. So be careful. In general, it is not a good idea to change a variable's value in place, meaning inside a function. Any variable that's passed to a function should be considered and treated as immutable. This is close to the principles of functional programming. A lambda expression in Python is a way to construct one-line, nameless functions that are, by convention, side effect-free.
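Here is a tiny sketch of the behaviour described above (the function and variable names are made up). The caller's list changes because the function receives a reference to the same list object:

def remove_last(s):
    del s[-1]  # mutates the list that was passed in

my_list = [1, 2, 3]

remove_last(my_list)

print(my_list)  # [1, 2] - the original list has changed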

EXERCISE 18: LAMBDA EXPRESSION
In this exercise, we will use a lambda expression to prove the famous trigonometric identity sin²(x) + cos²(x) = 1:

Figure 2.3: Trigonometric identity

1. Import the math package:

import math

2. Define two functions, my_sine and my_cosine. The reason we are declaring these functions is that the original sin and cos functions from the math package take radians as input, but we are more familiar with degrees. So, we will use a lambda expression to define a nameless one-line function and use it. This lambda function will automatically convert our degree input to radians and then apply sin or cos to it and return the value:

def my_sine():
    return lambda x: math.sin(math.radians(x))

def my_cosine():
    return lambda x: math.cos(math.radians(x))

3. Define sine and cosine for our purpose:

sine = my_sine()

cosine = my_cosine()

math.pow(sine(30), 2) + math.pow(cosine(30), 2)

The output is as follows:

1.0

Notice that we have assigned the return values from both my_sine and my_cosine to two variables, and then used them directly as functions. It is a much cleaner approach than using them explicitly. Notice that we did not explicitly write a return statement inside the lambda function; it is assumed.

EXERCISE 19: LAMBDA EXPRESSION FOR SORTING
Here, the lambda expression will take an input and sort a list according to the values in its tuples. A lambda can take one or more inputs. A lambda expression can also be used to reverse sort by passing the parameter reverse as True:

1. Imagine you're in a data wrangling job where you are confronted with the following list of tuples:

capitals = [("USA", "Washington"), ("India", "Delhi"), ("France", "Paris"), ("UK", "London")]

capitals

The output will be as follows:

[('USA', 'Washington'),
 ('India', 'Delhi'),
 ('France', 'Paris'),
 ('UK', 'London')]

2. Sort this list by the name of the capital of each country, using a simple lambda expression. Use the following code:

capitals.sort(key=lambda item: item[1])

capitals

The output will be as follows:

[('India', 'Delhi'),
 ('UK', 'London'),
 ('France', 'Paris'),
 ('USA', 'Washington')]

As we can see, lambda expressions are powerful if we master them and use them in our data wrangling jobs. They are also side effect-free, meaning that they do not change the values of the variables that are passed to them in place.

EXERCISE 20: MULTI-ELEMENT MEMBERSHIP CHECKING
Here is an interesting problem. Let's imagine a list of a few words scraped from a text corpus you are working with:

1. Create a list_of_words list with words scraped from a text corpus:

list_of_words = ["Hello", "there.", "How", "are", "you", "doing?"]

2. Find out whether this list contains all the elements from another list:

check_for = ["How", "are"]

There exists an elaborate solution, which involves a for loop and a few if-else conditions (and you should try to write it!), but there also exists an elegant Pythonic solution to this problem, which takes one line and uses the all function. The all function returns True if all elements of the iterable are true.

3. Use the in keyword to check the membership of each element of check_for in list_of_words:

all(w in list_of_words for w in check_for)

The output is as follows:

True

It is indeed elegant and simple to reason about, and this neat trick is very important when dealing with lists.

QUEUE
Apart from stacks, another high-level data structure that we are interested in is the queue. A queue is like a stack, meaning that you continue adding elements one by one. With a queue, however, the reading of elements obeys a FIFO (First In First Out) strategy. Check out the following diagram to understand this better:

Figure 2.4: Pictorial representation of a queue

We will accomplish this first using plain list methods, and we will show you that, for this purpose, they are inefficient. Then, we will learn about the deque data structure from the collections module of Python.

EXERCISE 21: IMPLEMENTING A QUEUE IN PYTHON
1. Create a Python queue with the plain list methods:

%%time

queue = []

for i in range(0, 100000):
    queue.append(i)

print("Queue created")

The output is as follows:

Queue created

Wall time: 11 ms

2. Use the pop function to empty the queue and check the items in it:

for i in range(0, 100000):
    queue.pop(0)

print("Queue emptied")

The output is as follows:

Queue emptied

If we use the %%time magic command while executing the preceding code, we will see that it takes a while to finish. On a modern MacBook, with a quad-core processor and 8 GB of RAM, it took around 1.20 seconds to finish. This time is taken because of the pop(0) operation: every time we pop a value from the left of the list (which is the current 0 index), Python has to rearrange all the other elements of the list by shifting them one space to the left. Indeed, it is not a very optimized implementation.

3. Implement the same queue using the deque data structure from Python's collections package:

%%time

from collections import deque

queue2 = deque()

for i in range(0, 100000):
    queue2.append(i)

print("Queue created")

for i in range(0, 100000):
    queue2.popleft()

print("Queue emptied")

The output is as follows:

Queue created

Queue emptied

Wall time: 23 ms

4. With the specialized and optimized queue implementation from Python's standard library, the time taken for this whole operation is only around 23 milliseconds! This is a huge improvement on the previous one.

A queue is a very important data structure. To give one example from real life, we can think about a producer-consumer system design. While doing data wrangling, you will often come across a problem where you must process very big files. One of the ways to deal with this problem is to chunk the contents of the file into smaller parts and then push them onto a queue while creating small, dedicated worker processes, which read off the queue and process one small chunk at a time. This is a very powerful design, and you can even use it efficiently to design huge multi-node data wrangling pipelines.
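As a rough, single-process sketch of that idea (the file name and chunk size are made up, and a real pipeline would hand the chunks to worker processes, for example via the multiprocessing module):

from collections import deque

chunk_queue = deque()
chunk_size = 1024  # lines per chunk, chosen arbitrarily

# Producer: read a large file and push fixed-size chunks of lines onto the queue
with open("very_big_file.txt") as fd:
    chunk = []
    for line in fd:
        chunk.append(line)
        if len(chunk) == chunk_size:
            chunk_queue.append(chunk)
            chunk = []
    if chunk:
        chunk_queue.append(chunk)

# Consumer: process one chunk at a time in FIFO order
while chunk_queue:
    chunk = chunk_queue.popleft()
    print("Processing {} lines".format(len(chunk)))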

We will end the discussion on data structures here. What we discussed here is just the tip of the iceberg. Data structures are a fascinating subject. There are many other data structures that we did not touch on and which, when used efficiently, can offer enormous added value. We strongly encourage you to explore data structures more. Try to learn about linked lists, trees, graphs, tries, and all the different variations of them as much as you can. Not only do they offer the joy of learning, but they are also the secret mega weapons in the arsenal of a data practitioner that you can bring out every time you are challenged with a difficult data wrangling job.

ACTIVITY 3: PERMUTATION, ITERATOR, LAMBDA, LIST
In this activity, we will be using permutations to generate all possible three-digit numbers that can be generated using 0, 1, and 2. Then, we will loop over this iterator, and also use isinstance and assert to make sure that the return types are tuples. Also, use a single line of code involving dropwhile and lambda expressions to convert all the tuples to lists while dropping any leading zeros (for example, (0, 1, 2) becomes [1, 2]). Finally, we will write a function that takes a list like this and returns the actual number contained in it.

These steps will guide you to solve this activity:

1. Look up the definitions of permutations and dropwhile from itertools.

2. Write an expression to generate all the possible three-digit numbers using 0, 1, and 2.

3. Loop over the iterator expression you generated before. Print each element that's returned by the iterator. Use assert and isinstance to make sure that the elements are of the tuple type.

4. Write the loop again, using dropwhile with a lambda expression to drop any leading zeros from the tuples. As an example, (0, 1, 2) will become [1, 2]. Also, cast the output of dropwhile to a list.

5. Check the actual type that dropwhile returns.

6. Combine the preceding code into one block, and this time write a separate function where you will pass the list generated from dropwhile, and the function will return the whole number contained in the list. As an example, if you pass [1, 2] to the function, it will return 12. Make sure that the return type is indeed a number and not a string. Although this task can be achieved using other tricks, we require that you treat the incoming list as a stack in the function and generate the number by reading the individual digits from the stack.

With this activity, we have finished this topic, and we will head over to the next topic, which involves basic file-level operations. But before we leave this topic, we encourage you to think about a solution to the preceding problem without using all the advanced operations and data structures we have used here. You will soon realize how complex the naive solution is, and how much value these data structures and operations bring.

Note
The solution for this activity can be found on page 289.

Basic File Operations in Python
In the previous topic, we investigated a few advanced data structures and also learned neat and useful functional programming methods to manipulate them without side effects. In this topic, we will learn about a few operating system (OS)-level functions in Python. We will concentrate mainly on file-related functions and learn how to open a file, read the data line by line or all at once, and finally how to cleanly close the file we opened. We will apply a few of the techniques we have learned on a file that we will read, to practice our data wrangling skills further.

EXERCISE 22: FILE OPERATIONS
In this exercise, we will learn about the OS module of Python, and we will also see two very useful ways to write and read environment variables. The power of writing and reading environment variables is often very important while designing and developing data wrangling pipelines.

Note
In fact, one of the factors of the famous 12-factor app design is the very idea of storing configuration in the environment. You can check it out at this URL: https://12factor.net/config.

The purpose of the OS module is to give you ways to interact with operating system-dependent functionalities. In general, it is pretty low-level and most of the functions from there are not useful on a day-to-day basis; however, some are worth learning. os.environ is the collection Python maintains with all the present environment variables in your OS. It gives you the power to create new ones. The os.getenv function gives you the ability to read an environment variable:

1. Import the os module:

import os

2. Set a few environment variables:

os.environ['MY_KEY'] = "MY_VAL"

os.getenv('MY_KEY')

The output is as follows:

'MY_VAL'

Print the environment variable when it is not set:

print(os.getenv('MY_KEY_NOT_SET'))

The output is as follows:

None

3. Print the os environment:

print(os.environ)

Note

The output has not been added for security reasons.

After executing the preceding code, you will see that you have successfully printed the value of MY_KEY, and when you tried to print MY_KEY_NOT_SET, it printed None.

FILE HANDLING
In this section, we will learn how to open a file in Python. We will learn about the different modes that we can use and what they stand for. Python has a built-in open function that we will use to open a file. The open function takes a few arguments as input. Among them, the first one, which stands for the name of the file you want to open, is the only one that's mandatory. Everything else has a default value. When you call open, Python uses underlying system-level calls to open a file handler and will return it to the caller.

Usually, a file can be opened either for reading or for writing. If we open a file in one mode, the other operation is not supported. Whereas reading usually means we start to read from the beginning of an existing file, writing can mean either starting a new file and writing from the beginning or opening an existing file and appending to it. Here is a table showing you all the different modes Python supports for opening a file:

Figure 2.5: Modes to read a file

There also exists a deprecated mode, U, which in a Python 3 environment does nothing. One thing we must remember here is that Python will always differentiate between the t and b modes, even if the underlying OS doesn't. This is because in b mode, Python does not try to decode what it is reading and gives us back bytes objects instead, whereas in t mode, it does try to decode the stream and gives us back the string representation.

You can open a file for reading like so:

fd = open("Alice’s Adventures in Wonderland, by Lewis Carroll")

This is opened in rt mode. You can open the same file in binary mode if you want. To open the file in binary mode, use the rb mode:

fd = open("Alice’s Adventures in Wonderland, by Lewis Carroll", "rb")

fd

The output is as follows:

<_io.BufferedReader name='Alice’s Adventures in Wonderland, by Lewis Carroll'>

This is how we open a file for writing:

fd = open("interesting_data.txt", "w")

fd

The output is as follows:

<_io.TextIOWrapper name='interesting_data.txt' mode='w' encoding='cp1252'>

EXERCISE 23: OPENING AND CLOSING A FILE
In this exercise, we will learn how to close an open file. It is very important that we close a file once we open it. A lot of system-level bugs can occur due to a dangling file handler. Once we close a file, no further operations can be performed on that file using that specific file handler:

1. Open a file in binary mode:

fd = open("Alice's Adventures in Wonderland, by Lewis Carroll", "rb")

2. Close the file using close():

fd.close()

3. Python also gives us a closed flag with the file handler. If we print it before closing, then we will see False, whereas if we print it after closing, then we will see True. If our logic checks whether a file is properly closed or not, then this is the flag we want to use. A short sketch follows below.
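A small sketch of the flag described in step 3, reusing the file opened above:

fd = open("Alice's Adventures in Wonderland, by Lewis Carroll", "rb")

print(fd.closed)  # False - the file is still open

fd.close()

print(fd.closed)  # True - no further operations are allowed on this handler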
THE WITH STATEMENT
In this section, we will learn about the with statement in Python and how we can effectively use it in the context of opening and closing files.

The with command is a compound statement in Python. Like any compound statement, with also affects the execution of the code enclosed by it. In the case of with, it is used to wrap a block of code in the scope of what we call a context manager in Python. A detailed discussion of context managers is out of the scope of this exercise and this topic in general, but it is sufficient to say that, thanks to a context manager implemented inside the open call for opening a file in Python, a close call is guaranteed to happen automatically if we wrap it inside a with statement.

Note
There is an entire PEP for with at https://www.python.org/dev/peps/pep-0343/. We encourage you to look into it.

OPENING A FILE USING THE WITH STATEMENT
Open a file using the with statement:

with open("Alice’s Adventures in Wonderland, by Lewis Carroll") as fd:
    print(fd.closed)

print(fd.closed)

The output is as follows:

False

True

If we execute the preceding code, we will see that the first print will end up printing False, whereas the second one will print True. This means that as soon as the control goes out of the with block, the file descriptor is automatically closed.

Note
This is by far the cleanest and most Pythonic way to open a file and obtain a file descriptor for it. We encourage you to use this pattern whenever you need to open a file by yourself.

EXERCISE 24: READING A FILE LINE BY LINE
1. Open a file and then read it line by line, printing each line as we read it:

with open("Alice’s Adventures in Wonderland, by Lewis Carroll", encoding="utf8") as fd:
    for line in fd:
        print(line)

The output is as follows:

Figure 2.6: Screenshot from the Jupyter notebook

2. Looking at the preceding code, we can really see why it is important. With this small snippet of code, you can even open and read files that are many GB in size, line by line, and without flooding or overrunning the system memory!

There is another explicit method in the file descriptor object, called readline, which reads one line at a time from a file.

3. Duplicate the same for loop, just after the first one:

with open("Alice’s Adventures in Wonderland, by Lewis Carroll", encoding="utf8") as fd:
    for line in fd:
        print(line)
    print("Ended first loop")
    for line in fd:
        print(line)

The output is as follows:

Figure 2.7: Section of the open file

EXERCISE 25: WRITE TO A FILE
We will end this topic on file operations by showing you how to write to a file. We will write a few lines to a file and read the file back:

1. Use the write function from the file descriptor object:

data_dict = {"India": "Delhi", "France": "Paris", "UK": "London", "USA": "Washington"}

with open("data_temporary_files.txt", "w") as fd:
    for country, capital in data_dict.items():
        fd.write("The capital of {} is {}\n".format(country, capital))

2. Read the file using the following command:

with open("data_temporary_files.txt", "r") as fd:
    for line in fd:
        print(line)

The output is as follows:

The capital of India is Delhi

The capital of France is Paris

The capital of UK is London

The capital of USA is Washington

3. Use the print function to write to a file using the following command:

data_dict_2 = {"China": "Beijing", "Japan": "Tokyo"}

with open("data_temporary_files.txt", "a") as fd:
    for country, capital in data_dict_2.items():
        print("The capital of {} is {}".format(country, capital), file=fd)

4. Read the file using the following command:

with open("data_temporary_files.txt", "r") as fd:
    for line in fd:
        print(line)

The output is as follows:

The capital of India is Delhi

The capital of France is Paris

The capital of UK is London

The capital of USA is Washington

The capital of China is Beijing

The capital of Japan is Tokyo

Note

In the second case, we did not add an extra newline character, \n, at the end of the string to be written. The print function does that automatically for us.

With this, we will end this topic. Just like the previous topics, we have designed an activity for you to practice your newly acquired skills.

ACTIVITY 4: DESIGN YOUR OWN CSV PARSER
A CSV file is something you will encounter a lot in your life as a data practitioner. A CSV is a comma-separated file where data from a tabular format is generally stored and separated using commas, although other characters can also be used.

In this activity, we will be tasked with building our own CSV reader and parser. Although it is a big task if we try to cover all use cases and edge cases, along with escape characters and all, for the sake of this small activity, we will keep our requirements small. We will assume that there is no escape character, meaning that if you use a comma at any place in your row, it means you are starting a new column. We will also assume that the only function we are interested in is being able to read a CSV file line by line, where each read will generate a new dict with the column names as keys and the row's values as values.

Here is an example:

Figure 2.8: Table with sample data

We can convert the data in the preceding table into a Python dictionary, which would look as follows: {"Name": "Bob", "Age": "24", "Location": "California"}:

1. Import zip_longest from itertools. Create a function that zips header, line, and fillvalue=None (a minimal sketch of this zipping idea is shown after the note below).

2. Open the accompanying sales_record.csv file from the GitHub link by using r mode inside a with block, and first check that it is opened.

3. Read the first line and use string methods to generate a list of all the column names.

4. Start reading the file. Read it line by line.

5. Read each line and pass that line to a function, along with the list of the headers. The work of the function is to construct a dict out of these two and fill up the key:values. Keep in mind that a missing value should result in None.

Note

The solution for this activity can be found on page 291.
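If you get stuck on the row-to-dict step, here is a minimal illustration of how zip_longest pairs headers with values. The column names and the sample row below are taken from the Name/Age/Location example above purely for illustration; this is only a sketch of the idea, not the full activity solution:

from itertools import zip_longest

def row_to_dict(header, line):
    # Split the raw line on commas and pair each value with its header.
    # zip_longest pads the shorter sequence with None, so missing
    # trailing values become None automatically.
    values = line.strip().split(",")
    return dict(zip_longest(header, values, fillvalue=None))

header = ["Name", "Age", "Location"]
print(row_to_dict(header, "Bob,24"))
# {'Name': 'Bob', 'Age': '24', 'Location': None}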

Summary

In this chapter, we learned about the workings of advanced data structures such as stacks and queues. We implemented and manipulated both stacks and queues. We then focused on different methods of functional programming, including iterators, and combined lists and functions together. After this, we looked at OS-level functions and the management of environment variables and files. We also examined a clean way to deal with files, and we created our own CSV parser in the last activity.

In the next chapter, we will be dealing with the three most important libraries, namely NumPy, pandas, and matplotlib.
Chapter 3
Introduction to NumPy, Pandas, and Matplotlib

Learning Objectives

By the end of the chapter, you will be able to:

Create and manipulate one-dimensional and multi-dimensional arrays

Create and manipulate pandas DataFrames and series objects

Plot and visualize numerical data using the Matplotlib library

Apply matplotlib, NumPy, and pandas to calculate descriptive statistics from a DataFrame/matrix

In this chapter, you will learn about the fundamentals of the NumPy, pandas, and matplotlib libraries.

Introduction

In the preceding chapters, we covered some advanced data structures, such as the stack, queue, iterator, and file operations in Python. In this section, we will cover three essential libraries, namely NumPy, pandas, and matplotlib.

NumPy Arrays

In the life of a data scientist, reading and manipulating arrays is of prime importance, and it is also the most frequently encountered task. These arrays could be a one-dimensional list, a multi-dimensional table, or a matrix full of numbers.

The array could be filled with integers, floating-point numbers, Booleans, strings, or even mixed types. However, in the majority of cases, numeric data types are predominant.

Some example scenarios where you will need to handle numeric arrays are as follows:

To read a list of phone numbers and postal codes and extract a certain pattern

To create a matrix with random numbers to run a Monte Carlo simulation on some statistical process

To scale and normalize a sales figure table, with lots of financial and transactional data

To create a smaller table of key descriptive statistics (for example, mean, median, min/max range, variance, inter-quartile ranges) from a large raw data table

To read in and analyze time series data in a one-dimensional array, such as the daily stock price of an organization over a year or daily temperature data from a weather station

In short, arrays and numeric data tables are everywhere. As a data wrangling professional, the importance of the ability to read and process numeric arrays cannot be overstated. In this regard, NumPy arrays will be the most important objects in Python that you need to know about.

NUMPY ARRAY AND FEATURES

NumPy and SciPy are open source add-on modules for Python that provide common mathematical and numerical routines in pre-compiled, fast functions. These have grown into highly mature libraries that provide functionality that meets, or perhaps exceeds, what is associated with common commercial software such as MATLAB or Mathematica.

One of the main advantages of the NumPy module is its ability to handle or create one-dimensional or multi-dimensional arrays. This advanced data structure/class is at the heart of the NumPy package and it serves as the fundamental building block of more advanced classes such as pandas and DataFrame, which we will cover shortly in this chapter.

NumPy arrays are different from common Python lists, since Python lists can be thought of as simple arrays. NumPy arrays are built for vectorized operations that process a lot of numerical data with just a single line of code. Many built-in mathematical functions in NumPy arrays are written in low-level languages such as C or Fortran and pre-compiled for really fast execution.
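To make the idea of vectorization concrete, here is a small, illustrative sketch (the numbers and variable names are made up for this example): the same element-wise squaring is expressed once as a Python comprehension over a list and once as a single NumPy expression over the whole array:

import numpy as np

numbers = list(range(1, 6))

# Plain Python: an explicit loop (comprehension) over each element
squares_list = [x ** 2 for x in numbers]

# NumPy: one vectorized expression applied to the whole array at once
squares_array = np.array(numbers) ** 2

print(squares_list)   # [1, 4, 9, 16, 25]
print(squares_array)  # [ 1  4  9 16 25]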

Note
NumPy arrays are optimized data structures for numerical analysis, and that's why they are so important to data scientists.

EXERCISE 26: CREATING A NUMPY ARRAY (FROM A LIST)

In this exercise, we will create a NumPy array from a list:

1. To work with NumPy, we must import it. By convention, we give it a short name, np, while importing:

import numpy as np

2. Create a list with three elements, 1, 2, and 3:

list_1 = [1,2,3]

3. Use the array function to convert it into an array:

array_1 = np.array(list_1)

We just created a NumPy array object called array_1 from the regular Python list object, list_1.

4. Create an array of floating-type elements 1.2, 3.4, and 5.6:

import array as arr

a = arr.array('d', [1.2, 3.4, 5.6])

print(a)

The output is as follows:

array('d', [1.2, 3.4, 5.6])

5. Let's check the type of the newly created object by using the type function:

type(array_1)

The output is as follows:

numpy.ndarray

6. Use type on list_1:

type(list_1)

The output is as follows:

list

So, this is indeed different from the regular list object.

EXERCISE 27: ADDING TWO NUMPY ARRAYS

This simple exercise will demonstrate the addition of two NumPy arrays, and thereby show the key difference between a regular Python list/array and a NumPy array:

1. Consider list_1 and array_1 from the preceding exercise. If you have changed the Jupyter notebook, you will have to declare them again.

2. Use the + notation to add two list_1 objects and save the result in list_2:

list_2 = list_1 + list_1

print(list_2)

The output is as follows:

[1, 2, 3, 1, 2, 3]

3. Use the same + notation to add two array_1 objects and save the result in array_2:

array_2 = array_1 + array_1

print(array_2)

The output is as follows:

[2 4 6]

Did you notice the difference? The first print shows a list with 6 elements, [1, 2, 3, 1, 2, 3]. But the second print shows another NumPy array (or vector) with the elements [2 4 6], which are just the element-wise sums of array_1.

NumPy arrays are like mathematical objects – vectors. They are built for element-wise operations, that is, when we add two NumPy arrays, we add the first element of the first array to the first element of the second array – there is an element-to-element correspondence in this operation. This is in contrast to Python lists, where the elements are simply appended and there is no element-to-element relation. This is the real power of a NumPy array: they can be treated just like mathematical vectors.

A vector is a collection of numbers that can represent, for example, the coordinates of points in a three-dimensional space or the color channel values (RGB) of a picture. Naturally, relative order is important for such a collection and, as we discussed previously, a NumPy array can maintain such order relationships. That's why they are perfectly suitable to use in numerical computations.

EXERCISE 28: MATHEMATICAL OPERATIONS ON NUMPY ARRAYS

Now that you know that these arrays are like vectors, we will try some mathematical operations on arrays.

NumPy arrays even support element-wise exponentiation. For example, suppose there are two arrays – the elements of the first array will be raised to the power of the elements in the second array:

1. Multiply two arrays using the following command:

print("array_1 multiplied by array_1: ",array_1*array_1)

The output is as follows:

array_1 multiplied by array_1: [1 4 9]

2. Divide two arrays using the following command:

print("array_1 divided by array_1: ",array_1/array_1)

The output is as follows:

array_1 divided by array_1: [1. 1. 1.]

3. Raise one array to the power of the second array using the following command:

print("array_1 raised to the power of array_1: ",array_1**array_1)

The output is as follows:

array_1 raised to the power of array_1: [ 1 4 27]

EXERCISE 29: ADVANCED MATHEMATICAL OPERATIONS ON NUMPY ARRAYS

NumPy has all the built-in mathematical functions that you can think of. Here, we will create a list, convert it into a NumPy array, and then perform some advanced mathematical operations on that array:

1. Create a list with five elements:

list_5 = [i for i in range(1,6)]

print(list_5)

The output is as follows:

[1, 2, 3, 4, 5]

2. Convert the list into a NumPy array by using the following command:

array_5 = np.array(list_5)

array_5

The output is as follows:

array([1, 2, 3, 4, 5])

3. Find the sine value of the array by using the following command:

# sine function
print("Sine: ",np.sin(array_5))

The output is as follows:

Sine: [ 0.84147098 0.90929743 0.14112001 -0.7568025 -0.95892427]

4. Find the logarithmic value of the array by using the following command:

# logarithm
print("Natural logarithm: ",np.log(array_5))
print("Base-10 logarithm: ",np.log10(array_5))
print("Base-2 logarithm: ",np.log2(array_5))

The output is as follows:

Natural logarithm: [0. 0.69314718 1.09861229 1.38629436 1.60943791]

Base-10 logarithm: [0. 0.30103 0.47712125 0.60205999 0.69897 ]

Base-2 logarithm: [0. 1. 1.5849625 2. 2.32192809]

5. Find the exponential value of the array by using the following command:

# Exponential
print("Exponential: ",np.exp(array_5))

The output is as follows:

Exponential: [ 2.71828183 7.3890561 20.08553692 54.59815003 148.4131591 ]

EXERCISE 30: GENERATING ARRAYS USING ARANGE AND LINSPACE

Generation of numerical arrays is a fairly common task. So far, we have been doing this by creating a Python list object and then converting that into a NumPy array. However, we can bypass that and work directly with native NumPy methods.

The arange function creates a series of numbers based on the minimum and maximum bounds you give and the step size you specify. Another function, linspace, creates a series with a fixed number of intermediate points between two extremes:

1. Create a series of numbers using the arange method, by using the following command:

print("A series of numbers:",np.arange(5,16))

The output is as follows:

A series of numbers: [ 5 6 7 8 9 10 11 12 13 14 15]

2. Print numbers using the arange function by using the following command:

print("Numbers spaced apart by 2: ",np.arange(0,11,2))
print("Numbers spaced apart by a floating point number: ",np.arange(0,11,2.5))
print("Every 5th number from 30 in reverse order\n",np.arange(30,-1,-5))

The output is as follows:

Numbers spaced apart by 2: [ 0 2 4 6 8 10]

Numbers spaced apart by a floating point number: [ 0. 2.5 5. 7.5 10. ]

Every 5th number from 30 in reverse order
[30 25 20 15 10 5 0]

3. For linearly spaced numbers, we can use the linspace method, as follows:

print("11 linearly spaced numbers between 1 and 5: ",np.linspace(1,5,11))

The output is as follows:

11 linearly spaced numbers between 1 and 5: [1. 1.4 1.8 2.2 2.6 3. 3.4 3.8 4.2 4.6 5. ]

EXERCISE 31: CREATING MULTI-DIMENSIONAL ARRAYS

So far, we have created only one-dimensional arrays. Now, let's create some multi-dimensional arrays (such as a matrix in linear algebra). Just like we created the one-dimensional array from a simple flat list, we can create a two-dimensional array from a list of lists:

1. Create a list of lists and convert it into a two-dimensional NumPy array by using the following command:

list_2D = [[1,2,3],[4,5,6],[7,8,9]]

mat1 = np.array(list_2D)

print("Type/Class of this object:",type(mat1))

print("Here is the matrix\n----------\n",mat1,"\n----------")

The output is as follows:

Type/Class of this object: <class 'numpy.ndarray'>

Here is the matrix
----------
[[1 2 3]
[4 5 6]
[7 8 9]]
----------

2. Tuples can be converted into multi-dimensional arrays by using the following code:

tuple_2D = np.array([(1.5,2,3), (4,5,6)])

mat_tuple = np.array(tuple_2D)

print(mat_tuple)

The output is as follows:

[[1.5 2.  3. ]
[4.  5.  6. ]]

Thus, we have created multi-dimensional arrays using Python lists and tuples.

EXERCISE 32: THE DIMENSION, SHAPE, SIZE, AND DATA TYPE OF THE TWO-DIMENSIONAL ARRAY

The following methods let you check the dimension, shape, and size of the array. Note that if it's a 3x2 matrix, that is, it has 3 rows and 2 columns, then the shape will be (3, 2), but the size will be 6, as 6 = 3x2:

1. Print the dimension of the matrix using ndim by using the following command:

print("Dimension of this matrix: ",mat1.ndim,sep='')

The output is as follows:

Dimension of this matrix: 2

2. Print the size using size:

print("Size of this matrix: ", mat1.size,sep='')

The output is as follows:

Size of this matrix: 9

3. Print the shape of the matrix using shape:

print("Shape of this matrix: ", mat1.shape,sep='')

The output is as follows:

Shape of this matrix: (3, 3)

4. Print the data type using dtype:

print("Data type of this matrix: ", mat1.dtype,sep='')

The output is as follows:

Data type of this matrix: int32

EXERCISE 33: ZEROS, ONES, RANDOM, IDENTITY MATRICES, AND VECTORS

Now that we are familiar with basic vector (one-dimensional) and matrix data structures in NumPy, we will take a look at how to create special matrices easily. Often, you may have to create matrices filled with zeros, ones, random numbers, or ones in the diagonal:

1. Print a vector of zeros by using the following command:

print("Vector of zeros: ",np.zeros(5))

The output is as follows:

Vector of zeros: [0. 0. 0. 0. 0.]

2. Print a matrix of zeros by using the following command:

print("Matrix of zeros: ",np.zeros((3,4)))

The output is as follows:

Matrix of zeros: [[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]]

3. Print a matrix of fives by using the following command:

print("Matrix of 5's: ",5*np.ones((3,3)))

The output is as follows:

Matrix of 5's: [[5. 5. 5.]
[5. 5. 5.]
[5. 5. 5.]]

4. Print an identity matrix by using the following command:

print("Identity matrix of dimension 2:",np.eye(2))

The output is as follows:

Identity matrix of dimension 2: [[1. 0.]
[0. 1.]]

5. Print an identity matrix with a dimension of 4x4 by using the following command:

print("Identity matrix of dimension 4:",np.eye(4))

The output is as follows:

Identity matrix of dimension 4: [[1. 0. 0. 0.]
[0. 1. 0. 0.]
[0. 0. 1. 0.]
[0. 0. 0. 1.]]

6. Print a matrix of random numbers with shape (4,3) using the randint function:

print("Random matrix of shape (4,3):\n",np.random.randint(low=1,high=10,size=(4,3)))

The sample output is as follows:

Random matrix of shape (4,3):
[[6 7 6]
[5 6 7]
[5 3 6]
[2 9 4]]

Note

When creating matrices, you need to pass tuples of integers as arguments.

Random number generation is a very useful utility that needs to be mastered for data science/data wrangling tasks. We will look at the topic of random variables and distributions again in the section on statistics and see how NumPy and pandas have built-in random number and series generation, as well as manipulation functions.
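When you need your random draws to be reproducible (for example, so that a notebook produces the same matrices every time it is run), you can fix the seed of NumPy's random number generator first. This is a small optional sketch, not part of the exercise, and the seed value is arbitrary:

np.random.seed(42)  # any fixed integer makes the following draws repeatable
print(np.random.randint(low=1, high=10, size=(2,3)))

np.random.seed(42)  # resetting the same seed reproduces the same matrix
print(np.random.randint(low=1, high=10, size=(2,3)))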

EXERCISE 34: RESHAPING, RAVEL, MIN, MAX, AND SORTING

Reshaping an array is a very useful operation for vectors, as machine learning algorithms may demand input vectors in various formats for mathematical manipulation. In this section, we will look at how reshaping can be done on an array. The opposite of reshape is the ravel function, which flattens any given array into a one-dimensional array. It is a very useful action in many machine learning and data analytics tasks.

We will first generate a random one-dimensional vector of two-digit numbers and then reshape the vector into multi-dimensional arrays:

1. Create an array of 30 random integers (sampled from 1 to 99) and reshape it into two different forms using the following code:

a = np.random.randint(1,100,30)

b = a.reshape(2,3,5)

c = a.reshape(6,5)

2. Print the shapes using the shape attribute by using the following code:

print("Shape of a:", a.shape)

print("Shape of b:", b.shape)

print("Shape of c:", c.shape)

The output is as follows:

Shape of a: (30,)
Shape of b: (2, 3, 5)
Shape of c: (6, 5)

3. Print the arrays a, b, and c using the following code:

print("\na looks like\n",a)

print("\nb looks like\n",b)

print("\nc looks like\n",c)

The sample output is as follows:

a looks like
[ 7 82 9 29 50 50 71 65 33 84 55 78 40 68 50 15 65 55 98 38 23 75 50 57
32 69 34 59 98 48]

b looks like
[[[ 7 82 9 29 50]
[50 71 65 33 84]
[55 78 40 68 50]]

[[15 65 55 98 38]
[23 75 50 57 32]
[69 34 59 98 48]]]

c looks like
[[ 7 82 9 29 50]
[50 71 65 33 84]
[55 78 40 68 50]
[15 65 55 98 38]
[23 75 50 57 32]
[69 34 59 98 48]]

Note

"b" is a three-dimensional array – a kind of list of a list of a list.

4. Ravel array b using the following code:

b_flat = b.ravel()

print(b_flat)

The sample output is as follows:

[ 7 82 9 29 50 50 71 65 33 84 55 78 40 68 50 15 65 55 98 38 23 75 50 57
32 69 34 59 98 48]

EXERCISE 35: INDEXING AND SLICING

Indexing and slicing of NumPy arrays is very similar to regular list indexing. We can even step through a vector of elements with a definite step size by providing it as an additional argument in the format (start, stop, step). Furthermore, we can pass a list as the argument to select specific elements.

In this exercise, we will learn about indexing and slicing on one-dimensional and multi-dimensional arrays:

Note
In multi-dimensional arrays, you can use two numbers to denote the position of an element. For example, if the element is in the third row and second column, its indices are 2 and 1 (because of Python's zero-based indexing).

1. Create an array of 10 elements and examine its various elements by slicing and indexing the array with slightly different syntaxes. Do this by using the following command:

array_1 = np.arange(0,11)

print("Array:",array_1)

The output is as follows:

Array: [ 0 1 2 3 4 5 6 7 8 9 10]

2. Print the element in the seventh position by using the following command:

print("Element at 7th index is:", array_1[7])

The output is as follows:

Element at 7th index is: 7

3. Print the elements between the third and sixth positions by using the following command:

print("Elements from 3rd to 5th index are:", array_1[3:6])

The output is as follows:

Elements from 3rd to 5th index are: [3 4 5]

4. Print the elements up to the fourth position by using the following command:

print("Elements up to 4th index are:", array_1[:4])

The output is as follows:

Elements up to 4th index are: [0 1 2 3]

5. Print the elements backwards by using the following command:

print("Elements from last backwards are:", array_1[-1::-1])

The output is as follows:

Elements from last backwards are: [10 9 8 7 6 5 4 3 2 1 0]

6. Print three elements counted backwards from the end, skipping every other value, by using the following command:

print("3 Elements from last backwards are:", array_1[-1:-6:-2])

The output is as follows:

3 Elements from last backwards are: [10 8 6]

7. Create a new array called array_2 by using the following command:

array_2 = np.arange(0,21,2)

print("New array:",array_2)

The output is as follows:

New array: [ 0 2 4 6 8 10 12 14 16 18 20]

8. Print the elements at the second, fourth, and ninth indices of the array:

print("Elements at 2nd, 4th, and 9th index are:", array_2[[2,4,9]])

The output is as follows:

Elements at 2nd, 4th, and 9th index are: [ 4 8 18]

9. Create a multi-dimensional array by using the following command:

matrix_1 = np.random.randint(10,100,15).reshape(3,5)

print("Matrix of random 2-digit numbers\n ",matrix_1)

The sample output is as follows:

Matrix of random 2-digit numbers
[[21 57 60 24 15]
[53 20 44 72 68]
[39 12 99 99 33]]

10. Access the values using double bracket indexing by using the following command:

print("\nDouble bracket indexing\n")

print("Element in row index 1 and column index 2:", matrix_1[1][2])

The sample output is as follows:

Double bracket indexing

Element in row index 1 and column index 2: 44

11. Access the values using single bracket indexing by using the following command:

print("\nSingle bracket with comma indexing\n")

print("Element in row index 1 and column index 2:", matrix_1[1,2])

The sample output is as follows:

Single bracket with comma indexing

Element in row index 1 and column index 2: 44

12. Access the values in a multi-dimensional array using a row or column by using the following command:

print("\nRow or column extract\n")

print("Entire row at index 2:", matrix_1[2])

print("Entire column at index 3:", matrix_1[:,3])

The sample output is as follows:

Row or column extract

Entire row at index 2: [39 12 99 99 33]

Entire column at index 3: [24 72 99]

13. Print the matrix with the specified row and column indices by using the following command:

print("\nSubsetting sub-matrices\n")

print("Matrix with row indices 1 and 2 and column indices 3 and 4\n", matrix_1[1:3,3:5])

The sample output is as follows:

Subsetting sub-matrices

Matrix with row indices 1 and 2 and column indices 3 and 4
[[72 68]
[99 33]]

14. Print the matrix with the specified row and column indices by using the following command:

print("Matrix with row indices 0 and 1 and column indices 1 and 3\n", matrix_1[0:2, [1,3]])

The sample output is as follows:

Matrix with row indices 0 and 1 and column indices 1 and 3
[[57 24]
[20 72]]

CONDITIONAL SUBSETTING

Conditional subsetting is a way to select specific elements based on some numeric condition. It is almost like a shortened version of a SQL query to subset elements. See the following example:

matrix_1 = np.array(np.random.randint(10,100,15)).reshape(3,5)

print("Matrix of random 2-digit numbers\n",matrix_1)

print ("\nElements greater than 50\n", matrix_1[matrix_1>50])

The sample output is as follows (note that the exact output will be different for you as it is random):

Matrix of random 2-digit numbers
[[71 89 66 99 54]
[28 17 66 35 85]
[82 35 38 15 47]]

Elements greater than 50
[71 89 66 99 54 66 85 82]
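Conditions can also be combined. The following short sketch (with arbitrary thresholds) uses the element-wise boolean operators & (and) and | (or) to subset on more than one condition at once; note that each condition must be wrapped in parentheses:

print("Elements between 30 and 60\n",
      matrix_1[(matrix_1 > 30) & (matrix_1 < 60)])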

EXERCISE 36: ARRAY OPERATIONS (ARRAY-ARRAY, ARRAY-SCALAR, AND UNIVERSAL FUNCTIONS)

NumPy arrays operate just like mathematical matrices, and the operations are performed element-wise.

Create two matrices (multi-dimensional arrays) with random integers and demonstrate element-wise mathematical operations such as addition, subtraction, multiplication, and division. Show the exponentiation (raising a number to a certain power) operation, as follows:

Note
Due to random number generation, your specific output could be different to what is shown here.

1. Create two matrices:

matrix_1 = np.random.randint(1,10,9).reshape(3,3)

matrix_2 = np.random.randint(1,10,9).reshape(3,3)

print("\n1st Matrix of random single-digit numbers\n",matrix_1)

print("\n2nd Matrix of random single-digit numbers\n",matrix_2)

The sample output is as follows (note that the exact output will be different for you as it is random):

1st Matrix of random single-digit numbers
[[6 5 9]
[4 7 1]
[3 2 7]]

2nd Matrix of random single-digit numbers
[[2 3 1]
[9 9 9]
[9 9 6]]

2. Perform addition, multiplication, division, and a linear combination on the matrices:

print("\nAddition\n", matrix_1+matrix_2)

print("\nMultiplication\n", matrix_1*matrix_2)

print("\nDivision\n", matrix_1/matrix_2)

print("\nLinear combination: 3*A - 2*B\n", 3*matrix_1-2*matrix_2)

The sample output is as follows (note that the exact output will be different for you as it is random):

Addition
[[ 8 8 10]
[13 16 10]
[12 11 13]]

Multiplication
[[12 15 9]
[36 63 9]
[27 18 42]]

Division
[[3. 1.66666667 9. ]
[0.44444444 0.77777778 0.11111111]
[0.33333333 0.22222222 1.16666667]]

Linear combination: 3*A - 2*B
[[ 14 9 25]
[ -6 3 -15]
[ -9 -12 9]]

3. Perform the addition of a scalar, the element-wise cube, and the element-wise square root:

print("\nAddition of a scalar (100)\n", 100+matrix_1)

print("\nExponentiation, matrix cubed here\n", matrix_1**3)

print("\nExponentiation, square root using 'pow' function\n",pow(matrix_1,0.5))

The sample output is as follows (note that the exact output will be different for you as it is random):

Addition of a scalar (100)
[[106 105 109]
[104 107 101]
[103 102 107]]

Exponentiation, matrix cubed here
[[216 125 729]
[ 64 343 1]
[ 27 8 343]]

Exponentiation, square root using 'pow' function
[[2.44948974 2.23606798 3. ]
[2. 2.64575131 1. ]
[1.73205081 1.41421356 2.64575131]]

STACKING ARRAYS

Stacking arrays on top of each other (or side by side) is a useful operation for data wrangling. Here is the code:

a = np.array([[1,2],[3,4]])

b = np.array([[5,6],[7,8]])

print("Matrix a\n",a)

print("Matrix b\n",b)

print("Vertical stacking\n",np.vstack((a,b)))

print("Horizontal stacking\n",np.hstack((a,b)))

The output is as follows:

Matrix a
[[1 2]
[3 4]]

Matrix b
[[5 6]
[7 8]]

Vertical stacking
[[1 2]
[3 4]
[5 6]
[7 8]]

Horizontal stacking
[[1 2 5 6]
[3 4 7 8]]
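A closely related and more general function is np.concatenate, which joins arrays along an explicit axis. The following brief sketch reproduces the two stacking results above using the same a and b:

print("Concatenate along rows (axis=0)\n", np.concatenate((a,b), axis=0))
print("Concatenate along columns (axis=1)\n", np.concatenate((a,b), axis=1))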

NumPy has many other advanced features, mainly related to statistics and linear algebra functions, which are used extensively in machine learning and data science tasks. However, not all of that is directly useful for beginner-level data wrangling, so we won't cover it here.

Pandas DataFrames

The pandas library is a Python package that provides fast, flexible, and expressive data structures that are designed to make working with relational or labeled data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis/manipulation tool that's available in any language.

The two primary data structures of pandas, Series (one-dimensional) and DataFrame (two-dimensional), handle the vast majority of typical use cases. Pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other third-party libraries.

EXERCISE 37: CREATING A PANDAS SERIES

In this exercise, we will learn how to create a pandas series object from the data structures that we created previously. If you have imported pandas as pd, then the function to create a series is simply pd.Series:

1. Initialize labels, lists, and a dictionary:

labels = ['a','b','c']

my_data = [10,20,30]

array_1 = np.array(my_data)

d = {'a':10,'b':20,'c':30}

print ("Labels:", labels)

print("My data:", my_data)

print("Dictionary:", d)

The output is as follows:

Labels: ['a', 'b', 'c']

My data: [10, 20, 30]

Dictionary: {'a': 10, 'b': 20, 'c': 30}

2. Import pandas as pd by using the following command:

import pandas as pd

3. Create a series from the my_data list by using the following command:

series_1 = pd.Series(data=my_data)

print(series_1)

The output is as follows:

0 10
1 20
2 30
dtype: int64

4. Create a series from the my_data list along with the labels, as follows:

series_2 = pd.Series(data=my_data, index=labels)

print(series_2)

The output is as follows:

a 10
b 20
c 30
dtype: int64

5. Then, create a series from the NumPy array, as follows:

series_3 = pd.Series(array_1, labels)

print(series_3)

The output is as follows:

a 10
b 20
c 30
dtype: int32

6. Create a series from the dictionary, as follows:

series_4 = pd.Series(d)

print(series_4)

The output is as follows:

a 10
b 20
c 30
dtype: int64

EXERCISE 38: PANDAS SERIES AND DATA HANDLING

The pandas series object can hold many types of data. This is the key to constructing a bigger table where multiple series objects are stacked together to create a database-like entity:

1. Create a pandas series with numerical data by using the following command:

print ("\nHolding numerical data\n",'-'*25, sep='')

print(pd.Series(array_1))

The output is as follows:

Holding numerical data
-------------------------
0 10
1 20
2 30
dtype: int32

2. Create a pandas series with text labels by using the following command:

print ("\nHolding text labels\n",'-'*20, sep='')

print(pd.Series(labels))

The output is as follows:

Holding text labels
--------------------
0 a
1 b
2 c
dtype: object

3. Create a pandas series with functions by using the following command:

print ("\nHolding functions\n",'-'*20, sep='')

print(pd.Series(data=[sum,print,len]))

The output is as follows:

Holding functions
--------------------
0 <built-in function sum>
1 <built-in function print>
2 <built-in function len>
dtype: object

4. Create a pandas series with objects from a dictionary by using the following command:

print ("\nHolding objects from a dictionary\n",'-'*40, sep='')

print(pd.Series(data=[d.keys, d.items, d.values]))

The output is as follows:

Holding objects from a dictionary
----------------------------------------
0 <built-in method keys of dict object at 0x0000...
1 <built-in method items of dict object at 0x000...
2 <built-in method values of dict object at 0x00...
dtype: object

EXERCISE 39: CREATING PANDAS DATAFRAMES

The pandas DataFrame is similar to an Excel table or relational database (SQL) table that consists of three main components: the data, the index (or rows), and the columns. Under the hood, it is a stack of pandas series objects, which are themselves built on top of NumPy arrays. So, all of our previous knowledge of NumPy arrays applies here:

1. Create a simple DataFrame from a two-dimensional matrix of numbers. First, the code draws 20 random integers from the uniform distribution. Then, we reshape it into a (5,4) NumPy array – 5 rows and 4 columns:

matrix_data = np.random.randint(1,10,size=20).reshape(5,4)

2. Define the row labels as ('A','B','C','D','E') and the column labels as ('W','X','Y','Z'):

row_labels = ['A','B','C','D','E']

column_headings = ['W','X','Y','Z']

df = pd.DataFrame(data=matrix_data, index=row_labels,
                  columns=column_headings)

3. The function to create a DataFrame is pd.DataFrame, and it was called in the preceding step. Now, print the result:

print("\nThe data frame looks like\n",'-'*45, sep='')

print(df)

The sample output is as follows:

The data frame looks like
---------------------------------------------
  W X Y Z
A 6 3 3 3
B 1 9 9 4
C 4 3 6 9
D 4 8 6 7
E 6 6 9 1

4. Create a DataFrame from a Python dictionary of some lists of integers by using the following command:

d = {'a':[10,20],'b':[30,40],'c':[50,60]}

5. Pass this dictionary as the data argument to the pd.DataFrame function. Pass on a list of rows or indices. Notice how the dictionary keys became the column names and the values were distributed among multiple rows:

df2 = pd.DataFrame(data=d, index=['X','Y'])

print(df2)

The output is as follows:

    a   b   c
X  10  30  50
Y  20  40  60

Note

The most common way that you will encounter to create a pandas DataFrame will be to read tabular data from a file on your local disk or over the internet – CSV, text, JSON, HTML, Excel, and so on. We will cover some of these in the next chapter.

EXERCISE 40: VIEWING A DATAFRAME PARTIALLY

In the previous section, we used print(df) to print the whole DataFrame. For a large dataset, we would like to print only sections of the data. In this exercise, we will view a part of the DataFrame:

1. Execute the following code to create a DataFrame with 25 rows and fill it with random numbers:

# 25 rows and 4 columns
matrix_data = np.random.randint(1,100,100).reshape(25,4)

column_headings = ['W','X','Y','Z']

df = pd.DataFrame(data=matrix_data, columns=column_headings)

2. Run the following code to view only the first five rows of the DataFrame:

df.head()

The sample output is as follows (note that your output could be different due to randomness):

Figure 3.1: First five rows of the DataFrame

By default, head shows only five rows. If you want to see any specific number of rows, just pass that as an argument.

3. Print the first eight rows by using the following command:

df.head(8)

The sample output is as follows:

Figure 3.2: First eight rows of the DataFrame

Just like head shows the first few rows, tail shows the last few rows.

4. Print the last ten rows of the DataFrame using the tail command, as follows:

df.tail(10)

The sample output is as follows:

Figure 3.3: Last ten rows of the DataFrame

INDEXING AND SLICING COLUMNS

There are two methods for indexing and slicing columns from a DataFrame. They are as follows:

The DOT method

The bracket method

The DOT method accesses a column as if it were an attribute of the DataFrame (for example, df.X); it is handy for quickly picking out a specific column, but it only works when the column name is a valid Python identifier. The bracket method is intuitive and easy to follow. In this method, you can access the data by the generic name/header of the column.

The following code illustrates these concepts. Execute it in your Jupyter notebook:

print("\nThe 'X' column\n",'-'*25, sep='')

print(df['X'])

print("\nType of the column: ", type(df['X']), sep='')

print("\nThe 'X' and 'Z' columns indexed by passing a list\n",'-'*55, sep='')

print(df[['X','Z']])

print("\nType of the pair of columns: ", type(df[['X','Z']]), sep='')

The output is as follows (a screenshot is shown here because the actual column is long):

Figure 3.4: Rows of the 'X' column

This is the output showing the type of the column:

Figure 3.5: Type of the 'X' column

This is the output showing the 'X' and 'Z' columns indexed by passing a list:

Figure 3.6: Rows of the 'X' and 'Z' columns

This is the output showing the type of the pair of columns:

Figure 3.7: Type of the pair of columns

Note
For more than one column, the object turns into a DataFrame. But for a single column, it is a pandas series object.
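To make the difference concrete, here is a brief, illustrative sketch using the df defined above (whose column names are single letters and therefore valid Python identifiers). Attribute-style (DOT) access and bracket access return the same series:

print(df.X.head())           # DOT (attribute-style) access to the 'X' column
print(df['X'].head())        # bracket access to the same column
print(df.X.equals(df['X']))  # True – both return the same pandas series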

INDEXING AND SLICING ROWS

Indexing and slicing rows in a DataFrame can also be done using the following methods:

The label-based loc method

The index-based iloc method

The loc method is intuitive and easy to follow. In this method, you can access the data by the generic name (label) of the row. On the other hand, the iloc method allows you to access the rows by their numerical index. It can be very useful for a large table with thousands of rows, especially when you want to iterate over the table in a loop with a numerical counter. The following code illustrates the concepts of loc and iloc:

matrix_data = np.random.randint(1,10,size=20).reshape(5,4)

row_labels = ['A','B','C','D','E']

column_headings = ['W','X','Y','Z']

df = pd.DataFrame(data=matrix_data, index=row_labels,
                  columns=column_headings)

print("\nLabel-based 'loc' method for selecting row(s)\n",'-'*60, sep='')

print("\nSingle row\n")

print(df.loc['C'])

print("\nMultiple rows\n")

print(df.loc[['B','C']])

print("\nIndex position based 'iloc' method for selecting row(s)\n",'-'*70, sep='')

print("\nSingle row\n")

print(df.iloc[2])

print("\nMultiple rows\n")

print(df.iloc[[1,2]])

The sample output is as follows:

Figure 3.8: Output of the loc and iloc methods

EXERCISE 41: CREATING AND DELETING A NEW COLUMN OR ROW

One of the most common tasks in data wrangling is creating or deleting columns or rows of data from your DataFrame. Sometimes, you want to create a new column based on some mathematical operation or transformation involving the existing columns. This is similar to manipulating database records and inserting a new column based on simple transformations. We show some of these concepts in the following code blocks:

1. Create a new column using the following snippet:

print("\nA column is created by assigning it in relation\n",'-'*75, sep='')

df['New'] = df['X']+df['Z']

df['New (Sum of X and Z)'] = df['X']+df['Z']

print(df)

The sample output is as follows:

Figure 3.9: Output after adding a new column

2. Drop a column using the df.drop method:

print("\nA column is dropped by using df.drop() method\n",'-'*55, sep='')

df = df.drop('New', axis=1)  # Notice the axis=1 option; axis=0 is the default, so it has to be changed to 1

print(df)

The sample output is as follows:

Figure 3.10: Output after dropping a column

3. Drop a specific row using the df.drop method:

df1 = df.drop('A')

print("\nA row is dropped by using df.drop method and axis=0\n",'-'*65, sep='')

print(df1)

The sample output is as follows:

Figure 3.11: Output after dropping a row

The drop method creates a copy of the DataFrame and does not change the original DataFrame.

4. Change the original DataFrame by setting the inplace argument to True:

print("\nAn in-place change can be done by making inplace=True in the drop method\n",'-'*75, sep='')

df.drop('New (Sum of X and Z)', axis=1, inplace=True)

print(df)

A sample output is as follows:

Figure 3.12: Output after using the inplace argument

Note
All the normal operations are not in-place, that is, they do not impact the original DataFrame object but return a copy of the original with the addition (or deletion). The last bit of code shows how to make a change in the existing DataFrame with the inplace=True argument. Please note that this change is irreversible and should be used with caution.

Statistics and Visualization with NumPy and Pandas

One of the great advantages of using libraries such as NumPy and pandas is that a plethora of built-in statistical and visualization methods are available, for which we don't have to search for and write new code. Furthermore, most of these subroutines are written using C or Fortran code (and pre-compiled), making them extremely fast to execute.

REFRESHER OF BASIC DESCRIPTIVE STATISTICS (AND THE MATPLOTLIB LIBRARY FOR VISUALIZATION)

For any data wrangling task, it is quite useful to extract basic descriptive statistics from the data and create some simple visualizations/plots. These plots are often the first step in identifying fundamental patterns as well as oddities (if present) in the data. In any statistical analysis, descriptive statistics is the first step, followed by inferential statistics, which tries to infer the underlying distribution or process from which the data might have been generated.

As inferential statistics is intimately coupled with the machine learning/predictive modeling stage of a data science pipeline, descriptive statistics naturally becomes associated with the data wrangling aspect.

There are two broad approaches for descriptive statistical analysis:

Graphical techniques: Bar plots, scatter plots, line charts, box plots, histograms, and so on

Calculation of central tendency and spread: Mean, median, mode, variance, standard deviation, range, and so on

In this topic, we will demonstrate how you can accomplish both of these tasks using Python. Apart from NumPy and pandas, we will need to learn the basics of another great package – matplotlib – which is the most powerful and versatile visualization library in Python.

EXERCISE 42: INTRODUCTION TO MATPLOTLIB THROUGH A SCATTER PLOT

In this exercise, we will demonstrate the power and simplicity of matplotlib by creating a simple scatter plot from some data about the age, weight, and height of a few people:

1. First, we define simple lists of names, ages, weights (in kgs), and heights (in centimeters):

people = ['Ann','Brandon','Chen','David','Emily','Farook',
          'Gagan','Hamish','Imran','Joseph','Katherine','Lily']

age = [21,12,32,45,37,18,28,52,5,40,48,15]

weight = [55,35,77,68,70,60,72,69,18,65,82,48]

height = [160,135,170,165,173,168,175,159,105,171,155,158]

2. Import the most important module from matplotlib, called pyplot:

import matplotlib.pyplot as plt

3. Create a simple scatter plot of age versus weight:

plt.scatter(age,weight)

plt.show()

The output is as follows:

Figure 3.13: A screenshot of a scatter plot containing age and weight

The plot can be improved by enlarging the figure size, customizing the aspect ratio, adding a title with a proper font size, adding X-axis and Y-axis labels with a customized font size, adding grid lines, changing the Y-axis limit to be between 0 and 100, adding X and Y tick marks, customizing the scatter plot's color, and changing the size of the scatter dots.

4. The code for the improved plot is as follows:

plt.figure(figsize=(8,6))

plt.title("Plot of Age vs. Weight (in kgs)",fontsize=20)

plt.xlabel("Age (years)",fontsize=16)

plt.ylabel("Weight (kgs)",fontsize=16)

plt.grid(True)

plt.ylim(0,100)

plt.xticks([i*5 for i in range(12)],fontsize=15)

plt.yticks(fontsize=15)

plt.scatter(x=age,y=weight,c='orange',s=150,edgecolors='k')

plt.text(x=20,y=85,s="Weights after 18-20 years of age",fontsize=15)

plt.vlines(x=20,ymin=0,ymax=80,linestyles='dashed',color='blue',lw=3)

plt.legend(['Weight in kgs'],loc=2,fontsize=12)

plt.show()

The output is as follows:

Figure 3.14: A screenshot of a scatter plot showing age versus weight

Observe the following:

A tuple, (8,6), is passed as an argument for the figure size.

A list comprehension is used inside xticks to create a customized list of ticks at 0-5-10-…-55.

A newline (\n) character can be used inside the plt.text() function to break up and distribute the text over two lines.

The plt.show() function is used at the very end. The idea is to keep on adding various graphics properties (font, color, axis limits, text, legend, grid, and so on) until you are satisfied and then show the plot with one function. The plot will not be displayed without this last function call.

DEFINITION OF STATISTICAL MEASURES – CENTRAL TENDENCY AND SPREAD

A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data. They are also categorized as summary statistics:

Mean: The mean is the sum of all values divided by the total number of values.

Median: The median is the middle value. It is the value that splits the dataset in half. To find the median, order your data from smallest to largest, and then find the data point that has an equal number of values above it and below it.

Mode: The mode is the value that occurs the most frequently in your dataset. On a bar chart, the mode is the highest bar.

Generally, the mean is a better measure to use for symmetric data and the median is a better measure for data with a skewed (left or right heavy) distribution. For categorical data, you have to use the mode:

Figure 3.15: A screenshot of a curve showing the mean, median, and mode

The spread of the data is a measure of by how much the values in the dataset are likely to differ from the mean of the values. If all the values are close together, then the spread is low; on the other hand, if some or all of the values differ by a large amount from the mean (and each other), then there is a large spread in the data:

Variance: This is the most common measure of spread. Variance is the average of the squares of the deviations from the mean. Squaring the deviations ensures that negative and positive deviations do not cancel each other out.

Standard Deviation: Because variance is produced by squaring the distance from the mean, its unit does not match that of the original data. Standard deviation is a mathematical trick to bring back the parity. It is the positive square root of the variance.
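All of these measures are available as one-line calls in NumPy and pandas. The following is an illustrative sketch on a small made-up sample (the numbers are arbitrary):

data = pd.Series([21, 12, 32, 45, 37, 18, 28, 52, 5, 40, 48, 15])

print("Mean:", data.mean())
print("Median:", data.median())
print("Mode:", data.mode().tolist())   # mode() returns a series, since ties are possible
print("Variance:", data.var())         # sample variance by default (ddof=1)
print("Standard deviation:", data.std())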

RANDOM VARIABLES AND PROBABILITY DISTRIBUTION

A random variable is defined as the value of a given variable that represents the outcome of a statistical experiment or process.

Although it sounds very formal, pretty much everything around us that we can measure can be thought of as a random variable.

The reason behind this is that almost all natural, social, biological, and physical processes are the final outcome of a large number of complex processes, and we cannot know the details of those fundamental processes. All we can do is observe and measure the final outcome.

Typical examples of random variables that are around us are as follows:

The economic output of a nation

The blood pressure of a patient

The temperature of a chemical process in a factory

The number of friends of a person on Facebook

The stock market price of a company

These values can take any discrete or continuous value and they follow a particular pattern (although the pattern may vary over time). Therefore, they can all be classified as random variables.

WHAT IS A PROBABILITY DISTRIBUTION?

A probability distribution is a function that describes the likelihood of obtaining the possible values that a random variable can assume. In other words, the values of a variable vary based on the underlying probability distribution.

Suppose you go to a school and measure the heights of students who have been selected randomly. Height is an example of a random variable here. As you measure height, you can create a distribution of height. This type of distribution is useful when you need to know which outcomes are most likely, the spread of potential values, and the likelihood of different results.

The concepts of central tendency and spread are applicable to a distribution and are used to describe the properties and behavior of a distribution.

Statisticians generally divide all distributions into two broad categories:

Discrete distributions

Continuous distributions

DISCRETE DISTRIBUTIONS

Discrete probability functions are also known as probability mass functions and can assume a discrete number of values. For example, coin tosses and counts of events are discrete functions. You can have only heads or tails in a coin toss. Similarly, if you're counting the number of trains that arrive at a station per hour, you can count 11 or 12 trains, but nothing in-between.

Some prominent discrete distributions are as follows:

The binomial distribution, to model binary data, such as coin tosses

The Poisson distribution, to model count data, such as the count of library book checkouts per hour

The uniform distribution, to model multiple events with the same probability, such as rolling a die

CONTINUOUS DISTRIBUTIONS

Continuous probability functions are also known as probability density functions. You have a continuous distribution if the variable can assume an infinite number of values between any two values. Continuous variables are often measurements on a real number scale, such as height, weight, and temperature.

The most well-known continuous distribution is the normal distribution, which is also known as the Gaussian distribution or the bell curve. This symmetric distribution fits a wide variety of phenomena, such as human height and IQ scores.

The normal distribution is linked to the famous 68-95-99.7 rule, which describes the percentage of data that falls within 1, 2, or 3 standard deviations away from the mean if the data follows a normal distribution. This means that you can quickly look at some sample data, calculate the mean and standard deviation, and have a confidence (a statistical measure of uncertainty) that any future incoming data will fall within those 68%-95%-99.7% boundaries. This rule is widely used in industries, medicine, economics, and social science:

Figure 3.16: Curve showing the normal distribution and the famous 68-95-99.7 rule
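You can verify this rule empirically with a quick simulation. The following is a small, illustrative sketch (the sample size is arbitrary): it draws normally distributed numbers with NumPy and measures the fraction that falls within 1, 2, and 3 standard deviations of the mean:

samples = np.random.normal(loc=0, scale=1, size=100000)
mean, std = samples.mean(), samples.std()

for k in (1, 2, 3):
    within = np.mean(np.abs(samples - mean) < k*std)
    print("Within {} standard deviation(s): {:.1%}".format(k, within))
# The printed fractions should come out close to 68.3%, 95.4%, and 99.7%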
DATA WRANGLING IN STATISTICS AND VISUALIZATION

A good data wrangling professional is expected to encounter a dizzying array of diverse data sources each day. As we explained previously, due to the multitude of complex sub-processes and mutual interactions that give rise to such data, they all fall into the category of discrete or continuous random variables.

It will be extremely difficult and confusing for the data wrangler or data science team if all of this data continues to be treated as completely random and without any shape or pattern. A formal statistical basis must be given to such random data streams, and one of the simplest ways to start that process is to measure their descriptive statistics.

Assigning a stream of data to a particular distribution function (or a combination of many distributions) is actually part of inferential statistics. However, inferential statistics starts only when descriptive statistics is done alongside measuring all the important parameters of the pattern of the data.

Therefore, as the frontline of a data science pipeline, data wrangling must deal with measuring and quantifying such descriptive statistics of the incoming data. Along with the formatted and cleaned-up data, the primary job of a data wrangler is to hand over these measures (and sometimes accompanying plots) to the next team member in analytics.

Plotting and visualization also help a data wrangling team identify potential outliers and misfits in the incoming data stream and help them take appropriate action. We will see some examples of such tasks in the next chapter, where we will identify odd data points by creating scatter plots or histograms and either impute or omit the data points.

USING NUMPY AND PANDAS TO CALCULATE BASIC DESCRIPTIVE STATISTICS ON THE DATAFRAME
Now that we have some basic knowledge of NumPy, pandas, and matplotlib under our belt, we can explore a few additional topics related to these libraries, such as how we can bring them together for advanced data generation, analysis, and visualization.

RANDOM NUMBER GENERATION USING NUMPY
NumPy offers a dizzying array of random number generation utility functions, all of which correspond to various statistical distributions, such as uniform, binomial, Gaussian normal, Beta/Gamma, and chi-square. Most of these functions are extremely useful and appear countless times in advanced statistical data mining and machine learning tasks. Having a solid knowledge of them is strongly encouraged for all readers of this book.

Here, we will discuss three of the most important distributions that may come in handy for data wrangling tasks – uniform, binomial, and Gaussian normal. The goal here is to show examples of simple function calls that can generate one or more random numbers/arrays whenever the user needs them.

Note
The results will be different for each reader when they use these functions, as they are supposed to be random.

EXERCISE 43: GENERATING RANDOM NUMBERS FROM A UNIFORM DISTRIBUTION
In this exercise, we will generate random numbers from a uniform distribution:

1. Generate a random integer between 1 and 10:

x = np.random.randint(1,10)

print(x)

The sample output is as follows (your output could be different):

2. Generate a random integer between 1 and 10, but with size=1 as an argument. This generates a NumPy array of size 1:

x = np.random.randint(1,10,size=1)

print(x)

The sample output is as follows (your output could be different due to the random draw):

[8]

Therefore, we can easily write the code to generate the outcome of a die being thrown (a normal 6-sided die) for 10 trials, as shown in the sketch that follows.
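
A minimal sketch of that dice simulation, reusing the same np.random.randint call (note that the upper bound is exclusive, so we pass 7), could look like this:

dice_rolls = np.random.randint(1, 7, size=10)

print(dice_rolls)

This prints an array of 10 integers between 1 and 6; your values will differ because they are random.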

How about moving away from integers and generating some real numbers? Let's say that we want to generate artificial data for the weights (in kg) of 15 adults, and we can measure accurate weights up to two decimal places.

3. Generate decimal data using the following command:

x = 50+50*np.random.random(size=15)

x = x.round(decimals=2)

print(x)

The sample output is as follows:

[56.24 94.67 50.66 94.36 77.37 53.81 61.47 71.13 59.3  65.3  63.02 65.
 58.21 81.21 91.62]

We are not restricted to one-dimensional arrays only.

4. Generate and show a 3x3 matrix with random numbers between 0 and 1:

x = np.random.rand(3,3)

print(x)

The sample output is as follows (note that your specific output could be different due to randomness):

[[0.99240105 0.9149215  0.04853315]
 [0.8425871  0.11617792 0.77983995]
 [0.82769081 0.57579771 0.11358125]]

EXERCISE 44: GENERATING RANDOM NUMBERS FROM A BINOMIAL DISTRIBUTION AND BAR PLOT
A binomial distribution is the probability distribution of getting a specific number of successes in a specific number of trials of an event with a pre-determined chance or probability.

The most obvious example of this is a coin toss. A fair coin may have an equal chance of heads or tails, but an unfair coin may have a higher chance of heads coming up, or vice versa. We can simulate a coin toss in NumPy in the following manner.

Suppose we have a biased coin where the probability of heads is 0.6. We toss this coin ten times and note down the number of heads turning up each time. That is one trial or experiment. Now, we can repeat this experiment (10 coin tosses) any number of times, say 8 times. Each time, we record the number of heads:

1. The experiment can be simulated using the following code:

x = np.random.binomial(10,0.6,size=8)

print(x)

The sample output is as follows (note that your specific output could be different due to randomness):

[6 6 5 6 5 8 4 5]

2. Plot the result using a bar chart:

plt.figure(figsize=(7,4))

plt.title("Number of successes in coin toss",fontsize=16)

plt.bar(left=np.arange(1,9),height=x)

plt.xlabel("Experiment number",fontsize=15)

plt.ylabel("Number of successes",fontsize=15)

plt.show()

The sample output is as follows:

Figure 3.17: A screenshot of a graph showing the binomial distribution and the bar plot
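
As a side note (not part of the original exercise), a quick sanity check on such an experiment is to compare the average number of successes against the theoretical mean of a binomial distribution, n x p, which here is 10 x 0.6 = 6. Also, in matplotlib 3.0 and later the bar function no longer accepts the left keyword; the x positions are passed as the first argument instead. A minimal sketch of both points follows:

# Empirical mean of successes vs. the theoretical mean n*p = 6
print("Average successes:", x.mean(), "| Theoretical mean:", 10 * 0.6)

# Equivalent bar plot for newer matplotlib versions
# (x positions passed as the first positional argument instead of left=)
plt.bar(np.arange(1, 9), height=x)
plt.show()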

EXERCISE 45: GENERATING RANDOM NUMBERS FROM A NORMAL DISTRIBUTION AND HISTOGRAMS
We discussed the normal distribution in the last topic and mentioned that it is the most important probability distribution because many pieces of natural, social, and biological data follow this pattern closely when the number of samples is large. NumPy provides an easy way to generate random numbers corresponding to this distribution:

1. Draw a single sample from a normal distribution by using the following command:

x = np.random.normal()

print(x)

The sample output is as follows (note that your specific output could be different due to randomness):

-1.2423774071573694

We know that the normal distribution is characterized by two parameters – the mean (µ) and the standard deviation (σ). In fact, the default values for this particular function are µ = 0.0 and σ = 1.0.

Suppose we know that the heights of the teenage (12-16 years) students in a particular school are distributed normally, with a mean height of 155 cm and a standard deviation of 10 cm.

2. Generate a histogram of 100 students by using the following command:

# Code to generate the 100 samples (heights)

heights = np.random.normal(loc=155,scale=10,size=100)

# Plotting code
#-----------------------

plt.figure(figsize=(7,5))

plt.hist(heights,color='orange',edgecolor='k')

plt.title("Histogram of teenaged students' height",fontsize=18)

plt.xlabel("Height in cm",fontsize=15)

plt.xticks(fontsize=15)

plt.yticks(fontsize=15)

plt.show()

The sample output is as follows:

Figure 3.18: Histogram of teenage students' heights

Note the use of the loc parameter for the mean (=155) and the scale parameter for the standard deviation (=10). The size parameter is set to 100 to generate that many samples.
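
As a quick check (an addition, not part of the original exercise), you can verify that the empirical mean and standard deviation of the generated sample land close to the loc and scale values you requested; with only 100 samples, some deviation is expected:

print("Sample mean: {:.2f} cm".format(heights.mean()))
print("Sample std : {:.2f} cm".format(heights.std()))
# Both should be roughly 155 and 10, respectively, but not exactly,
# because the sample size is only 100
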
EXERCISE 46: CALCULATION OF DESCRIPTIVE STATISTICS FROM A DATAFRAME
Recollect the age, weight, and height parameters that we defined for the plotting exercise. Let's put that data in a DataFrame to calculate various descriptive statistics about them.

The best part of working with a pandas DataFrame is that it has a built-in utility function to show all of these descriptive statistics with a single line of code. It does this by using the describe method:

1. Construct a dictionary with the available series data by using the following command:

people_dict = {'People':people,'Age':age,'Weight':weight,'Height':height}

people_df = pd.DataFrame(data=people_dict)

people_df

The output is as follows:
Figure 3.19: Output of the created dictionary

2. Find the number of rows and columns of the DataFrame by executing the following command:

print(people_df.shape)

The output is as follows:

(12, 4)

3. Obtain a simple count (any column can be used for this purpose) by executing the following command:

print(people_df['Age'].count())

The output is as follows:

12

4. Calculate the sum total of age by using the following command:

print(people_df['Age'].sum())

The output is as follows:

353

5. Calculate the mean age by using the following command:

print(people_df['Age'].mean())

The output is as follows:

29.416666666666668

6. Calculate the median weight by using the following command:

print(people_df['Weight'].median())

The output is as follows:

66.5

7. Calculate the maximum height by using the following command:

print(people_df['Height'].max())

The output is as follows:

175

8. Calculate the standard deviation of the weights by using the following command:

print(people_df['Weight'].std())

The output is as follows:

18.45120510148239

Note how we are calling the statistical functions directly from a DataFrame object.

9. To calculate a percentile, we can call a function from NumPy and pass in the particular column (a pandas Series). For example, to calculate the 75th and 25th percentiles of the age distribution and their difference (called the inter-quartile range), use the following code:

pcnt_75 = np.percentile(people_df['Age'],75)

pcnt_25 = np.percentile(people_df['Age'],25)

print("Inter-quartile range: ",pcnt_75-pcnt_25)

The output is as follows:

Inter-quartile range:  24.0

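As an aside (an alternative, not the approach used in the exercise), pandas Series also expose a quantile method, so the same inter-quartile range can be computed without calling NumPy:

iqr = people_df['Age'].quantile(0.75) - people_df['Age'].quantile(0.25)

print("Inter-quartile range: ", iqr)
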
10. Use the describe command to find a detailed description of the DataFrame:

print(people_df.describe())

The output is as follows:

Figure 3.20: Output of the DataFrame using the describe method

Note
This function works only on the columns where numeric data is present. It has no impact on the non-numeric columns, for example, People in this DataFrame.
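
If you do want non-numeric columns such as People to appear in the summary, pandas' describe accepts an include argument; this is an optional extension of the step above, not a replacement for it:

# include='all' adds count, unique, top, and freq for non-numeric columns
# (numeric-only statistics show up as NaN for those columns)
print(people_df.describe(include='all'))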

EXERCISE 47: BUILT-IN PLOTTING UTILITIES
A DataFrame also has built-in plotting utilities that wrap around matplotlib functions and create basic plots of numeric data:

1. Find the histogram of the weights by using the hist function:

people_df['Weight'].hist()

plt.show()

The output is as follows:
Figure 3.21: Histogram of the weights

2. Create a simple scatter plot directly from the DataFrame to plot the relationship between weight and height by using the following command:

people_df.plot.scatter('Weight','Height',s=150,
                       c='orange',edgecolor='k')

plt.grid(True)

plt.title("Weight vs. Height scatter plot",fontsize=18)

plt.xlabel("Weight (in kg)",fontsize=15)

plt.ylabel("Height (in cm)",fontsize=15)

plt.show()

The output is as follows:
Figure 3.22: Weight versus Height scatter plot

Note
You can try regular matplotlib methods around this function call to make your plot pretty.

ACTIVITY 5: GENERATING STATISTICS FROM A CSV FILE
Suppose you are working with the famous Boston housing price dataset (from 1960). This dataset is famous in the machine learning community. Many regression problems can be formulated, and machine learning algorithms can be run, on this dataset. You will perform a basic data wrangling activity (including plotting some trends) on this dataset by reading it in as a pandas DataFrame.

Note
The pandas function for reading a CSV file is read_csv.

These steps will help you complete this activity:

1. Load the necessary libraries.

2. Read in the Boston housing dataset (given as a .csv file) from the local directory.

3. Check the first 10 records. Find the total number of records.

4. Create a smaller DataFrame with columns that do not include CHAS, NOX, B, and LSTAT.

5. Check the last seven records of the new DataFrame you just created.

6. Plot the histograms of all the variables (columns) in the new DataFrame.

7. Plot them all at once using a for loop. Try to add a unique title to each plot.

8. Create a scatter plot of crime rate versus price.

9. Plot using log10(crime) versus price.

10. Calculate some useful statistics, such as mean rooms per dwelling, median age, mean distances to five Boston employment centers, and the percentage of houses with a low price (< $20,000).

A minimal starter sketch for the first few steps is shown after the following note.

Note

The solution for this activity can be found on page 292.
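
The sketch below only covers reading the file and checking the records; the filename Boston_housing.csv is an assumption and should be replaced with the actual CSV provided in the book's code bundle:

import pandas as pd

# Filename is an assumption – use the CSV that ships with the code bundle
boston_df = pd.read_csv("Boston_housing.csv")

# Check the first 10 records and the total number of records
print(boston_df.head(10))
print("Total records:", boston_df.shape[0])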

Summary
In this chapter, we started with the basics of NumPy arrays, including how to create them and their essential properties. We discussed and showed how a NumPy array is optimized for vectorized element-wise operations and differs from a regular Python list. Then, we moved on to practicing various operations on NumPy arrays, such as indexing, slicing, filtering, and reshaping. We also covered special one-dimensional and two-dimensional arrays, such as zeros, ones, identity matrices, and random arrays.

In the second major topic of this chapter, we started with pandas Series objects and quickly moved on to a critically important object – the pandas DataFrame. It is analogous to an Excel or MATLAB sheet or a database table, but with many useful properties for data wrangling. We demonstrated some basic operations on DataFrames, such as indexing, subsetting, row and column addition, and deletion.

Next, we covered the basics of plotting with matplotlib, the most widely used and popular Python library for visualization. Along with plotting exercises, we touched upon refresher concepts of descriptive statistics (such as central tendency and measures of spread) and probability distributions (such as uniform, binomial, and normal).

In the next chapter, we will cover more advanced operations with pandas DataFrames that will come in very handy for day-to-day work in a data wrangling job.
Chapter 4
A Deep Dive into Data Wrangling with Python
Learning Objectives
By the end of this chapter, you will be able to:

Perform subsetting, filtering, and grouping on pandas DataFrames

Apply Boolean filtering and indexing on a DataFrame to choose specific elements

Perform JOIN operations in pandas that are analogous to the SQL command

Identify missing or corrupted data and choose to drop or apply imputation techniques to missing or corrupted data

In this chapter, we will learn about pandas DataFrames in detail.

Introduction
In this chapter, we will learn about several advanced operations involving pandas DataFrames and NumPy arrays. On completing the detailed activity for this chapter, you will have handled real-life datasets and understood the process of data wrangling.

Subsetting, Filtering, and Grouping
One of the most important aspects of data wrangling is to curate the data carefully from the deluge of streaming data that pours into an organization or business entity from various sources. Lots of data is not always a good thing; rather, data needs to be useful and of high quality to be effectively used in downstream activities of a data science pipeline, such as machine learning and predictive model building. Moreover, one data source can be used for multiple purposes, and this often requires different subsets of data to be processed by a data wrangling module. This is then passed on to separate analytics modules.

For example, let's say you are doing data wrangling on US state-level economic output. It is a fairly common scenario that one machine learning model may require data for large and populous states (such as California, Texas, and so on), while another model demands processed data for small and sparsely populated states (such as Montana or North Dakota). As the frontline of the data science process, it is the responsibility of the data wrangling module to satisfy the requirements of both these machine learning models. Therefore, as a data wrangling engineer, you have to filter and group data accordingly (based on the population of the state) before processing it and producing separate datasets as the final output for the separate machine learning models.

Also, in some cases, data sources may be biased, or the measurement may corrupt the incoming data occasionally. It is a good idea to try to filter only the error-free, good data for downstream modeling. From these examples and discussions, it is clear that filtering and grouping/bucketing data is an essential skill for any engineer who is engaged in the task of data wrangling. Let's proceed to learn about a few of these skills with pandas.

EXERCISE 48: LOADING AND EXAMINING A SUPERSTORE'S SALES DATA FROM AN EXCEL FILE
In this exercise, we will load and examine an Excel file.

1. To read an Excel file into pandas, you will need a small package called xlrd to be installed on your system. If you are working from inside this book's Docker container, then this package may not be available the next time you start your container, and you will have to follow the same step again. Use the following code to install the xlrd package:

!pip install xlrd

2. Load the Excel file from GitHub by using the simple pandas method read_excel:

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

df = pd.read_excel("Sample - Superstore.xls")

df.head()

Examine all the columns and check if they are useful for analysis:

Figure 4.1: Output of the Excel file in a DataFrame

On examining the file, we can see that the first column, called Row ID, is not very useful.

3. Drop this column altogether from the DataFrame by using the drop method:

df.drop('Row ID',axis=1,inplace=True)

4. Check the number of rows and columns in the newly created dataset. We will use the shape function here:

df.shape

The output is as follows:

(9994, 20)

We can see that the dataset has 9,994 rows and 20 columns.

SUBSETTING THE DATAFRAME
Subsetting involves the extraction of partial data based on specific columns and rows, as per business needs. Suppose we are interested only in the following information from this dataset: Customer ID, Customer Name, City, Postal Code, and Sales. For demonstration purposes, let's assume that we are only interested in 5 records – rows 5-9. We can subset the DataFrame to extract only this much information using a single line of Python code.

Use the loc method to index the dataset by the names of the columns and the index of the rows:

df_subset = df.loc[
    [i for i in range(5,10)],
    ['Customer ID','Customer Name','City','Postal Code',
     'Sales']]

df_subset

The output is as follows:
Figure 4.2: DataFrame indexed by name of the columns

We need to pass two arguments to the loc method – one for indicating the rows, and another for indicating the columns. These should be Python lists.

For the rows, we have to pass the list [5,6,7,8,9], but instead of writing that explicitly, we use a list comprehension, that is, [i for i in range(5,10)].

Because the columns we are interested in are not contiguous, we cannot just put in a continuous range and need to pass a list containing the specific names. So, the second argument is just a simple list with specific column names.

This example demonstrates the fundamental concept of subsetting a DataFrame based on business requirements.

AN EXAMPLE USE CASE: DETERMINING STATISTICS ON SALES AND PROFIT
This quick section shows a typical use case of subsetting. Suppose we want to calculate descriptive statistics (mean, median, standard deviation, and so on) of records 100-199 for sales and profit. This is how subsetting helps us achieve that:

df_subset = df.loc[[i for i in range(100,200)],['Sales','Profit']]

df_subset.describe()

The output is as follows:
Figure 4.3: Output of descriptive statistics of data

Furthermore, we can create boxplots of the sales and profit figures from this final data, as shown in the sketch after the following figure.

We simply extract records 100-199 and run the describe function on them because we don't want to process all the data! For this particular business question, we are only interested in sales and profit numbers, and therefore we should not take the easy route and run a describe function on all the data. For a real-life dataset, the number of rows and columns could often be in the millions, and we don't want to compute anything that is not asked for in the data wrangling task. We always aim to subset the exact data that needs to be processed and run statistical or plotting functions on that partial data:
Figure 4.4: Boxplot of sales and profit
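
A minimal sketch of the plotting call that could produce such a boxplot from df_subset is shown here; the figsize and title values are illustrative choices, not prescribed by the text:

# Boxplot of the Sales and Profit columns in the 100-record subset
df_subset.plot.box(figsize=(7,5))
plt.title("Boxplot of sales and profit", fontsize=15)
plt.grid(True)
plt.show()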

EXERCISE 49: THE UNIQUE FUNCTION
Before continuing further with filtering methods, let's take a quick detour and explore a super useful function called unique. As the name suggests, this function is used to scan through the data quickly and extract only the unique values in a column or row.

After loading the superstore sales data, you will notice that there are columns like "Country", "State", and "City". A natural question to ask is how many countries/states/cities are present in the dataset:

1. Extract the countries/states/cities for which information is in the database, with one simple line of code, as follows:

df['State'].unique()

The output is as follows:
Figure 4.5: Different states present in the dataset

You will see a list of all the states whose data is present in the dataset.

2. Use the nunique method to count the number of unique values, like so:

df['State'].nunique()

The output is as follows:

49

This returns 49 for this dataset. So, one out of the 50 states in the US does not appear in this dataset.

Similarly, if we run this function on the Country column, we get an array with only one element, United States. Immediately, we can see that we don't need to keep the Country column at all, because there is no useful information in that column except that all the entries are the same. This is how a simple function helped us decide to drop a column altogether – that is, to remove 9,994 pieces of unnecessary data!
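
A minimal sketch of that check and the subsequent drop (assuming the column is named Country, as described above) is as follows:

print(df['Country'].nunique())   # returns 1, since every entry is United States

# The column carries no information, so it could be dropped
# (shown without inplace=True so the original DataFrame is left untouched)
df_no_country = df.drop('Country', axis=1)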

CONDITIONAL SELECTION AND BOOLEAN FILTERING
Often, we don't want to process the whole dataset and would like to select only a partial dataset whose contents satisfy a particular condition. This is probably the most common use case of any data wrangling task.

In the context of our superstore sales dataset, think of these common questions that may arise from the daily activity of the business analytics team:

What are the average sales and profit figures in California?

Which states have the highest and lowest total sales?

Which consumer segment has the most variance in sales/profit?

Among the top 5 states in sales, which shipping mode and product category are the most popular choices?

Countless examples can be given where the business analytics team or the executive management want to glean insight from a particular subset of data that meets certain criteria.

If you have any prior experience with SQL, you will know that these kinds of questions require fairly complex SQL query writing. Remember the WHERE clause?

We will show you how to use conditional subsetting and Boolean filtering to answer such questions.

First, we need to understand the critical concept of boolean indexing. This process essentially accepts a conditional expression as an argument and returns a dataset of booleans in which the TRUE value appears in places where the condition was satisfied. A simple example is shown in the following code. For demonstration purposes, we subset a small dataset of 10 records and 3 columns:

df_subset = df.loc[[i for i in range(10)],['Ship Mode','State','Sales']]

df_subset

The output is as follows:
Figure 4.6: Sample dataset

Now, if we just want to know the records with sales higher than $100, then we can write the following:

df_subset>100

This produces the following boolean DataFrame:

Figure 4.7: Records with sales higher than $100

Note the True and False entries in the Sales column. Values in the Ship Mode and State columns were not impacted by this code because the comparison was with a numerical quantity, and the only numeric column in the original DataFrame was Sales.

Now, let's see what happens if we pass this boolean DataFrame as an index to the original DataFrame:

df_subset[df_subset>100]

The output is as follows:
Figure 4.8: Results after passing the boolean DataFrame as an index to the original DataFrame

The NaN values came from the fact that the preceding code tried to create a DataFrame with only the TRUE indices (in the boolean DataFrame). The values that were TRUE in the boolean DataFrame were retained in the final output DataFrame.

The program inserted NaN values for the rows where data was not available (because they were discarded due to the Sales value being < $100).

Now, we probably don't want to work with this resulting DataFrame with NaN values. We wanted a smaller DataFrame with only the rows where Sales > $100. We can achieve that by simply passing only the Sales column:

df_subset[df_subset['Sales']>100]

This produces the expected result:

Figure 4.9: Results after removing the NaN values

We are not limited to conditional expressions involving numeric quantities only. Let's try to extract high sales values (> $100) for entries that do not involve Colorado.

We can write the following code to accomplish that:

df_subset[(df_subset['State']!='Colorado') & (df_subset['Sales']>100)]

Note the use of a conditional involving a string. In this expression, we are joining two conditionals with an & operator. Both conditions must be wrapped inside parentheses.

The first conditional expression simply matches the entries in the State column to the string Colorado and assigns TRUE/FALSE accordingly. The second conditional is the same as before. Together, joined by the & operator, they extract only those rows for which State is not Colorado and Sales is > $100. We get the following result:

Figure 4.10: Results where State is not Colorado and Sales is higher than $100

Note
Although, in theory, there is no limit on how complex a conditional you can build using individual expressions and the & (LOGICAL AND) and | (LOGICAL OR) operators, it is advisable to create intermediate boolean DataFrames with limited conditional expressions and build your final DataFrame step by step. This keeps the code legible and scalable.
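
For example, a minimal sketch of that step-by-step style applied to the present question might look as follows:

# Build each condition as its own boolean Series first
not_colorado = df_subset['State'] != 'Colorado'
high_sales = df_subset['Sales'] > 100

# Combine the intermediate masks only at the end
df_subset[not_colorado & high_sales]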

EXERCISE 50: SETTING AND RESETTING THE INDEX
Sometimes, we may need to reset or eliminate the default index of a DataFrame and assign a new column as an index:

1. Create the matrix_data, row_labels, and column_headings variables using the following command:

matrix_data = np.matrix(
    '22,66,140;42,70,148;30,62,125;35,68,160;25,62,152')

row_labels = ['A','B','C','D','E']

column_headings = ['Age', 'Height', 'Weight']

2. Create a DataFrame using the matrix_data, row_labels, and column_headings variables:

df1 = pd.DataFrame(data=matrix_data,
                   index=row_labels,
                   columns=column_headings)

print("\nThe DataFrame\n",'-'*25, sep='')

print(df1)

The output is as follows:

Figure 4.11: The original DataFrame

3. Reset the index, as follows:

print("\nAfter resetting index\n",'-'*35, sep='')

print(df1.reset_index())

The output is as follows:
Figure 4.12: DataFrame after resetting the index

4. Reset the index with drop set to True, as follows:

print("\nAfter resetting index with 'drop' option TRUE\n",'-'*45, sep='')

print(df1.reset_index(drop=True))

The output is as follows:

Figure 4.13: DataFrame after resetting the index with the drop option set to True

5. Add a new column using the following command:

print("\nAdding a new column 'Profession'\n",'-'*45, sep='')

df1['Profession'] = "Student Teacher Engineer Doctor Nurse".split()

print(df1)

The output is as follows:

Figure 4.14: DataFrame after adding a new column called Profession

6. Now, set the Profession column as an index using the following code:

print("\nSetting 'Profession' column as index\n",'-'*45, sep='')

print(df1.set_index('Profession'))

The output is as follows:

Figure 4.15: DataFrame after setting Profession as the index

EXERCISE 51: THE GROUPBY METHOD
GroupBy refers to a process involving one or more of the following steps:

Splitting the data into groups based on some criteria

Applying a function to each group independently

Combining the results into a data structure

In many situations, we can split the dataset into groups and do something with those groups. In the apply step, we might wish to do one of the following:

Aggregation: Compute a summary statistic (or statistics) for each group – sum, mean, and so on

Transformation: Perform a group-specific computation and return a like-indexed object – a z-transformation or filling missing data with a value

Filtration: Discard some groups, according to a group-wise computation that evaluates to TRUE or FALSE

There is, of course, a describe method on this GroupBy object, which produces the summary statistics in the form of a DataFrame.
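
As a hedged illustration of the aggregation step described above (an addition to the exercise, not part of it), the agg method lets you compute several summary statistics per group in one call:

# Mean, total, and count of Sales for each State in the superstore data
df.groupby('State')['Sales'].agg(['mean', 'sum', 'count'])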

GroupBy is not limited to a single variable. If you pass multiple variables (as a list), then you will get back a structure that is essentially similar to a pivot table (from Excel). The following is an example where we group together all the states and cities from the whole dataset (the snapshot is a partial view only).

Note
The name GroupBy should be quite familiar to those who have used an SQL-based tool before.

1. Create a 10-record subset using the following command:

df_subset = df.loc[[i for i in range(10)],['Ship Mode','State','Sales']]

2. Create a pandas GroupBy object by grouping on the State column, as follows:

byState = df_subset.groupby('State')

3. Calculate the mean sales figure by state by using the following command:

print("\nGrouping by 'State' column and listing mean sales\n",'-'*50, sep='')

print(byState.mean())

The output is as follows:

Figure 4.16: Output after grouping by state and listing mean sales

4. Calculate the total sales figure by state by using the following command:

print("\nGrouping by 'State' column and listing total sum of sales\n",'-'*50, sep='')

print(byState.sum())

The output is as follows:
Figure 4.17: The output after grouping by state and listing the sum of sales

5. Subset that DataFrame for a particular state and show the statistics:

pd.DataFrame(byState.describe().loc['California'])

The output is as follows:

Figure 4.18: Checking the statistics of a particular state

6. Perform a similar summarization by using the Ship Mode attribute:

df_subset.groupby('Ship Mode').describe().loc[['Second Class','Standard Class']]

The output will be as follows:

Figure 4.19: Checking the sales by summarizing the Ship Mode attribute

Note how pandas groups the data by State first and then by the cities under each state.

7. Display the complete summary statistics of sales by every city in each state – all with two lines of code – by using the following command:

byStateCity = df.groupby(['State','City'])

byStateCity.describe()['Sales']

The output is as follows:
Figure 4.20: Checking the summary statistics of sales

Detecting Outliers and Handling Missing Values
Outlier detection and handling missing values fall under the subtle art of data quality checking. A modeling or data mining process is fundamentally a complex series of computations whose output quality largely depends on the quality and consistency of the input data being fed in. The responsibility of maintaining and gatekeeping that quality often falls on the shoulders of a data wrangling team.

Apart from the obvious issue of poor-quality data, missing data can sometimes wreak havoc with the machine learning (ML) model downstream. A few ML models, like Bayesian learning, are inherently robust to outliers and missing data, but common techniques like decision trees and random forests have an issue with missing data because the fundamental splitting strategy employed by these techniques depends on an individual piece of data and not a cluster. Therefore, it is almost always imperative to impute missing data before handing it over to such an ML model.

Outlier detection is a subtle art. Often, there is no universally agreed definition of an outlier. In a statistical sense, a data point that falls outside a certain range may often be classified as an outlier, but to apply that definition, you need to have a fairly high degree of certainty about the assumptions regarding the nature and parameters of the inherent statistical distribution of the data. It takes a lot of data to build that statistical certainty, and even after that, an outlier may not be just unimportant noise but a clue to something deeper. Let's take an example with some fictitious sales data from an American fast food chain restaurant. If we model the daily sales data as a time series, we observe an unusual spike in the data somewhere around mid-April:
Figure 4.21: Fictitious sales data of an American fast food chain restaurant

A good data scientist or data wrangler should develop curiosity about this data point rather than just rejecting it because it falls outside the statistical range. In the actual anecdote, the sales figure really did spike that day because of an unusual reason. So, the data was real. But just because it was real does not mean it is useful. For the final goal of building a smoothly varying time series model, this one point should not matter and should be rejected. The lesson here is that we cannot reject outliers without paying some attention to them.

Therefore, the key to handling outliers is their systematic and timely detection in an incoming stream of millions of data points, or while reading data from cloud-based storage. In this topic, we will quickly go over some basic statistical tests for detecting outliers and some basic imputation techniques for filling in missing data.

MISSING VALUES IN PANDAS
One of the most useful functions for detecting missing values is isnull. Here, we have a snapshot of a DataFrame called df_missing (sampled partially from the superstore DataFrame we are working with) with some missing values:
Figure 4.22: DataFrame with missing values

Now, if we simply run the following code, we will get a DataFrame that's the same size as the original, with boolean values of TRUE in the places where a NaN was encountered. Therefore, it is simple to test for the presence of any NaN/missing value in any row or column of the DataFrame. You just have to add up the particular row or column of this boolean DataFrame. If the result is greater than zero, then you know there are some TRUE values (because FALSE here is denoted by 0 and TRUE here is denoted by 1) and, correspondingly, some missing values. Try the following snippet:

df_missing = pd.read_excel("Sample - Superstore.xls",sheet_name="Missing")

df_missing

The output is as follows:

Figure 4.23: DataFrame with the Excel values

Use the isnull function on the DataFrame and observe the results:

df_missing.isnull()

Figure 4.24: Output highlighting the missing values

Here is an example of some very simple code to detect, count, and print out the missing values in every column of a DataFrame:

for c in df_missing.columns:
    miss = df_missing[c].isnull().sum()
    if miss>0:
        print("{} has {} missing value(s)".format(c,miss))
    else:
        print("{} has NO missing value!".format(c))

This code scans every column of the DataFrame, calls the isnull function, and sums up the returned object (a pandas Series object, in this case) to count the number of missing values. If the number of missing values is greater than zero, it prints out a message accordingly. The output looks as follows:

Figure 4.25: Output of counting the missing values

EXERCISE 52: FILLING IN THE MISSING VALUES WITH FILLNA
To handle missing values, you should first look for ways not to drop them altogether but to fill them in somehow. The fillna method is a useful function for performing this task on pandas DataFrames. The fillna method may work for string data, but not for numerical columns like sales or profits. So, we should restrict this fixed string replacement to non-numeric, text-based columns only. The pad or ffill option is used to fill forward the data, that is, to copy it from the preceding value in the series.

The mean function can be used to fill using the average of the values in a column:

1. Fill all missing values with the string FILL by using the following command:

df_missing.fillna('FILL')

The output is as follows:
Figure 4.26: Missing values replaced with FILL

2. Fill in the specified columns with the string FILL by using the following command:

df_missing[['Customer','Product']].fillna('FILL')

The output is as follows:

Figure 4.27: Specified columns replaced with FILL

Note

In all of these cases, the function works on a copy of the original DataFrame. So, if you want to make the changes permanent, you have to assign the DataFrames that are returned by these functions to the original DataFrame object.

3. Fill in the values forward using pad or ffill by using the following command:

df_missing['Sales'].fillna(method='ffill')

4. Use backfill or bfill to fill backward, that is, copy from the next value in the series:

df_missing['Sales'].fillna(method='bfill')

The output is as follows:

Figure 4.28: Using forward fill and backward fill to fill in missing data

5. You can also fill using a function of the DataFrame, such as the mean. For example, we may want to fill the missing values in Sales with the average sales amount. Here is how we can do that:

df_missing['Sales'].fillna(df_missing.mean()['Sales'])

The output is as follows:
Figure 4.29: Using the average to fill in missing data

EXERCISE 53: DROPPING MISSING VALUES WITH DROPNA
This function is used to simply drop the rows or columns that contain NaN/missing values. However, there is some choice involved.

If the axis parameter is set to zero, then rows containing missing values are dropped; if the axis parameter is set to one, then columns containing missing values are dropped. These options are useful if we don't want to drop a particular row/column when the NaN values do not exceed a certain percentage.

Two arguments that are useful for the dropna() method are as follows:

The how argument determines if a row or column is removed from a DataFrame when we have at least one NaN or all NaNs

The thresh argument requires that many non-NaN values to keep the row/column

1. To set the axis parameter to zero and drop all rows with missing values, use the following command:

df_missing.dropna(axis=0)

2. To set the axis parameter to one and drop all columns with missing values, use the following command:

df_missing.dropna(axis=1)

Figure 4.30: Dropping rows or columns to handle missing data

3. Drop the values with the axis set to one and thresh set to 10:

df_missing.dropna(axis=1,thresh=10)

The output is as follows:

Figure 4.31: DataFrame with values dropped with axis=1 and thresh=10

All of these methods work on a temporary copy. To make a permanent change, you have to set inplace=True or assign the result to the original DataFrame, that is, overwrite it.

OUTLIER DETECTION USING A SIMPLE STATISTICAL TEST
As we've already discussed, outliers in a dataset can occur due to many factors and in many ways:

Data entry errors

Experimental errors (data extraction related)

Measurement errors due to noise or instrument failure

Data processing errors (data manipulation or mutations due to coding errors)

Sampling errors (extracting or mixing data from wrong or various sources)

It is impossible to pinpoint one universal method for outlier detection. Here, we will show you some simple tricks for numeric data using standard statistical tests.

Box plots may show unusual values. Corrupt two sales values by assigning negative values, as follows:

df_sample = df[['Customer Name','State','Sales','Profit']].sample(n=50).copy()

df_sample['Sales'].iloc[5] = -1000.0

df_sample['Sales'].iloc[15] = -500.0

To plot the box plot, use the following code:

df_sample.plot.box()

plt.title("Boxplot of sales and profit", fontsize=15)

plt.xticks(fontsize=15)

plt.yticks(fontsize=15)

plt.grid(True)

The output is as follows:

Figure 4.32: Boxplot of sales and profit

We can create simple box plots to check for any unusual/nonsensical values. For example, in the preceding example, we intentionally corrupted two sales values to be negative, and they were readily caught in the box plot.

Note that profit may be negative, so those negative points are generally not suspicious. But sales cannot be negative in general, so they are detected as outliers.

We can also create a distribution of a numerical quantity and check for values that lie at the extreme end to see if they are truly part of the data or outliers. For example, if a distribution is almost normal, then any value more than 4 or 5 standard deviations away from the mean may be a suspect, as illustrated in the sketch after the following figure:

Figure 4.33: Value away from the main outliers
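
A minimal sketch of such a test on the corrupted Sales column is shown below; the cutoff of 3 standard deviations is an illustrative choice (the text suggests 4 or 5 for an almost-normal distribution):

sales = df_sample['Sales']

# Standardize the column and flag points far from the mean
z_scores = (sales - sales.mean()) / sales.std()
suspects = df_sample[np.abs(z_scores) > 3]

print(suspects)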

Concatenating, Merging, and Joining
Merging and joining tables or datasets are highly common operations in the day-to-day job of a data wrangling professional. These operations are akin to the JOIN query in SQL for relational database tables. Often, the key data is present in multiple tables, and those records need to be brought into one combined table that matches on that common key. This is an extremely common operation in any type of sales or transactional data, and therefore must be mastered by a data wrangler. The pandas library offers nice and intuitive built-in methods to perform various types of JOIN queries involving multiple DataFrame objects.
EXERCISE 54: CONCATENATION
We will start by learning about the concatenation of DataFrames along various axes (rows or columns). This is a very useful operation as it allows you to grow a DataFrame as new data comes in or as new feature columns need to be inserted into the table:

1. Sample 4 records each to create three DataFrames at random from the original sales dataset we are working with:

df_1 = df[['Customer Name','State','Sales','Profit']].sample(n=4)

df_2 = df[['Customer Name','State','Sales','Profit']].sample(n=4)

df_3 = df[['Customer Name','State','Sales','Profit']].sample(n=4)

2. Create a combined DataFrame with all the rows concatenated by using the following code:

df_cat1 = pd.concat([df_1,df_2,df_3], axis=0)

df_cat1

Figure 4.34: Concatenating DataFrames together

3. You can also try concatenating along the columns, although that does not make any practical sense for this particular example. However, pandas fills in the unavailable values with NaN for that operation:

df_cat2 = pd.concat([df_1,df_2,df_3], axis=1)

df_cat2

Figure 4.35: Output after concatenating the DataFrames

EXERCISE 55: MERGING BY A COMMON KEY
Merging by a common key is an extremely common operation for data tables, as it allows you to rationalize multiple sources of data in one master database – that is, if they have some common features/keys.

This is often the first step in building a large database for machine learning tasks, where daily incoming data may be put into separate tables. However, at the end of the day, the most recent table needs to be merged with the master data table to be fed into the backend machine learning server, which will then update the model and its prediction capacity.

Here, we will show a simple example of an inner join with Customer Name as the key:

1. One DataFrame, df_1, has shipping information associated with the customer name, and another table, df_2, has the product information tabulated. Our goal is to merge these tables into one DataFrame on the common customer name:

df_1 = df[['Ship Date','Ship Mode','Customer Name']][0:4]

df_1

The output is as follows:

Figure 4.36: Entries in table df_1

The second DataFrame is as follows:

df_2 = df[['Customer Name','Product Name','Quantity']][0:4]

df_2

The output is as follows:

Figure 4.37: Entries in table df_2

2. Join these two tables with an inner join by using the following command:

pd.merge(df_1,df_2,on='Customer Name',how='inner')

The output is as follows:

Figure 4.38: Inner join on table df_1 and table df_2

3. Drop the duplicates by using the following command:

pd.merge(df_1,df_2,on='Customer Name',how='inner').drop_duplicates()

The output is as follows:

Figure 4.39: Inner join on table df_1 and table df_2 after dropping the duplicates

4. Extract another small table called df_3 to show the concept of an outer join:

df_3 = df[['Customer Name','Product Name','Quantity']][2:6]

df_3

The output is as follows:
Figure 4.40: Creating table df_3

5. Perform an inner join on df_1 and df_3 by using the following command:

pd.merge(df_1,df_3,on='Customer Name',how='inner').drop_duplicates()

The output is as follows:

Figure 4.41: Merging table df_1 and table df_3 and dropping duplicates

6. Perform an outer join on df_1 and df_3 by using the following command:

pd.merge(df_1,df_3,on='Customer Name',how='outer').drop_duplicates()

The output is as follows:
Figure 4.42: Outer join on table df_1 and table df_3 after dropping the duplicates

Notice how some NaN and NaT values are inserted automatically because no corresponding entries could be found for those records, as those are the entries with unique customer names from their respective tables. NaT represents a Not a Time object, as the objects in the Ship Date column are Timestamp objects.

EXERCISE 56: THE JOIN METHOD
Joining is performed based on index keys and is done by combining the columns of two potentially differently indexed DataFrames into a single one. It offers a faster way to accomplish merging by row indices. This is useful if the records in different tables are indexed differently but represent the same inherent data and you want to merge them into a single table:

1. Create the following tables with the customer name as the index by using the following command:

df_1 = df[['Customer Name','Ship Date','Ship Mode']][0:4]

df_1.set_index(['Customer Name'],inplace=True)

df_1

df_2 = df[['Customer Name','Product Name','Quantity']][2:6]

df_2.set_index(['Customer Name'],inplace=True)

df_2

The outputs are as follows:

Figure 4.43: DataFrames df_1 and df_2

2. Perform a left join on df_1 and df_2 by using the following command:

df_1.join(df_2,how='left').drop_duplicates()

The output is as follows:

Figure 4.44: Left join on table df_1 and table df_2 after dropping the duplicates

3. Perform a right join on df_1 and df_2 by using the following command:

df_1.join(df_2,how='right').drop_duplicates()

The output is as follows:
Figure 4.45: Right join on table df_1 and table df_2 after dropping the duplicates

4. Perform an inner join on df_1 and df_2 by using the following command:

df_1.join(df_2,how='inner').drop_duplicates()

The output is as follows:

Figure 4.46: Inner join on table df_1 and table df_2 after dropping the duplicates

5. Perform an outer join on df_1 and df_2 by using the following command:

df_1.join(df_2,how='outer').drop_duplicates()

The output is as follows:
Figure 4.47: Outer join on table df_1 and table df_2 after dropping the duplicates

Useful Methods of Pandas
In this topic, we will discuss some small utility functions that are offered by pandas so that we can work efficiently with DataFrames. They don't fall under any particular group of functions, so they are mentioned here under the miscellaneous category.

EXERCISE 57: RANDOMIZED SAMPLING
Sampling a random fraction of a big DataFrame is often very useful so that we can practice other methods on it and test our ideas. If you have a database table of 1 million records, then it is not computationally effective to run your test scripts on the full table.

However, you may also not want to extract only the first 100 elements, as the data may have been sorted by a particular key and you may get an uninteresting table back, which may not represent the full statistical diversity of the parent database.

In these situations, the sample method comes in super handy so that we can randomly choose a controlled fraction of the DataFrame:

1 . Specify th e nu m ber of
sam ples th at y ou r equ ir e
fr om th e DataFr am e by
u sing th e follow ing
com m and:

df.sample(n=5)
Th e ou tpu t is as follow s:

Figure 4.48: DataFrame with 5 samples

2 . Specify a definite fr action


(per centage) of data to be
sam pled by u sing th e
follow ing com m and:

df.sample(frac=0.1)

Th e ou tpu t is as follow s:
Figure 4.49: DataFrame with 0.1% data
sampled

You can also ch oose if


sam pling is done w ith
r eplacem ent, th at is,
w h eth er th e sam e r ecor d can
be ch osen m or e th an once.
Th e defau lt r eplace ch oice is
FA LSE, th at is, no r epetition,
and sam pling w ill tr y to
ch oose new elem ents only .

3 . Ch oose th e sam pling by


u sing th e follow ing
com m and:

df.sample(frac=0.1,
replace=True)

Th e ou tpu t is as follow s:
Figure 4.50: DataFrame with 0.1% data sampled and
repetition enabled

THE VALUE_COUNTS METHOD

We discussed the unique method before, which finds and counts the unique records from a DataFrame. Another useful function in a similar vein is value_counts. This function returns an object containing counts of unique values. In the object that is returned, the first element is the most frequently occurring value, and the elements are arranged in descending order of frequency.

Let's consider a practical application of this method to illustrate its utility. Suppose your manager asks you to list the top 10 customers from the big sales database that you have. So, the business question is: which 10 customers' names occur most frequently in the sales table? You can achieve the same with an SQL query if the data is in an RDBMS, but in pandas, this can be done by using one simple function:

df['Customer Name'].value_counts()[:10]

The output is as follows:

Figure 4.51: List of top 10 customers

The value_counts method returns a series of the counts of all unique customer names, sorted by the frequency of the count. By asking for only the first 10 elements of that list, this code returns a series of the 10 most frequently occurring customer names.
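As a related option (not used in the book's example, but a standard value_counts argument), you can ask for relative frequencies instead of raw counts by passing normalize=True. A minimal sketch, assuming the same df DataFrame:

# Proportion of all sales records attributable to each of the top 10 customers
df['Customer Name'].value_counts(normalize=True)[:10]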

PIVOT TABLE FUNCTIONALITY

Similar to groupby, pandas also offers pivot table functionality, which works the same as a pivot table in spreadsheet programs like MS Excel. For example, in this sales database, suppose you want to know the average sales, profit, and quantity sold, by Region and State (two levels of index).

We can extract this information by using one simple piece of code (we sample 100 records first to keep the computation fast and then apply the code):

df_sample = df.sample(n=100)

df_sample.pivot_table(values=['Sales','Quantity','Profit'],index=['Region','State'],aggfunc='mean')

The output is as follows (note that your specific output may be different due to random sampling):

Figure 4.52: Sample of 100 records
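As a brief extension (not part of the book's example), pivot_table can also compute several aggregations at once by passing a list to aggfunc. A minimal sketch, assuming the same df_sample:

# Mean and sum of Sales per Region, shown side by side
df_sample.pivot_table(values=['Sales'], index=['Region'], aggfunc=['mean','sum'])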

EXERCISE 58: SORTING BY COLUMN VALUES – THE SORT_VALUES METHOD

Sorting a table by a particular column is one of the most frequently used operations in the daily work of an analyst. Not surprisingly, pandas provides a simple and intuitive method for sorting called the sort_values method:

1. Take a random sample of 15 records and then show how we can sort by the Sales column and then by both the State and Sales columns together:

df_sample=df[['Customer Name','State','Sales','Quantity']].sample(n=15)

df_sample

The output is as follows:

Figure 4.53: Sample of 15 records

2. Sort the values with respect to Sales by using the following command:

df_sample.sort_values(by='Sales')

The output is as follows:

Figure 4.54: DataFrame with the Sales values sorted

3. Sort the values with respect to State and Sales:

df_sample.sort_values(by=['State','Sales'])

The output is as follows:

Figure 4.55: DataFrame sorted with respect to State and Sales
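Note that sort_values sorts in ascending order by default. A minimal sketch, assuming the same df_sample, of sorting in descending order instead:

# Largest sales first
df_sample.sort_values(by='Sales', ascending=False)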

EXERCISE 59: FLEXIBILITY FOR USER-DEFINED FUNCTIONS WITH THE APPLY METHOD

The pandas library provides great flexibility to work with user-defined functions of arbitrary complexity through the apply method. Much like the native Python apply function, this method accepts a user-defined function and additional arguments, and returns a new column after applying the function to a particular column element-wise.

As an example, suppose we want to create a column of categorical features like high/medium/low based on the sales price column. Note that this is a conversion from a numeric value to a categorical factor (string) based on certain conditions (threshold values of sales):

1. Create a user-defined function, as follows:

def categorize_sales(price):
    if price < 50:
        return "Low"
    elif price < 200:
        return "Medium"
    else:
        return "High"

2. Sample 100 records randomly from the database:

df_sample=df[['Customer Name','State','Sales']].sample(n=100)

df_sample.head(10)

The output is as follows:

Figure 4.56: 100 sample records from the database

3. Use the apply method to apply the categorization function to the Sales column:

Note
We need to create a new column to store the category string values that are returned by the function.

df_sample['Sales Price Category']=df_sample['Sales'].apply(categorize_sales)

df_sample.head(10)

The output is as follows:

Figure 4.57: DataFrame with 10 rows after using the apply function on the Sales column

4. The apply method also works with the built-in native Python functions. For practice, let's create another column for storing the length of the name of the customer. We can do that using the familiar len function:

df_sample['Customer Name Length']=df_sample['Customer Name'].apply(len)

df_sample.head(10)

The output is as follows:

Figure 4.58: DataFrame with a new column

5. Instead of writing out a separate function, we can even insert lambda expressions directly into the apply method for short functions. For example, let's say we are promoting our product and want to show the discounted sales price if the original price is > $200. We can do this using a lambda function and the apply method:

df_sample['Discounted Price']=df_sample['Sales'].apply(lambda x: 0.85*x if x>200 else x)

df_sample.head(10)

The output is as follows:

Figure 4.59: Lambda function

Note
The lambda function contains a conditional, and a discount is applied to those records where the original sales price is > $200.
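The apply method can also operate row-wise on an entire DataFrame by passing axis=1, in which case the function receives one row at a time. A minimal, hypothetical sketch (the Discount Flag column is not part of the book's example), assuming the df_sample built in the previous steps:

# Flag rows where the discounted price differs from the original sales price
df_sample['Discount Flag']=df_sample.apply(lambda row: row['Sales'] != row['Discounted Price'], axis=1)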

ACTIVITY 6: WORKING WITH THE ADULT INCOME DATASET (UCI)

In this activity, you will work with the Adult Income Dataset from the UCI machine learning portal. The Adult Income dataset has been used in many machine learning papers that address classification problems. You will read the data from a CSV file into a pandas DataFrame and practice the advanced data wrangling you learned about in this chapter.

The aim of this activity is to practice various advanced pandas DataFrame operations, for example, subsetting, applying user-defined functions, summary statistics, visualizations, boolean indexing, groupby, and outlier detection, on a real-life dataset. We have the data downloaded as a CSV file on the disk for your ease. However, it is recommended to practice data downloading on your own so that you are familiar with the process.

Here is the URL for the dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/adult/.

Here is the URL for the description of the dataset and the variables: https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names.

These are the steps that will help you solve this activity:

1. Load the necessary libraries.

2. Read the adult income dataset from the following URL: https://github.com/TrainingByPackt/Data-Wrangling-with-Python/blob/master/Chapter04/Activity06/.

3. Create a script that will read a text file line by line.

4. Add a name of Income for the response variable to the dataset.

5. Find the missing values.

6. Create a DataFrame with only age, education, and occupation by using subsetting.

7. Plot a histogram of age with a bin size of 20.

8. Create a function to strip the whitespace characters.

9. Use the apply method to apply this function to all the columns with string values, create a new column, copy the values from this new column to the old column, and drop the new column.

10. Find the number of people who are aged between 30 and 50.

11. Group the records based on age and education to find how the mean age is distributed.

12. Group by occupation and show the summary statistics of age. Find which profession has the oldest workers on average and which profession has the largest share of its workforce above the 75th percentile.

13. Use subset and groupby to find outliers.

14. Plot the values on a bar chart.

15. Merge the data using common keys.

Note

The solution for this activity can be found on page 297.

Summary

In this chapter, we dived deep into the pandas library to learn advanced data wrangling techniques. We started with some advanced subsetting and filtering of DataFrames and rounded this up by learning about boolean indexing and conditional selection of a subset of data. We also covered how to set and reset the index of a DataFrame, especially while initializing it.

Next, we learned about a particular topic that has a deep connection with traditional relational database systems – the groupby method. Then, we dived deep into an important skill for data wrangling – checking for and handling missing data. We showed you how pandas helps in handling missing data using various imputation techniques. We also discussed methods for dropping missing values. Furthermore, methods and usage examples of concatenation and merging of DataFrame objects were shown. We saw the join method and how it compares to a similar operation in SQL.

Lastly, miscellaneous useful methods on DataFrames, such as randomized sampling, unique, value_counts, sort_values, and pivot table functionality, were covered. We also showed an example of running an arbitrary user-defined function on a DataFrame using the apply method.

After learning about the basic and advanced data wrangling techniques with the NumPy and pandas libraries, the natural question of data acquisition arises. In the next chapter, we will show you how to work with a wide variety of data sources, that is, you will learn how to read data in tabular format into pandas from different sources.
Chapter 5

Getting Comfortable with Different Kinds of Data Sources

Learning Objectives

By the end of this chapter, you will be able to:

Read CSV, Excel, and JSON files into pandas DataFrames

Read PDF documents and HTML tables into pandas DataFrames

Perform basic web scraping using powerful yet easy-to-use libraries such as Beautiful Soup

Extract structured and textual information from portals

In this chapter, you will be exposed to real-life data wrangling techniques, as applied to web scraping.

Introduction

So far in this book, we have focused on learning about pandas DataFrame objects as the main data structure for the application of wrangling techniques. Now, we will learn about various techniques by which we can read data into a DataFrame from external sources. Some of those sources could be text-based (CSV, HTML, JSON, and so on), whereas some others could be binary (Excel, PDF, and so on), that is, not in ASCII format. In this chapter, we will also learn how to deal with data that is present in web pages or HTML documents. This holds very high importance in the work of a data practitioner.

Note
Since we have gone through detailed examples of basic operations with NumPy and pandas, in this chapter, we will often skip trivial code snippets such as viewing a table, selecting a column, and plotting. Instead, we will focus on showing code examples for the new topics we aim to learn about here.

Reading Data from Different Text-Based (and Non-Text-Based) Sources

One of the most valued and widely used skills of a data wrangling professional is the ability to extract and read data from a diverse array of sources into a structured format. Modern analytics pipelines depend on their ability to scan and absorb a variety of data sources to build and analyze a pattern-rich model. Such a feature-rich, multi-dimensional model will have high predictive and generalization accuracy, and will be valued by stakeholders and end users alike for any data-driven product.

In the first topic of this chapter, we will go through various data sources and how they can be imported into pandas DataFrames, thus imbuing wrangling professionals with extremely valuable data ingestion knowledge.

DATA FILES PROVIDED WITH THIS CHAPTER

Because this topic is about reading from various data sources, we will use small files of various types in the following exercises. All of the data files are provided along with the Jupyter notebook in the code repository.

LIBRARIES TO INSTALL FOR THIS CHAPTER

Because this chapter deals with reading various file formats, we need the support of additional libraries and software platforms to accomplish our goals.

Execute the following commands in your Jupyter notebook cells (don't forget the ! before each line of code) to install the necessary libraries:

!apt-get update

!apt-get install -y default-jdk

!pip install tabula-py xlrd lxml

EXERCISE 60: READING DATA FROM A CSV FILE WHERE HEADERS ARE MISSING

The pandas library provides a simple, direct method called read_csv to read data in a tabular format from a comma-separated text file, or CSV. This is particularly useful because CSV is a lightweight yet extremely handy data exchange format for many applications, including such domains as machine-generated data. It is not a proprietary format and is therefore universally used by a variety of data-generating sources.

At times, headers may be missing from a CSV file and you may have to add proper headers/column names of your own. Let's have a look at how this can be done:

1. Read the example CSV file (with a proper header) using the following code and examine the resulting DataFrame, as follows:

import numpy as np

import pandas as pd

df1 = pd.read_csv("CSV_EX_1.csv")

df1

The output is as follows:

Figure 5.1: Output of the example CSV file

2. Read a .csv file with no header using a pandas DataFrame:

df2 = pd.read_csv("CSV_EX_2.csv")

df2

The output is as follows:

Figure 5.2: Output of the .csv being read using a DataFrame

Certainly, the top data row has been mistakenly read as the column header. You can specify header=None to avoid this.

3. Read the .csv file by setting header to None, as follows:

df2 = pd.read_csv("CSV_EX_2.csv",header=None)

df2

However, without any header information, you will get back the following output. The default headers will just be some default numeric indices starting from 0:

Figure 5.3: CSV file with a numeric column header

This may be fine for data analysis purposes, but if you want the DataFrame to truly reflect the proper headers, then you will have to add them using the names argument.

4. Add the names argument to get the correct headers:

df2 = pd.read_csv("CSV_EX_2.csv",header=None, names=['Bedroom','Sq.ft','Locality','Price($)'])

df2

Finally, you will get a DataFrame that looks as follows:

Figure 5.4: CSV file with the correct column headers

EXERCISE 61: READING FROM A CSV FILE WHERE DELIMITERS ARE NOT COMMAS

Although CSV stands for comma-separated values, it is fairly common to encounter raw data files where the separator/delimiter is a character other than a comma:

1. Read a .csv file using pandas DataFrames:

df3 = pd.read_csv("CSV_EX_3.csv")

df3

2. The output will be as follows:

Figure 5.5: A DataFrame that has a semicolon as a separator

3. Clearly, the ; separator was not expected, and the reading is flawed. A simple workaround is to specify the separator/delimiter explicitly in the read function:

df3 = pd.read_csv("CSV_EX_3.csv",sep=';')

df3

The output is as follows:

Figure 5.6: Semicolons removed from the DataFrame

EXERCISE 62: BYPASSING THE HEADERS OF A CSV FILE

If your CSV file already comes with headers but you want to bypass them and put in your own, you have to specifically set header=0 to make it happen. If you only set the names variable to your header list, unexpected things can happen:

1. Add names to a .csv file that has headers, as follows:

df4 = pd.read_csv("CSV_EX_1.csv",names=['A','B','C','D'])

df4

The output is as follows:

Figure 5.7: CSV file with headers overlapped

2. To avoid this, set header to zero and provide a names list:

df4 = pd.read_csv("CSV_EX_1.csv",header=0,names=['A','B','C','D'])

df4

The output is as follows:

Figure 5.8: CSV file with defined headers


EXERCISE 63: SKIPPING INITIAL ROWS AND FOOTERS WHEN READING A CSV FILE

Skipping initial rows is a widely useful technique because, most of the time, the first few rows of a CSV data file are metadata about the data source or similar information, which should not be read into the table:

Figure 5.9: Contents of the CSV file

Note
The first two lines in the CSV file are irrelevant data.

1. Read the CSV file and examine the results:

df5 = pd.read_csv("CSV_EX_skiprows.csv")

df5

The output is as follows:

Figure 5.10: DataFrame with an unexpected error

2. Skip the first two rows and read the file:

df5 = pd.read_csv("CSV_EX_skiprows.csv",skiprows=2)

df5

The output is as follows:

Figure 5.11: Expected DataFrame after skipping two rows

3. Similar to skipping the initial rows, it may be necessary to skip the footer of a file. For example, we do not want to read the data at the end of the following file:

Figure 5.12: Contents of the CSV file

We have to use the skipfooter and engine='python' options to enable this. There are two engines for these CSV reader functions – based on C or Python, of which only the Python engine supports the skipfooter option.

4. Use the skipfooter option:

df6 = pd.read_csv("CSV_EX_skipfooter.csv",skiprows=2, skipfooter=1,engine='python')

df6

The output is as follows:

Figure 5.13: DataFrame without a footer


READING ONLY THE FIRST N ROWS (ESPECIALLY USEFUL FOR LARGE FILES)

In many situations, we may not want to read a whole data file but only the first few rows. This is particularly useful for extremely large data files, where we may just want to read the first couple of hundred rows to check an initial pattern and then decide to read the whole data later on. Reading the entire file can take a long time and slow down the entire data wrangling pipeline.

A simple option, called nrows, in the read_csv function enables us to do just that:

df7 = pd.read_csv("CSV_EX_1.csv",nrows=2)

df7

The output is as follows:

Figure 5.14: DataFrame with the first few rows of the CSV file

EXERCISE 64: COMBINING SKIPROWS AND NROWS TO READ DATA IN SMALL CHUNKS

Continuing our discussion about reading a very large data file, we can cleverly combine skiprows and nrows to read such a large file in smaller chunks of pre-determined sizes. The following code demonstrates just that:

1. Create a list where the DataFrames will be stored:

list_of_dataframe = []

2. Store the number of rows to be read into a variable:

rows_in_a_chunk = 10

3. Create a variable to store the number of chunks to be read:

num_chunks = 5

4. Create a dummy DataFrame to get the column names:

df_dummy = pd.read_csv("Boston_housing.csv",nrows=2)

colnames = df_dummy.columns

5. Loop over the CSV file to read only a fixed number of rows at a time:

for i in range(0,num_chunks*rows_in_a_chunk,rows_in_a_chunk):
    df = pd.read_csv("Boston_housing.csv",header=0,skiprows=i,nrows=rows_in_a_chunk,names=colnames)
    list_of_dataframe.append(df)

Note how the iterator variable is set up inside the range function to break the reading into chunks. Say the number of chunks is 5 and the number of rows per chunk is 10. Then, the iterator will have a range of (0, 5*10, 10), where the final 10 is the step size, that is, it will iterate over the starting indices 0, 10, 20, 30, and 40.
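A minimal sketch (assuming the variables defined in the steps above) that verifies the chunk indices and shows one way the chunks could later be recombined, if that is ever needed:

# The starting row index of each chunk
print(list(range(0, num_chunks*rows_in_a_chunk, rows_in_a_chunk)))  # [0, 10, 20, 30, 40]

# Recombine the chunks into a single DataFrame
full_df = pd.concat(list_of_dataframe, ignore_index=True)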

SETTING THE SKIP_BLANK_LINES OPTION

By default, read_csv ignores blank lines. But sometimes, you may want to read them in as NaN so that you can count how many such blank entries were present in the raw data file. In some situations, this is an indicator of the default data streaming quality and consistency. For this, you have to disable the skip_blank_lines option:

df9 = pd.read_csv("CSV_EX_blankline.csv",skip_blank_lines=False)

df9

The output is as follows:

Figure 5.15: DataFrame that has blank rows from a .csv file

READ CSV FROM A ZIP FILE

This is an awesome feature of pandas, in that it allows you to read directly from a compressed file such as .zip, .gz, .bz2, or .xz. The only requirement is that the intended data file (CSV) should be the only file inside the compressed file.

In this example, we compressed the example CSV file with the 7-Zip program and read from it directly using the read_csv method:

df10 = pd.read_csv('CSV_EX_1.zip')

df10

The output is as follows:

Figure 5.16: DataFrame of a compressed CSV

READING FROM AN EXCEL FILE USING SHEET_NAME AND HANDLING A DISTINCT SHEET_NAME

Next, we will turn our attention to Microsoft Excel files. It turns out that most of the options and methods we learned about in the previous exercises with CSV files apply directly to the reading of Excel files too. Therefore, we will not repeat them here. Instead, we will focus on their differences. An Excel file can consist of multiple worksheets, and we can read a specific sheet by passing in a particular argument, that is, sheet_name.

For example, in the associated data file, Housing_data.xlsx, we have three tabs, and the following code reads them one by one into three separate DataFrames:

df11_1 = pd.read_excel("Housing_data.xlsx",sheet_name='Data_Tab_1')

df11_2 = pd.read_excel("Housing_data.xlsx",sheet_name='Data_Tab_2')

df11_3 = pd.read_excel("Housing_data.xlsx",sheet_name='Data_Tab_3')

If the Excel file has multiple distinct sheets but the sheet_name argument is set to None, then an ordered dictionary will be returned by the read_excel function. Thereafter, we can simply iterate over that dictionary or its keys to retrieve the individual DataFrames.

Let's consider the following example:

dict_df = pd.read_excel("Housing_data.xlsx",sheet_name=None)

dict_df.keys()

The output is as follows:

odict_keys(['Data_Tab_1', 'Data_Tab_2', 'Data_Tab_3'])
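As a brief illustration (a minimal sketch, assuming the dict_df dictionary returned above), iterating over the dictionary gives access to each sheet's DataFrame in turn:

# Print the name and shape of each sheet's DataFrame
for sheet_name, sheet_df in dict_df.items():
    print(sheet_name, sheet_df.shape)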

EXERCISE 65: READING A GENERAL DELIMITED TEXT FILE

General text files can be read as easily as CSV files. However, you have to pass on the proper separator if it is anything other than a whitespace or a tab:

1. A comma-separated file, saved with the .txt extension, will result in the following DataFrame if read without explicitly setting the separator:

df13 = pd.read_table("Table_EX_1.txt")

df13

The output is as follows:

Figure 5.17: DataFrame from a comma-separated .txt file

2. In this case, we have to set the separator explicitly, as follows:

df13 = pd.read_table("Table_EX_1.txt",sep=',')

df13

The output is as follows:

Figure 5.18: DataFrame read using a comma separator

READING HTML TABLES DIRECTLY FROM A URL

The pandas library allows us to read HTML tables directly from a URL. This means that the library already has some kind of built-in HTML parser that processes the HTML content of a given page and tries to extract the various tables in the page.

Note
The read_html method returns a list of DataFrames (even if the page has a single table) and you have to extract the relevant tables from the list.

Consider the following example:

url = 'http://www.fdic.gov/bank/individual/failed/banklist.html'

list_of_df = pd.read_html(url)

df14 = list_of_df[0]

df14.head()

The results are shown in the following DataFrame:

Figure 5.19: Results of reading HTML tables

EXERCISE 66: FURTHER WRANGLING TO GET THE DESIRED DATA

As discussed in the preceding section, this HTML-reading function almost always returns more than one table for a given HTML page, and we have to further parse through the list to extract the particular table we are interested in:

1. For example, if we want to get the table of the 2016 Summer Olympics medal tally (by nation), we can easily search for a page on Wikipedia that we can pass on to pandas. We can do this by using the following command:

list_of_df = pd.read_html("https://en.wikipedia.org/wiki/2016_Summer_Olympics_medal_table",header=0)

2. If we check the length of the list returned, we will see that it is 6:

len(list_of_df)

The output is as follows:

6

3. To look for the table, we can run a simple loop:

for t in list_of_df:
    print(t.shape)

The output is as follows:

Figure 5.20: Shape of the tables

4. It looks like the second element in this list is the table we are looking for:

df15=list_of_df[1]

df15.head()

5. The output is as follows:

Figure 5.21: Output of the data in the second table


EXERCISE 67: READING FROM A JSON FILE

Over the last 15 years, JSON has become a ubiquitous choice for data exchange on the web. Today, it is the format of choice for almost every publicly available web API, and it is frequently used for private web APIs as well. It is a schema-less, text-based representation of structured data that is based on key-value pairs and ordered lists.

The pandas library provides excellent support for reading data from a JSON file directly into a DataFrame. To practice with this chapter, we have included a file called movies.json. This file contains the cast, genre, title, and year (of release) information for almost all major movies since 1900:

1. Extract the cast list for the 2012 Avengers movie (from Marvel Comics). First, read the JSON file and examine the resulting DataFrame:

df16 = pd.read_json("movies.json")

df16.head()

The output is as follows:

Figure 5.22: DataFrame displaying the Avengers movie cast

2. To look for the cast where the title is "The Avengers" and the year is 2012, we can use filtering:

cast_of_avengers=df16[(df16['title']=="The Avengers") & (df16['year']==2012)]['cast']

print(list(cast_of_avengers))

The output will be as follows:

[['Robert Downey, Jr.', 'Chris Evans', 'Mark Ruffalo', 'Chris Hemsworth', 'Scarlett Johansson', 'Jeremy Renner', 'Tom Hiddleston', 'Clark Gregg', 'Cobie Smulders', 'Stellan Skarsgård', 'Samuel L. Jackson']]

READING A STATA FILE

The pandas library provides a direct reading function for Stata files, too. Stata is a popular statistical modeling platform that's used in many governmental and research organizations, especially by economists and social scientists.

The simple code to read in a Stata file (.dta format) is as follows:

df17 = pd.read_stata("wu-data.dta")

EXERCISE 68: READING TABULAR DATA FROM A PDF FILE

Among the various types of data sources, the PDF format is probably the most difficult to parse in general. While there are some popular packages in Python for working with PDF files for general page formatting, the best library to use for table extraction from PDF files is tabula-py.

From the GitHub page of this package, tabula-py is a simple Python wrapper of tabula-java, which can read tables from a PDF. You can read tables from PDFs and convert them into pandas DataFrames. The tabula-py library also enables you to convert a PDF file into a CSV/TSV/JSON file.
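As a side note, the conversion mentioned above can be done with tabula's convert_into function. A hedged sketch (the output filename here is made up for illustration):

from tabula import convert_into

# Write the table(s) found on page 1 of the PDF straight to a CSV file
convert_into("Housing_data.pdf", "Housing_data_page1.csv", output_format="csv", pages=[1])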

You will need the following packages installed on your system before you can run this, but they are free and easy to install:

urllib3

pandas

pytest

flake8

distro

pathlib

1. Find the PDF file at the following link: https://github.com/TrainingByPackt/Data-Wrangling-with-Python/blob/master/Chapter05/Exercise60-68/Housing_data.xlsx. The following code retrieves the tables from two pages and joins them to make one table:

from tabula import read_pdf

df18_1 = read_pdf('Housing_data.pdf',pages=[1],pandas_options={'header':None})

df18_1

The output is as follows:

Figure 5.23: DataFrame with a table derived by merging a table flowing over two pages in a PDF

2. Retrieve the table from another page of the same PDF by using the following command:

df18_2 = read_pdf('Housing_data.pdf',pages=[2],pandas_options={'header':None})

df18_2

The output is as follows:

Figure 5.24: DataFrame displaying a table from another page

3. To concatenate the tables that were derived from the first two steps, execute the following code:

df18=pd.concat([df18_1,df18_2],axis=1)

df18

The output is as follows:

Figure 5.25: DataFrame derived by concatenating two tables

4. With PDF extraction, most of the time, headers will be difficult to extract automatically. You have to pass on the list of headers with the names argument in the read_pdf function as a pandas_options entry, as follows:

names=['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT','PRICE']

df18_1 = read_pdf('Housing_data.pdf',pages=[1],pandas_options={'header':None,'names':names[:10]})

df18_2 = read_pdf('Housing_data.pdf',pages=[2],pandas_options={'header':None,'names':names[10:]})

df18=pd.concat([df18_1,df18_2],axis=1)

df18

The output is as follows:

Figure 5.26: DataFrame with correct column headers for PDF data

We will have a full activity on reading tables from a PDF report and processing them at the end of this chapter.

Introduction to Beautiful Soup 4 and Web Page Parsing

The ability to read and understand web pages is of paramount interest for a person collecting and formatting data. For example, consider the task of gathering data about movies and then formatting it for a downstream system. Data about movies is best obtained from websites such as IMDB, and that data does not come pre-packaged in nice forms (CSV, JSON, and so on), so you need to know how to download and read web pages.

Furthermore, you also need to be equipped with knowledge of the structure of a web page so that you can design a system that can search for (query) a particular piece of information from a whole web page and get its value. This involves understanding the grammar of markup languages and being able to write something that can parse them. Doing this, and keeping all the edge cases in mind, for something like HTML is already incredibly complex, and if you extend the scope of the bespoke markup language to include XML as well, then it becomes full-time work for a team of people.

Thankfully, we are using Python, and Python has a very mature and stable library to do all of the complicated jobs for us. This library is called BeautifulSoup (it is, at present, in version 4, and thus we will call it bs4 in short from now on). bs4 is a library for getting data out of HTML and XML documents, and it gives you a nice, normalized, idiomatic way of navigating and querying a document. It does not include a parser of its own, but it supports different ones.

STRUCTURE OF HTML

Before we jump into bs4 and start working with it, we need to examine the structure of an HTML document. HyperText Markup Language is a structured way of telling web browsers about the organization of a web page, meaning which kinds of elements (text, image, video, and so on) come from where, in which place inside the page they should appear, what they look like, what they contain, and how they will behave with user input. HTML5 is the latest version of HTML. An HTML document can be viewed as a tree, as we can see from the following diagram:

Figure 5.27: HTML structure

Each node of the tree represents one element in the document. An element is anything that starts with < and ends with >. For example, <html>, <head>, <p>, <br>, <img>, and so on are various HTML elements. Some elements have a start and an end tag, where the end tag begins with "</" and has the same name as the start tag, such as <p> and </p>, and they can contain an arbitrary number of elements of other types in them. Some elements do not have an ending part, such as the <br /> element, and they cannot contain anything within them.

The only other thing that we need to know about an element at this point is the fact that elements can have attributes, which are there to modify the default behavior of an element. An <a> element requires an href attribute to tell the browser which website it should navigate to when that particular <a> is clicked, like this: <a href="http://cnn.com">The CNN news channel</a>, which will take you to cnn.com when clicked:

Figure 5.28: CNN news channel hyperlink
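As a quick, hedged illustration (the snippet below is a made-up one-line document, not the book's test.html file), here is how such an element and its attribute can be accessed once a document is parsed with bs4:

from bs4 import BeautifulSoup

html_doc = '<a href="http://cnn.com">The CNN news channel</a>'
soup = BeautifulSoup(html_doc, "html.parser")

# Access the first <a> element, its href attribute, and its text
print(soup.a['href'])   # http://cnn.com
print(soup.a.text)      # The CNN news channel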

So, when you are at a particular element of the tree, you can visit all the children of that element to get their contents and attributes.

Equipped with this knowledge, let's see how we can read and query data from an HTML document.

In this topic, we will cover the reading and parsing of web pages, but we will not request them from a live website. Instead, we will read them from disk. A section on reading them from the internet will follow in a future chapter.

EXERCISE 69: READING AN HTML FILE AND EXTRACTING ITS CONTENTS USING BEAUTIFULSOUP

In this exercise, we will do the simplest thing possible. We will import the BeautifulSoup library and then use it to read an HTML document. Then, we will examine the different kinds of objects it returns. While doing the exercises for this topic, you should have the example HTML file open in a text editor all the time so that you can check for the different tags and their attributes and contents:

1. Import the bs4 library:

from bs4 import BeautifulSoup

2. Download the test HTML file, save it on your disk, and then use bs4 to read it from the disk:

with open("test.html", "r") as fd:
    soup = BeautifulSoup(fd)
    print(type(soup))

The output is as follows:

<class 'bs4.BeautifulSoup'>

You can pass a file handler directly to the constructor of the BeautifulSoup object and it will read the contents from the file that the handler is attached to. We can see that the return type is an instance of bs4.BeautifulSoup. This class holds all the methods we need to navigate through the DOM tree that the document represents.

3. Print the contents of the file in a nice way by using the prettify method from the class, like this:

print(soup.prettify())

The output is as follows:

Figure 5.29: Contents of the HTML file

The same information can also be obtained by using the soup.contents member variable. The differences are: first, it won't print anything pretty and, second, it is essentially a list.

If we look carefully at the contents of the HTML file in a separate text editor, we will see that there are many paragraph tags, or <p> tags. Let's read the content from one such <p> tag. We can do that using the simple "." access modifier, as we would for a normal member variable of a class.

4. The magic of bs4 is the fact that it gives us this excellent way to dereference tags as member variables of the BeautifulSoup class instance:

with open("test.html", "r") as fd:
    soup = BeautifulSoup(fd)
    print(soup.p)

The output is as follows:

Figure 5.30: Text from the <p> tag

As we can see, this is the content of a <p> tag.

We saw how to read a tag in the last step, but we can easily see the problem with this approach. When we look into our HTML document, we can see that we have more than one <p> tag there. How can we access all the <p> tags? It turns out that this is easy.

5. Use the find_all method to extract the content from the tags:

with open("test.html", "r") as fd:
    soup = BeautifulSoup(fd)
    all_ps = soup.find_all('p')
    print("Total number of <p> --- {}".format(len(all_ps)))

The output is as follows:

Total number of <p> --- 6

This prints 6, which is exactly the number of <p> tags in the document.

We have seen how to access all the tags of the same type. We have also seen how to get the content of the entire HTML document.

6. Now, we will see how to get the contents under a particular HTML tag, as follows:

with open("test.html", "r") as fd:
    soup = BeautifulSoup(fd)
    table = soup.table
    print(table.contents)

The output is as follows:

Figure 5.31: Content under the <table> tag

Here, we are getting the (first) table from the document and then using the same "." notation to get the contents under that tag.

We saw in the previous step that we can access the entire content under a particular tag. However, HTML is represented as a tree and we are able to traverse the children of a particular node. There are a few ways to do this.

7. The first way is by using the children generator from any bs4 instance, as follows:

with open("test.html", "r") as fd:
    soup = BeautifulSoup(fd)
    table = soup.table
    for child in table.children:
        print(child)
        print("*****")

When we execute the code, we will see something like the following:

Figure 5.32: Traversing the children of a table node

It seems that the loop has only been executed twice! Well, the problem with the children generator is that it only takes into account the immediate children of the tag. We have <tbody> under the <table>, and our whole table structure is wrapped in it. That's why it was considered a single child of the <table> tag.

We looked into how to browse the immediate children of a tag. Next, we will see how we can browse all the possible children of a tag, and not only the immediate ones.

8. To do that, we use the descendants generator from the bs4 instance, as follows:

with open("test.html", "r") as fd:
    soup = BeautifulSoup(fd)
    table = soup.table
    children = table.children
    des = table.descendants
    print(len(list(children)), len(list(des)))

The output is as follows:

9 61

The comparison print at the end of the code block shows us the difference between children and descendants. The length of the list we got from children is only 9, whereas the length of the list we got from descendants is 61.
EXERCISE 70: DATAFRAMES AND BEAUTIFULSOUP

So far, we have seen some basic ways to navigate the tags inside an HTML document using bs4. Now, we are going to go one step further and use the power of bs4 combined with the power of pandas to generate a DataFrame out of a plain HTML table. This particular knowledge is very useful for us. With the knowledge we will acquire now, it will be fairly easy for us to prepare a pandas DataFrame to perform EDA (exploratory data analysis) or modeling. We are going to show this process on a simple, small table from the test HTML file, but the exact same concept applies to any arbitrarily large table as well:

1. Import pandas and read the document, as follows:

import pandas as pd

fd = open("test.html", "r")
soup = BeautifulSoup(fd)
data = soup.findAll('tr')

print("Data is a {} and {} items long".format(type(data), len(data)))

The output is as follows:

Data is a <class 'bs4.element.ResultSet'> and 4 items long

2. Check the original table structure in the HTML source. You will see that the first row is the column headings and all of the following rows are the data. We assign two different variables for the two sections, as follows:

data_without_header = data[1:]

headers = data[0]

headers

The output is as follows:

<tr>
<th>Entry Header 1</th>
<th>Entry Header 2</th>
<th>Entry Header 3</th>
<th>Entry Header 4</th>
</tr>

Note

Keep in mind that the art of scraping an HTML page goes hand in hand with an understanding of the source HTML structure. So, whenever you want to scrape a page, the first thing you need to do is right-click on it and then use "View Source" from the browser to see the source HTML.

3. Once we have separated the two sections, we need two list comprehensions to make them ready to go into a DataFrame. For the header, this is easy:

col_headers = [th.getText() for th in headers.findAll('th')]

col_headers

The output is as follows:

['Entry Header 1', 'Entry Header 2', 'Entry Header 3', 'Entry Header 4']

4. Data preparation is a bit tricky for a pandas DataFrame. You need to have a two-dimensional list, which is a list of lists. We accomplish that in the following way:

df_data = [[td.getText() for td in tr.findAll('td')] for tr in data_without_header]

df_data

The output is as follows:

Figure 5.33: Output as a two-dimensional list

5. Invoke the pd.DataFrame method and supply the right arguments by using the following code:

df = pd.DataFrame(df_data, columns=col_headers)

df.head()

Figure 5.34: Output in tabular format with column headers

EXERCISE 71: EXPORTING A DATAFRAME AS AN EXCEL FILE

In this exercise, we will see how we can save a DataFrame as an Excel file. pandas can do this natively, but it needs the help of the openpyxl library to achieve this goal:

1. Install the openpyxl library by using the following command:

!pip install openpyxl

2. To save the DataFrame as an Excel file, use the following commands from inside the Jupyter notebook:

writer = pd.ExcelWriter('test_output.xlsx')
df.to_excel(writer, "Sheet1")
writer.save()

writer

The output is as follows:

<pandas.io.excel._XlsxWriter at 0x24feb2939b0>
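A minimal alternative sketch (equivalent to the steps above, not shown in the book) uses ExcelWriter as a context manager, which saves and closes the file automatically:

# The file is written when the with block exits
with pd.ExcelWriter('test_output.xlsx') as writer:
    df.to_excel(writer, sheet_name="Sheet1")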

EXERCISE 72: STACKING URLS FROM A DOCUMENT USING BS4

Previously (while discussing stacks), we explained how important it is to have a stack that we can push the URLs from a web page onto, so that we can pop them at a later time to follow each of them. Here, in this exercise, we will see how that works.

In the given test HTML file, links, or <a> tags, are under a <ul> tag, and each of them is contained inside a <li> tag:

1. Find all the <a> tags by using the following command:

fd = open("test.html", "r")
soup = BeautifulSoup(fd)

lis = soup.find('ul').findAll('li')

stack = []

for li in lis:
    a = li.find('a', href=True)

2. Define the stack before you start the loop. Then, inside the loop, use the append method to push the links onto the stack:

    stack.append(a['href'])

3. Print the stack:

Figure 5.35: Output of the stack
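As a brief, hedged follow-up (not part of the book's exercise), the URLs collected this way can later be popped off the stack one at a time, in last-in, first-out order:

# Process the stacked links until the stack is empty
while stack:
    url = stack.pop()
    print("Next URL to follow:", url)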

ACTIVITY 7: READING TABULAR DATA FROM A WEB PAGE AND CREATING DATAFRAMES

In this activity, you have been given a Wikipedia page where you have the GDP of all countries listed. You have been asked to create three DataFrames from the three sources mentioned in the page (https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)):

You will have to do the following:

1. Open the page in a separate Chrome/Firefox tab and use something like an Inspect Element tool to view the source HTML and understand its structure

2. Read the page using bs4

3. Find the table structure you will need to deal with (how many tables are there?)

4. Find the right table using bs4

5. Separate the source names and their corresponding data

6. Get the source names from the list of sources you have created

7. Separate the header and data from the data that you separated before for the first source only, and then create a DataFrame using that

8. Repeat the last task for the other two data sources

Note

The solution for this activity can be found on page 308.

Summary

In this topic, we looked at the structure of an HTML document. HTML documents are the cornerstone of the World Wide Web and, given the amount of data that's contained on it, we can easily infer the importance of HTML as a data source.

We learned about bs4 (Beautiful Soup 4), a Python library that gives us Pythonic ways to read and query HTML documents. We used bs4 to load an HTML document and also explored several different ways to navigate the loaded document. We also got the necessary information about the differences between all of these methods.

We looked at how we can create a pandas DataFrame from an HTML document (which contains a table). Although there are some built-in ways to do this job in pandas, they fail as soon as the target table is encoded inside a complex hierarchy of elements. So, the knowledge we gathered in this topic by transforming an HTML table into a pandas DataFrame in a step-by-step manner is invaluable.

Finally, we looked at how we can create a stack in our code, where we push all the URLs that we encounter while reading the HTML file and then use them at a later time. In the next chapter, we will discuss list comprehensions, zip, format, and outlier detection and cleaning.
Chapter 6

Learning the Hidden Secrets of Data Wrangling

Learning Objectives

By the end of this chapter, you will be able to:

Clean and handle real-life messy data

Prepare data for data analysis by formatting it in the format required by downstream systems

Identify and remove outliers from data

In this chapter, you will learn about data issues that happen in real life. You will also learn how to solve these issues.

Introduction

In this chapter, we will learn about the secret sauce behind creating a successful data wrangling pipeline. In the previous chapters, we were introduced to the basic data structures and building blocks of data wrangling, such as pandas and NumPy. In this chapter, we will look at the data handling side of data wrangling.

Imagine that you have a database of patients who have heart diseases, and, like any survey, the data is either missing, incorrect, or has outliers. Outliers are values that are abnormal and tend to be far away from the central tendency, and thus including them in your fancy machine learning model may introduce a terrible bias that we need to avoid. Often, these problems can cause a huge difference in terms of money, man-hours, and other organizational resources. It is undeniable that someone with the skills to solve these problems will prove to be an asset to an organization.

ADDITIONAL SOFTWARE REQUIRED FOR THIS SECTION

The code for this exercise depends on two additional libraries. We need to install SciPy and python-Levenshtein, and we are going to install them in the running Docker container. Be wary of this if you are not working inside the container.

To install the libraries, type the following command in the running Jupyter notebook:

!pip install scipy python-Levenshtein

Advanced List Comprehension and the zip Function

In this topic, we will take a deep dive into the heart of list comprehension. We have already seen a basic form of it, ranging from something as simple as a = [i for i in range(0, 30)] to something a bit more complex involving one conditional statement. However, as we already mentioned, list comprehension is a very powerful tool and, in this topic, we will explore the power of this amazing tool further. We will investigate another close relative of list comprehension called generators, and also work with zip and its related functions and methods. By the end of this topic, you will be confident in handling complicated logical problems.
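Before diving in, here is a minimal, hedged illustration of what zip does (the names and numbers below are made up for demonstration): it pairs up elements from two or more iterables, and, like the generator expressions discussed next, it evaluates lazily by returning an iterator:

names = ["Alice", "Bob", "Carol"]
ages = [31, 27, 45]

# Pair each name with the corresponding age
for name, age in zip(names, ages):
    print(name, age)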

INTRODUCTION TO GENERATOR EXPRESSIONS

Previously, while discussing advanced data structures, we witnessed functions such as repeat. We said that they represent a special type of function known as iterators. We also showed you how the lazy evaluation of an iterator can lead to an enormous amount of space saving and time efficiency.

Iterators are one brick in the functional programming construct that Python has to offer. Functional programming is indeed a very efficient and safe way to approach a problem. It offers various advantages over other methods, such as modularity, ease of debugging and testing, composability, formal provability (a theoretical computer science concept), and so on.

EXERCISE 73: GENERATOR EXPRESSIONS
In this exercise, we will be introduced to generator expressions, which are considered another brick of functional programming (as a matter of fact, they are inspired by the pure functional language known as Haskell). Since we have already seen some list comprehension, generator expressions will look familiar to us. However, they also offer some advantages over list comprehension:

1. Write the following code using list comprehension to generate a list of all the odd numbers between 0 and 100,000:

odd_numbers2 = [x for x in range(100000) if x % 2 != 0]

2. Use getsizeof from sys by using the following code:

from sys import getsizeof

getsizeof(odd_numbers2)

The output is as follows:

406496

We will see that it takes a good amount of memory to do this. It is also not very time efficient. How can we change that? Using something like repeat is not applicable here because we need the logic of the list comprehension. Fortunately, we can turn any list comprehension into a generator expression.

3. Write the equivalent generator expression for the aforementioned list comprehension:

odd_numbers = (x for x in range(100000) if x % 2 != 0)

Notice that the only change we made is to surround the list comprehension statement with round brackets instead of square ones. That makes it shrink to only around 100 bytes (a quick getsizeof check, shown right after this exercise, confirms this)! This turns it into a lazy evaluation and is thus more efficient.

4. Print the first few odd numbers, as follows:

for i, number in enumerate(odd_numbers):
    print(number)
    if i > 10:
        break

The output is as follows:

1

3

5

7

9

11

13

15

17

19

21

23
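
For comparison, here is a quick check on the size of the generator expression itself, using the objects created in steps 1 and 3 (a minimal sketch; the exact byte count will vary slightly between Python versions and platforms):

from sys import getsizeof

# The list stores all 50,000 elements; the generator stores only its state
print(getsizeof(odd_numbers2))  # several hundred kilobytes
print(getsizeof(odd_numbers))   # roughly 100-200 bytes, depending on the Python version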

EXERCISE 74: ONE-LINER GENERATOR EXPRESSION
In this exercise, we will use our knowledge of generator expressions to generate an expression that reads one word at a time from a list of words, removes any newline character at the end, and makes the word lowercase. This could certainly be done using an explicit for loop:

1. Create a list of words, as follows:

words = ["Hello\n", "My name", "is\n", "Bob", "How are you", "doing\n"]

2. Write the following generator expression to achieve the task:

modified_words = (word.strip().lower() for word in words)

3. Create a list comprehension to get words one by one from the generator expression and finally print the list, as follows:

final_list_of_word = [word for word in modified_words]

final_list_of_word

The output is as follows:

Figure 6.1: List comprehension of words

EXERCISE 75: EXTRACTING A LIST WITH SINGLE WORDS
If we look at the output of the previous exercise, we will notice that, due to the messy nature of the source data (which is normal in the real world), we ended up with a list where, in some cases, more than one word appears together, separated by a space. To improve this, and to get a list of single words, we will have to modify the generator expression:

1. Write the generator expression, and then write the equivalent nested for loops so that we can compare the results:

words = ["Hello\n", "My name", "is\n", "Bob", "How are you", "doing\n"]

modified_words2 = (w.strip().lower() for word in words for w in word.split(" "))

final_list_of_word = [word for word in modified_words2]

final_list_of_word

The output is as follows:

Figure 6.2: List of words from the string

2. Write an equivalent to this by using a nested for loop, as follows:

modified_words3 = []

for word in words:
    for w in word.split(" "):
        modified_words3.append(w.strip().lower())

modified_words3

The output is as follows:

Figure 6.3: List of words from the string using a nested loop

We must admit that the generator expression was not only space and time saving but also a more elegant way to write the same logic.

To remember how the nested loop in a generator expression works, keep in mind that the loops are evaluated from left to right, and the final loop variable (in our example, denoted by the single letter "w") is the one given back (thus we could call strip and lower on it).

The following diagram will help you remember the trick about nested for loops in list comprehensions or generator expressions:

Figure 6.4: Nested loops illustration

We have learned about nested for loops in generator expressions previously, but now we are going to learn about independent for loops in a generator expression. We will have two output variables from two for loops, and they must be treated as a tuple so that they don't create ambiguous grammar in Python.

Create the following two lists:

marbles = ["RED", "BLUE", "GREEN"]

counts = [1, 5, 13]

You are asked to generate all possible combinations of marbles and counts from the preceding two lists. How will you do that? Surely, using a nested for loop and the list's append method, you can accomplish the task. How about a generator expression? A more elegant and easy solution is as follows:

marble_with_count = ((m, c) for m in marbles for c in counts)

This generator expression creates a tuple in each iteration of the nested for loops. This code is equivalent to the following explicit code:

marble_with_count_as_list_2 = []

for m in marbles:
    for c in counts:
        marble_with_count_as_list_2.append((m, c))

marble_with_count_as_list_2

The output is as follows:

Figure 6.5: Appending the marbles and counts

Once again, the generator expression is easy, elegant, and efficient.
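
As a side note, the standard library's itertools.product produces the same cross product lazily, which can be handy when there are more than two lists (a small sketch, not part of the original exercise):

from itertools import product

marbles = ["RED", "BLUE", "GREEN"]
counts = [1, 5, 13]

# product returns an iterator over all (marble, count) pairs,
# equivalent to the nested generator expression above
marble_with_count_alt = list(product(marbles, counts))

marble_with_count_alt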

EXERCISE 76: THE ZIP FUNCTION
In this exercise, we will examine the zip function and compare it with the generator expression we wrote in the previous exercise. The problem with the previous generator expression is that it produces all possible combinations. For instance, if we need to relate countries with their capitals, doing so using a generator expression will be difficult. Fortunately, Python gives us a built-in function called zip for just this purpose:

1. Create the following two lists:

countries = ["India", "USA", "France", "UK"]
capitals = ["Delhi", "Washington", "Paris", "London"]

2. Generate a list of tuples, where the first element is the name of the country and the second element is the name of the capital, by using the following command (see also the short aside after this exercise):

countries_and_capitals = [t for t in zip(countries, capitals)]

3. This is not very well represented. We can use a dict, where the keys are the names of the countries and the values are the names of the capitals, by using the following command:

countries_and_capitals_as_dict = dict(zip(countries, capitals))

The output is as follows:

Figure 6.6: Dictionary with countries and capitals
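
Incidentally, the list comprehension in step 2 is only there to materialize the zip iterator; calling list on it directly gives the same result (a small aside, not one of the original steps):

countries_and_capitals = list(zip(countries, capitals))
# [('India', 'Delhi'), ('USA', 'Washington'), ('France', 'Paris'), ('UK', 'London')]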

EXERCISE 77: HANDLING MESSY DATA
As always, in real life, data is messy. So, the nice, equal-length lists of countries and capitals that we just saw are not always available. The zip function cannot be used with unequal-length lists, because zip stops working as soon as one of the lists comes to an end. To save us in such a situation, we have zip_longest in the itertools module:

1. Create two lists of unequal length, as follows:

countries = ["India", "USA", "France", "UK", "Brasil", "Japan"]

capitals = ["Delhi", "Washington", "Paris", "London"]

2. Create the final dict, putting None as the value for the countries that do not have a capital in the capitals list:

from itertools import zip_longest

countries_and_capitals_as_dict_2 = dict(zip_longest(countries, capitals))

countries_and_capitals_as_dict_2

The output is as follows:
Figure 6.7: Output using zip_longest

We should pause here for a second and think about how many lines of explicit code and difficult-to-understand if-else conditional logic we just saved by calling a single function and simply giving it the two source data lists. It is indeed amazing!

With these exercises, we are ending the first topic of this chapter. Advanced list comprehension, generator expressions, and functions such as zip and zip_longest are some very important tricks that we need to master if we want to write clean, efficient, and maintainable code. Code that does not have these three qualities is considered sub-par in the industry, and we certainly don't want to write such code.

However, we did not cover one important object here, that is, generators. Generators are a special type of function that shares behavioral traits with generator expressions. However, being functions, they have a broader scope and are much more flexible. We strongly encourage you to learn about them.
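
To give you a taste of what a generator function looks like, here is a minimal sketch (the example itself is ours, not one of the book's exercises; the yield keyword is what turns an ordinary function into a generator):

def odd_numbers_up_to(limit):
    # Values are produced lazily, one at a time, instead of building a list in memory
    n = 1
    while n < limit:
        yield n
        n += 2

gen = odd_numbers_up_to(100000)
print(next(gen))  # 1
print(next(gen))  # 3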

Data Formatting
In this topic, we will format a given dataset. The main motivations behind formatting data properly are as follows:

It helps all the downstream systems to have a single, pre-agreed form of data for each data point, thus avoiding surprises that could, in effect, break them.

It produces a human-readable report from lower-level data that is, most of the time, created for machine consumption.

It helps us find errors in data.

There are a few ways to do data formatting in Python. We will begin with the % operator.

THE % OPERATOR
Python gives us the % operator to apply basic formatting to data. To demonstrate this, we will load the data first by reading the CSV file, and then we will apply some basic formatting to it.

Load the data from the CSV file by using the following command:

from csv import DictReader

raw_data = []

with open("combinded_data.csv", "rt") as fd:
    data_rows = DictReader(fd)
    for data in data_rows:
        raw_data.append(dict(data))

Now, we have a list called raw_data that contains all the rows of the CSV file. Feel free to print it to check out what it looks like.

The output is as follows:
Figure 6.8: Raw data

We will be producing a report on this data. This report will contain one section for each data point and will report the name, age, weight, height, history of family disease, and finally the present heart condition of the person. These points must be clear and easily understandable English sentences.

We do this in the following way:

for data in raw_data:
    report_str = """%s is %s years old and is %s meter tall weighing about %s kg.\n
Has a history of family illness: %s.\n
Presently suffering from a heart disease: %s
""" % (data["Name"], data["Age"], data["Height"],
       data["Weight"], data["Disease_history"],
       data["Heart_problem"])
    print(report_str)

The output is as follows:
Figure 6.9: Raw data in a presentable format

The % operator is used in two different ways:

When used inside the quotes, it signifies what kind of data to expect there. %s stands for string, whereas %d stands for integer. If we indicate the wrong data type, it will throw an error. Thus, we can effectively use this kind of formatting as an error filter on the incoming data.

When we use the % operator outside the quotes, it basically tells Python to start replacing all the placeholders inside with the values provided outside. The short sketch below illustrates both uses.
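
A minimal sketch of both uses (our own illustrative values, not taken from the patient data):

name, age = "Alice", 30

print("%s is %d years old." % (name, age))  # Alice is 30 years old.

# Supplying a non-numeric value where %d is expected raises a TypeError,
# which is why this style of formatting can double as a crude error filter
try:
    print("%s is %d years old." % (name, "thirty"))
except TypeError as error:
    print("Bad data:", error)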

USING THE FORMAT FUNCTION
In this section, we will be looking at the exact same formatting problem, but this time we will use a more advanced approach. We will use Python's format function.

To use the format function, we do the following:

for data in raw_data:
    report_str = """{} is {} years old and is {} meter tall weighing about {} kg.\n
Has a history of family illness: {}.\n
Presently suffering from a heart disease: {}
""".format(data["Name"], data["Age"], data["Height"],
           data["Weight"], data["Disease_history"],
           data["Heart_problem"])
    print(report_str)

The output is as follows:

Figure 6.10: Data formatted using the format function of the string

Notice that we have replaced the %s with {} and, instead of the % outside the quotes, we have called the format function.
We will now see how the powerful format function can make the previous code a lot more readable and understandable. Instead of simple, blank {}, we can mention the key names inside them and then use the special Python ** operation on a dict to unpack it and give it to the format function. The format function is smart enough to figure out how to replace the key names inside the quotes with the values from the actual dict, by using the following command:

for data in raw_data:
    report_str = """{Name} is {Age} years old and is {Height} meter tall weighing about {Weight} kg.\n
Has a history of family illness: {Disease_history}.\n
Presently suffering from a heart disease: {Heart_problem}
""".format(**data)
    print(report_str)

The output is as follows:

Figure 6.11: Readable file using the ** operation

This approach is indeed much more concise and maintainable.

EXERCISE 78: DATA REPRESENTATION USING {}
The {} notation inside the quotes is powerful, and we can change our data representation significantly by using it:

1. Change a decimal number to its binary form by using the following command:

original_number = 42

print("The binary representation of 42 is - {0:b}".format(original_number))

The output is as follows:

Figure 6.12: A number in its binary representation

2. Print a string that's center oriented:

print("{:^42}".format("I am at the center"))

The output is as follows:

Figure 6.13: A string that's been center formatted

3. Print a string that's center oriented, but this time with padding on both sides:

print("{:=^42}".format("I am at the center"))

The output is as follows:

Figure 6.14: A string that's been center formatted with padding

As we've already mentioned, the format statement is a powerful one.
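
A few more format specifications that are commonly useful (a short sketch of standard Python format-spec features, not part of the original exercise):

value = 3.14159

print("{:.2f}".format(value))    # '3.14' - two decimal places
print("{:>10}".format("right"))  # right-aligned in a field of width 10
print("{:08d}".format(42))       # '00000042' - zero-padded integer
print("{:,}".format(1234567))    # '1,234,567' - thousands separator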

It is important to format dates, as dates come in various formats depending on the source of the data, and they may need several transformations inside the data wrangling pipeline.

We can use the familiar date formatting notations with format as follows:

from datetime import datetime

print("The present datetime is {:%Y-%m-%d %H:%M:%S}".format(datetime.utcnow()))

The output is as follows:

Figure 6.15: Data after being formatted

Compare it with the actual output of datetime.utcnow and you will easily see the power of this expression.

Identify and Clean Outliers
When confronted with real-world data, we often see a specific thing in a set of records: there are some data points that do not fit with the rest of the records. They have values that are too big, too small, or completely missing. These kinds of records are called outliers.

Statistically, there is a proper definition and idea of what an outlier means, and often you need deep domain expertise to understand when to call a particular record an outlier. However, in this present exercise, we will look into some basic techniques that are commonplace for flagging and filtering outliers in real-world data for day-to-day work.

EXERCISE 79: OUTLIERS IN NUMERICAL DATA
In this exercise, we will first construct a notion of an outlier based on numerical data. Imagine a cosine curve. If you remember the math for this from high school, then a cosine curve is a very smooth curve within the limits of [-1, 1]:

1. To construct a cosine curve, execute the following command:

from math import cos, pi

ys = [cos(i*(pi/4)) for i in range(50)]

2. Plot the data by using the following code:

import matplotlib.pyplot as plt

plt.plot(ys)

The output is as follows:

Figure 6.16: Cosine wave

As we can see, it is a very smooth curve, and there are no outliers. We are going to introduce some now.

3. Introduce some outliers by using the following command:

ys[4] = ys[4] + 5.0

ys[20] = ys[20] + 8.0

4. Plot the curve:

plt.plot(ys)

Figure 6.17: Wave with outliers

We can see that we have successfully introduced two values into the curve, which broke the smoothness and hence can be considered outliers.

A good way to detect whether our dataset has an outlier is to create a box plot. A box plot is a way of plotting numerical data based on its central tendency and some buckets (in reality, we call them quartiles). In a box plot, the outliers are usually drawn as separate points. The matplotlib library helps draw box plots out of a series of numerical data, which isn't hard at all. This is how we do it:

plt.boxplot(ys)

Once you execute the preceding code, you will be able to see that there is a nice box plot where the two outliers that we created are clearly shown, just like in the following diagram:

Figure 6.18: Boxplot with outliers


Z-SCORE
A z-score is a measure on a set of data that gives you a value for each data point indicating how far that data point lies from the mean of the dataset, in terms of its standard deviation. We can use the z-score to numerically detect outliers in a set of data. Normally, any data point with a z-score greater than +3 or less than -3 is considered an outlier. We can use this concept, with a bit of help from the excellent SciPy and pandas libraries, to filter out the outliers.

Use SciPy to calculate the z-score by using the following command:

from scipy import stats

cos_arr_z_score = stats.zscore(ys)

cos_arr_z_score

The output is as follows:
Figure 6.19: The z-score values
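
Under the hood, the z-score of each point is simply its distance from the mean divided by the standard deviation. Here is a minimal sketch with NumPy (equivalent to what stats.zscore computes, up to the degrees-of-freedom convention):

import numpy as np

ys_arr = np.array(ys)

# z-score = (value - mean) / standard deviation, computed element-wise
manual_z_score = (ys_arr - ys_arr.mean()) / ys_arr.std()

manual_z_score[:5]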

EXERCISE 80: THE Z-SCORE VALUE TO REMOVE OUTLIERS
In this exercise, we will discuss how to get rid of outliers in a set of data. In the last exercise, we calculated the z-score of each data point. In this exercise, we will use that to remove outliers from our data:

1. Import pandas and create a DataFrame:

import pandas as pd

df_original = pd.DataFrame(ys)

2. Keep only the rows with a z-score less than 3, that is, drop the outliers:

cos_arr_without_outliers = df_original[(cos_arr_z_score < 3)]

3. Use the print function to print the new and old shapes:

print(cos_arr_without_outliers.shape)

print(df_original.shape)

From the two prints ((48, 1) and (50, 1)), it is clear that the derived DataFrame has two rows less. These are our outliers. If we plot the cos_arr_without_outliers DataFrame, then we will see the following output:

Figure 6.20: Cosine wave without outliers

As expected, we got back the smooth curve and got rid of the outliers.

Detecting and getting rid of outliers is an involved and critical process in any data wrangling pipeline. It needs deep domain knowledge, expertise in descriptive statistics, mastery of the programming language (and all the useful libraries), and a lot of caution. We recommend being very careful when performing this operation on a dataset.
EXERCISE 81: FUZZY MATCHING OF STRINGS
In this exercise, we will look into a slightly different problem that, at first glance, may look like an outlier. However, upon careful examination, we will see that it is indeed not, and we will learn about a useful concept that is sometimes referred to as fuzzy matching of strings.

Levenshtein distance is an advanced concept. We can think of it as the minimum number of single-character edits that are needed to convert one string into another. When two strings are identical, the distance between them is 0 – the bigger the difference, the higher the number. We can consider a threshold of distance under which we will consider two strings to be the same. Thus, we can not only rectify human error but also spread a safety net so that we don't pass all the candidates.

Levenshtein distance calculation is an involved process, and we are not going to implement it from scratch here. Thankfully, like a lot of other things, there is a library available for us to do this. It is called python-Levenshtein:

1. Create the load data of a ship on three different dates:

Figure 6.21: Initialized ship_data variable

If you look carefully, you will notice that the name of the ship is spelled differently in all three cases. Let's assume that the actual name of the ship is "Sea Princess". From a normal perspective, it does look like there has been a human error and the data points do describe a single ship. Removing two of them on a strict basis of outliers may not be the best thing to do.

2. Then, we simply need to import the distance function from it and pass two strings to it to calculate the distance between them:

from Levenshtein import distance

name_of_ship = "Sea Princess"

for k, v in ship_data.items():
    print("{} {} {}".format(k, name_of_ship, distance(name_of_ship, k)))

The output is as follows:
Figure 6.22: Distance between the strings

We will notice that the distances between the strings are different. It is 0 when they are identical, and it is a positive integer when they are not. We can use this concept in our data wrangling jobs and say that strings with a distance less than or equal to a certain number are the same string.

Here, again, we need to be cautious about when and how to use this kind of fuzzy string matching. Sometimes it is needed, and at other times it will result in a very bad bug. A small, self-contained sketch of the idea follows.
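
The following is a minimal, self-contained sketch of the fuzzy-matching idea, using a hypothetical ship_data dictionary (the actual values in the book's figure may differ):

from Levenshtein import distance

# Hypothetical data: the same ship logged under slightly different spellings
ship_data = {
    "Sea Princess": {"date": "12/08", "load": 40000},
    "Sea Pincess": {"date": "10/06", "load": 30000},
    "Sea Princes": {"date": "12/04", "load": 30000},
}

name_of_ship = "Sea Princess"
threshold = 2  # treat names within 2 edits as the same ship

for k in ship_data:
    if distance(name_of_ship, k) <= threshold:
        print("Treating '{}' as '{}'".format(k, name_of_ship))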

Activity 8: Handling Outliers and Missing Data
In this activity, we will identify and get rid of outliers. Here, we have a CSV file. The goal is to clean the data by using the knowledge that we have learned so far and come up with a nicely formatted DataFrame. Identify the type of outliers and their effect on the data, and clean the messy data.

The steps that will help you solve this activity are as follows:

1. Read the visit_data.csv file.

2. Check for duplicates.

3. Check if any essential column contains NaN.

4. Get rid of the outliers.

5. Report the size difference.

6. Create a box plot to check for outliers.

7. Get rid of any outliers.
Note

The solution for this activity can be found on page 312.

Summary
In this chapter, we learned about interesting ways to deal with list data by using generator expressions. They are easy and elegant and, once mastered, they give us a powerful trick that we can use repeatedly to simplify several common data wrangling tasks. We also examined different ways to format data. Formatting data is not only useful for preparing beautiful reports – it is often very important to guarantee data integrity for the downstream systems.

We ended the chapter by checking out some methods to identify and remove outliers. This is important for us because we want our data to be properly prepared and ready for all our fancy downstream analysis jobs. We also observed how important it is to take time and use domain expertise to set up rules for identifying outliers, as doing this incorrectly can do more harm than good.

In the next chapter, we will cover how to read web pages, XML files, and APIs.
Chapter 7
Advanced Web Scraping and Data Gathering
Learning Objectives
By the end of this chapter, you will be able to:

Make use of requests and BeautifulSoup to read various web pages and gather data from them

Perform read operations on XML files and the web using an Application Program Interface (API)

Make use of regex techniques to scrape useful information from a large and messy text corpus

In this chapter, you will learn how to gather data from web pages, XML files, and APIs.

Introduction
The previous chapter covered how to create a successful data wrangling pipeline. In this chapter, we will build a real-life web scraper using all of the techniques that we have learned so far. This chapter builds on the foundation of BeautifulSoup and introduces various methods for scraping a web page and using an API to gather data.

The Basics of Web Scraping and the Beautiful Soup Library
In today's connected world, one of the most valued and widely used skills for a data wrangling professional is the ability to extract and read data from web pages and databases hosted on the web. Most organizations host data on the cloud (public or private), and the majority of web microservices these days provide some kind of API for external users to access data:

Figure 7.1: Data wrangling HTTP request and an XML/JSON reply

It is necessary that, as a data wrangling engineer, you know about the structure of web pages and the Python libraries that enable you to extract data from a web page. The World Wide Web is an ever-growing, ever-changing universe, in which different data exchange protocols and formats are used. A few of these are widely used and have become standards.

LIBRARIES IN PYTHON
Python comes equipped with built-in modules, such as urllib, that can place HTTP requests over the internet and receive data from the cloud. However, these modules operate at a lower level and require deeper knowledge of HTTP protocols, encoding, and requests.

We will take advantage of two Python libraries in this chapter: Requests and BeautifulSoup. To avoid dealing with HTTP methods at a lower level, we will use the Requests library. It is an API built on top of pure Python web utility libraries, which makes placing HTTP requests easy and intuitive.

BeautifulSoup is one of the most popular HTML parser packages. It parses the HTML content you pass on and builds a detailed tree of all the tags and markup within the page for easy and intuitive traversal. This tree can be used by a programmer to look for certain markup elements (for example, a table, a hyperlink, or a blob of text within a particular div ID) to scrape useful data.

EXERCISE 81: USING THE REQUESTS LIBRARY TO GET A RESPONSE FROM THE WIKIPEDIA HOME PAGE
The Wikipedia home page consists of many elements and scripts, all of which are a mix of HTML, CSS, and JavaScript code blocks. To read the home page of Wikipedia and extract some useful textual information, we need to move step by step, as we are not interested in all of the code or markup tags; only some selected portions of the text.

In this exercise, we will peel off the layers of HTML/CSS/JavaScript to pry away the information we are interested in.
1. Import the requests library:

import requests

2. Assign the home page URL to a variable, wiki_home:

# First assign the URL of the Wikipedia home page to a string
wiki_home = "https://en.wikipedia.org/wiki/Main_Page"

3. Use the get method from the requests library to get a response from this page:

response = requests.get(wiki_home)

4. To get information about the response object, enter the following code:

type(response)

The output is as follows:

requests.models.Response

It is a model data structure that's defined in the requests library.

The web is an extremely dynamic place. It is possible that the home page of Wikipedia will have changed by the time somebody uses your code, or that a particular web server will be down and your request will essentially fail. If you proceed to write more complex and elaborate code without checking the status of your request, then all that subsequent work will be fruitless.

A web page request generally comes back with various codes. Here are some of the common codes you may encounter:

Figure 7.2: Web requests and their description

So, we write a function to check the code and print out messages as needed. These kinds of small helper/utility functions are incredibly useful for complex projects.

EXERCISE 82: CHECKING THE STATUS OF THE WEB REQUEST
Next, we will write a small utility function to check the status of the response.

We will start by getting into the habit of writing small functions to accomplish small modular tasks, instead of writing long scripts, which are hard to debug and track:

1. Create a status_check function by using the following command:

def status_check(r):
    if r.status_code == 200:
        print("Success!")
        return 1
    else:
        print("Failed!")
        return -1

Note that, along with printing the appropriate message, we are returning either 1 or -1 from this function. This is important.

2. Check the response using the status_check command:

status_check(response)

The output is as follows:

Figure 7.3: The output of status_check

In this chapter, we will not use these returned values, but later, for more complex programming activities, you should proceed only if you get 1 as the return value of this function; that is, you will write a conditional statement to check the return value and then execute the subsequent code based on it.

CHECKING THE ENCODING OF THE WEB PAGE
We can also write a utility function to check the encoding of the web page. Various encodings are possible with any HTML document, although the most popular is UTF-8. Some of the most popular encodings are ASCII, Unicode, and UTF-8. ASCII is the simplest, but it cannot capture the complex symbols used in the various spoken and written languages all over the world, so UTF-8 has become almost the universal standard in web development these days.

When we run this function on the Wikipedia home page, we get back the particular encoding type that's used for that page. This function, like the previous one, takes the requests response object as an argument and returns a value:

def encoding_check(r):
    return (r.encoding)

Check the response:

encoding_check(response)

The output is as follows:

'UTF-8'

Here, UTF-8 denotes the most popular character encoding scheme that's used in the digital medium and on the web today. It employs variable-length encoding with 1-4 bytes, thereby representing all Unicode characters in various languages around the world.
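
The variable-length nature of UTF-8 is easy to verify in Python (a quick illustrative check, not part of the exercise):

# ASCII characters take 1 byte, accented Latin letters 2, many CJK characters 3, emoji 4
for ch in ["a", "é", "中", "🙂"]:
    print(ch, "->", len(ch.encode("utf-8")), "bytes")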

EXERCISE 83: CREATING A FUNCTION TO DECODE THE CONTENTS OF THE RESPONSE AND CHECK ITS LENGTH
The final aim of this series of steps is to get a page's contents as a blob of text or as a string object that Python can process afterward. Over the internet, data streams move in an encoded format. Therefore, we need to decode the content of the response object. For this purpose, we need to perform the following steps:

1. Write a utility function to decode the contents of the response:

def decode_content(r, encoding):
    return (r.content.decode(encoding))

contents = decode_content(response, encoding_check(response))

2. Check the type of the decoded object:

type(contents)

The output is as follows:

str

We finally got a string object by reading the HTML page!

Note

Note that the answer in this chapter and in the exercise in the Jupyter notebook may vary because of updates that have been made to the Wikipedia page.

3. Check the length of the object and try printing some of it:

len(contents)

The output is as follows:

74182

If you print the first 10,000 characters of this string, it will look something similar to this:
Figure 7.4: Output showing a mixed blob of HTML markup tags, text and element names, and properties

Obviously, this is a mixed blob of various HTML markup tags, text, and element names/properties. We cannot hope to extract meaningful information from this without using sophisticated functions or methods. Fortunately, the BeautifulSoup library provides such methods, and we will see how to use them next.

EXERCISE 84: EXTRACTING HUMAN-READABLE TEXT FROM A BEAUTIFULSOUP OBJECT
It turns out that a BeautifulSoup object has a text method, which can be used just to extract text:

1. Import the package and then pass on the whole string (HTML content) to a method for parsing:

from bs4 import BeautifulSoup

soup = BeautifulSoup(contents, 'html.parser')

2. Execute the following code in your notebook:

txt_dump = soup.text

3. Find the type of txt_dump:

type(txt_dump)

The output is as follows:

str

4. Find the length of txt_dump:

len(txt_dump)

The output is as follows:

15326

5. Now, the length of the text dump is much smaller than the raw HTML string's length. This is because bs4 has parsed through the HTML and extracted only the human-readable text for further processing.

6. Print the initial portion of this text:

print(txt_dump[10000:11000])

You will see something similar to the following:
Figure 7.5: Output showing the initial portion of text


EXTRACTING TEXT FROM A SECTION
Now, let's move on to a more exciting data wrangling task. If you open the Wikipedia home page, you are likely to see a section called From today's featured article. This is an excerpt from the day's prominent article, which is randomly selected and promoted on the home page. In fact, this article can also change throughout the day:

Figure 7.6: Sample Wikipedia page highlighting the "From today's featured article" section

You need to extract the text from this section. There are a number of ways to accomplish this task. We will go through a simple and intuitive method for doing so here.

First, we try to identify two indices – the start index and end index of the string, which demarcate the start and end of the text we are interested in. In the next screenshot, the indices are shown:
Figure 7.7: Wikipedia page highlighting the text to be extracted

The following code accomplishes the extraction:

idx1 = txt_dump.find("From today's featured article")

idx2 = txt_dump.find("Recently featured")

print(txt_dump[idx1+len("From today's featured article"):idx2])

Note that we have to add the length of the "From today's featured article" string to idx1 and then pass that as the starting index. This is because idx1 finds where the "From today's featured article" string starts, not where it ends.

It prints out something like this (this is a sample output):

Figure 7.8: The extracted text

EXTRACTING IMPORTANT HISTORICAL EVENTS THAT HAPPENED ON TODAY'S DATE
Next, we will try to extract the text corresponding to the important historical events that happened on today's date. This can generally be found at the bottom-right corner, as shown in the following screenshot:

Figure 7.9: Wikipedia page highlighting the "On this day" section

So, can we apply the same technique as we did for "From today's featured article"? Apparently not, because there is text just below where we want our extraction to end, and it is not fixed, unlike in the previous case. Note that, in the previous exercise, the fixed string "Recently featured" occurs at the exact place where we wanted the extraction to stop, so we could use it in our code. However, we cannot do that in this case, and the reason for this is illustrated in the following screenshot:
Figure 7.10: Wikipedia page highlighting the text to be extracted

So, in this section, we just want to find out what the text looks like around the main content we are interested in. For that, we must find the start of the string "On this day" and print out the next 1,000 characters, using the following command:

idx3 = txt_dump.find("On this day")

print(txt_dump[idx3+len("On this day"):idx3+len("On this day")+1000])

This looks as follows:

Figure 7.11: Output of the "On this day" section from Wikipedia

To address this issue, we need to think differently and use some other methods from BeautifulSoup (and write another utility function).

EXERCISE 85: USING ADVANCED BS4 TECHNIQUES TO EXTRACT RELEVANT TEXT
HTML pages are made of many markup tags, such as <div>, which denotes a division of text/images, or <ul>, which denotes lists. We can take advantage of this structure and look at the element that contains the text we are interested in. In the Mozilla Firefox browser, we can easily do this by right-clicking and selecting the "Inspect Element" option:

Figure 7.12: Inspecting elements on Wikipedia

As you hover over this with the mouse, you will see different portions of the page being highlighted. By doing this, it is easy to discover the precise block of markup text that is responsible for the textual information we are interested in. Here, we can see that a certain <ul> block contains the text:

Figure 7.13: Identifying the HTML block that contains text

Now, it is prudent to find the <div> tag that contains this <ul> block within it. By looking around the same screen as before, we find the <div> and also its ID:

Figure 7.14: The <ul> tag containing the text

1. Use the find_all method from BeautifulSoup, which scans all the tags of the HTML page (and their sub-elements), to find and extract the text associated with this particular <div> element.

Note

Note how we are utilizing the 'mp-otd' ID of the <div> to identify it among tens of other <div> elements.

The find_all method returns a list of matching tags, each of which has a useful text method associated with it for extraction.

2. To put these ideas together, we will create an empty list and append the text from the matching tags to this list as we traverse the page:

text_list = []  # Empty list

for d in soup.find_all('div'):
    if (d.get('id') == 'mp-otd'):
        for i in d.find_all('ul'):
            text_list.append(i.text)

3. Now, if we examine the text_list list, we will see that it has three elements. If we print the elements, separated by a marker, we will see that the text we are interested in appears as the first element!

for i in text_list:
    print(i)
    print('-'*100)

Note

In this example, it is the first element of the list that we are interested in. However, the exact position will depend on the web page.

The output is as follows:
Figure 7.15: The text highlighted

EXERCISE 86: CREATING A COMPACT FUNCTION TO EXTRACT THE "ON THIS DAY" TEXT FROM THE WIKIPEDIA HOME PAGE
As we discussed before, it is always good to try to functionalize specific tasks, particularly in a web scraping application:

1. Create a function whose only job is to take the URL (as a string) and return the text corresponding to the On this day section. The benefit of such a functional approach is that you can call this function from any Python script and use it anywhere in another program as a standalone module. Start with the function signature and its docstring:

def wiki_on_this_day(url="https://en.wikipedia.org/wiki/Main_Page"):
    """
    Extract the text from the "On this day" section on the
    Wikipedia home page. Accepts the Wikipedia home page URL
    as a string. A default URL is provided.
    """

2. Write the function body: place the request, check its status, decode the contents (using the decode_content and encoding_check functions we wrote earlier), and parse them with BeautifulSoup to extract the first "On this day" block:

    import requests
    from bs4 import BeautifulSoup

    wiki_home = str(url)
    response = requests.get(wiki_home)

    def status_check(r):
        if r.status_code == 200:
            return 1
        else:
            return -1

    status = status_check(response)
    if status == 1:
        contents = decode_content(response, encoding_check(response))
    else:
        print("Sorry could not reach the web page!")
        return -1

    soup = BeautifulSoup(contents, 'html.parser')

    text_list = []
    for d in soup.find_all('div'):
        if (d.get('id') == 'mp-otd'):
            for i in d.find_all('ul'):
                text_list.append(i.text)

    return (text_list[0])

3. Note how this function utilizes the status check and prints out an error message if the request failed. When we test this function with an intentionally incorrect URL, it behaves as expected:

print(wiki_on_this_day("https://en.wikipedia.org/wiki/Main_Page1"))

Sorry could not reach the web page!

Reading Data from XML
XML, or Extensible Markup Language, is a web markup language that's similar to HTML but with significant flexibility (on the part of the user) built in, such as the ability to define your own tags. It was one of the most hyped technologies of the 1990s and early 2000s. It is a meta-language, that is, a language that allows us to define other languages using its mechanics, such as RSS, MathML (a mathematical markup language widely used for web publication and the display of math-heavy technical information), and so on. XML is also heavily used in regular data exchanges over the web, and as a data wrangling professional, you should have enough familiarity with its basic features to tap into the data flow pipeline whenever you need to extract data for your project.

EXERCISE 87: CREATING AN XML FILE AND READING XML ELEMENT OBJECTS
Let's create some random data to understand the XML data format better. Type in the following code snippets:

1. Create an XML string using the following command:

data = '''
<person>
<name>Dave</name>
<surname>Piccardo</surname>
<phone type="intl">
+1 742 101 4456
</phone>
<email hide="yes">
dave.p@gmail.com</email>
</person>'''

2. This is a triple-quoted string or multiline string. If you print this object, you will get the following output. This is an XML-formatted data string in a tree structure, as we will see soon, when we parse the structure and tease apart the individual parts:

Figure 7.16: The XML file output

3. To process and wrangle the data, we have to read it as an Element object using the Python XML parser engine:

import xml.etree.ElementTree as ET

tree = ET.fromstring(data)

type(tree)

The output is as follows:

xml.etree.ElementTree.Element

EXERCISE 88: FINDING VARIOUS ELEMENTS OF DATA WITHIN A TREE (ELEMENT)
We can use the find method to search for various pieces of useful data within an XML Element object and print them (or use them in whatever processing code we want) using the text method. We can also use the get method to extract the specific attribute we want:

1. Use the find method to find the name:

# Print the name of the person
print('Name:', tree.find('name').text)

The output is as follows:

Name: Dave

2. Use the find method to find the surname:

# Print the surname
print('Surname:', tree.find('surname').text)

The output is as follows:

Surname: Piccardo

3. Use the find method to find the phone number. Note the use of the strip method to strip away any trailing spaces/blanks:

# Print the phone number
print('Phone:', tree.find('phone').text.strip())

The output will be as follows:

Phone: +1 742 101 4456

4. Use the find method to find the email status and the actual email. Note the use of the get method to extract the status:

# Print email status and the actual email
print('Email hidden:', tree.find('email').get('hide'))

print('Email:', tree.find('email').text.strip())

The output will be as follows:

Email hidden: yes

Email: dave.p@gmail.com

READING FROM A LOCAL XML FILE INTO AN ELEMENTTREE OBJECT
We can also read from an XML file (saved locally on disk).

This is a fairly common situation, where a frontend web scraping module has already downloaded a lot of XML files by reading a table of data on the web, and now the data wrangler needs to parse through these XML files to extract meaningful pieces of numerical and textual data.

We have a file associated with this chapter, called "xml1.xml". Please make sure you have the file in the same directory that you are running your Jupyter Notebook from:

tree2 = ET.parse('xml1.xml')

type(tree2)

The output will be as follows:

xml.etree.ElementTree.ElementTree

Note how we use the parse method to read this XML file. This is slightly different from the fromstring method used in the previous exercise, where we were directly reading from a string object. This produces an ElementTree object instead of a simple Element.

The idea of building a tree-like object is the same as in the domains of computer science and programming (a short traversal sketch follows the list below):

There is a root

There are children objects attached to the root

There could be multiple levels, that is, children of children, recursively going down

All of the nodes of the tree (root and children alike) have attributes attached to them that contain data

Tree traversal algorithms can be used to search for a particular attribute

If provided, special methods can be used to probe a node deeper
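
Here is a small sketch of what a generic traversal looks like with ElementTree, using the tree2 object from above (the printed tags will of course depend on the contents of xml1.xml):

def walk(element, level=0):
    # Recursively visit every node, printing its tag, attributes, and depth
    print("  " * level, element.tag, element.attrib)
    for child in element:
        walk(child, level + 1)

walk(tree2.getroot())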

EXERCISE 89: TRAVERSING THE TREE, FINDING THE ROOT, AND EXPLORING ALL CHILD NODES AND THEIR TAGS AND ATTRIBUTES
Every node in the XML tree has tags and attributes. The idea is as follows:

Figure 7.17: Finding the root and child nodes of an XML tag

1. Explore these tags and attributes using the following code:

root = tree2.getroot()

for child in root:
    print("Child:", child.tag, "| Child attribute:", child.attrib)

The output will be as follows:

Figure 7.18: The output showing the extracted XML tags

Note

Remember that every XML data file could follow a different naming or structural format, but using an element tree approach puts the data into a somewhat structured flow that can be explored systematically. Still, it is best to examine the raw XML file structure once and understand (even if at a high level) the data format before attempting automatic extractions.

EXERCISE 90: USING THE TEXT METHOD TO EXTRACT MEANINGFUL DATA
We can almost think of the XML tree as a list of lists and index it accordingly:

1. Access the element root[0][2] by using the following code:

root[0][2]

The output will be as follows:

<Element 'gdppc' at 0x00000000051FF278>

So, this points to the 'gdppc' piece of data. Here, 'gdppc' is the tag, and the actual GDP-per-capita data is attached to this tag.

2. Use the text method to access the data:

root[0][2].text

The output will be as follows:

'70617'

3. Use the tag method to access gdppc:

root[0][2].tag

The output will be as follows:

'gdppc'

4. Check root[0]:

root[0]

The output will be as follows:

<Element 'country1' at 0x00000000050298B8>

5. Check the tag:

root[0].tag

The output will be as follows:

'country1'

We can use the attrib method to access its attributes:

root[0].attrib

The output will be as follows:

{'name': 'Norway'}

So, root[0] is again an element, but it has a different set of tags and attributes than root[0][2]. This is expected because they are all part of the tree as nodes, but each is associated with a different level of data.

This last piece of code output is interesting because it returns a dictionary object. Therefore, we can just index it by its keys. We will do that in the next exercise.

EXTRACTING AND PRINTING THE GDP/PER CAPITA INFORMATION USING A LOOP
Now that we know how to read the GDP/per capita data and how to get a dictionary back from the tree, we can easily construct a simple dataset by running a loop over the tree:

for c in root:
    country_name = c.attrib['name']
    gdppc = int(c[2].text)
    print("{}: {}".format(country_name, gdppc))

The output is as follows:

Norway: 70617

Austria: 44857

Israel: 38788

We can put these into a DataFrame or CSV file for saving to a local disk or further processing, such as a simple plot! A short sketch of doing exactly that follows.
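
Here is a minimal sketch of collecting the same values into a pandas DataFrame (our own illustration; the column names and output filename are arbitrary, and it assumes, as in the exercise, that the gdppc node is each country's third child):

import pandas as pd

records = []
for c in root:
    # Each record pairs the country's name attribute with its gdppc child node
    records.append({"country": c.attrib['name'], "gdppc": int(c[2].text)})

df_gdp = pd.DataFrame(records)
df_gdp.to_csv("gdp_per_capita.csv", index=False)  # hypothetical output filename
df_gdp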

EXERCISE 91: FINDING ALL THE NEIGHBORING COUNTRIES FOR EACH COUNTRY AND PRINTING THEM
As we mentioned before, there are efficient search algorithms for tree structures, and one such method for XML trees is findall. We can use this, in this example, to find all the neighbors a country has and print them out.

Why do we need to use findall instead of find? Well, because not all the countries have an equal number of neighbors, and findall searches for all the data with that tag associated with a particular node, and we want to traverse all of them:

for c in root:
    ne = c.findall('neighbor')  # Find all the neighbors
    print("Neighbors\n" + "-"*25)
    for i in ne:  # Iterate over the neighbors and print their 'name' attribute
        print(i.attrib['name'])
    print('\n')

The output looks something like this:

Figure 7.19: The output that's generated by using findall

EXERCISE 92: A SIMPLE DEMO OF USING XML DATA OBTAINED BY WEB SCRAPING
In the last topic of this chapter, we learned about simple web scraping using the requests library. So far, we have worked with static XML data, that is, data from a local file or a string object we've scripted. Now, it is time to combine our learning and read XML data directly over the internet (as you are expected to do almost all the time):

1. We will try to read a cooking recipe from a website called http://www.recipepuppy.com/, which aggregates links to various other sites with the recipe:

import urllib.request, urllib.parse, urllib.error

serviceurl = 'http://www.recipepuppy.com/api/?'

item = str(input('Enter the name of a food item (enter \'quit\' to quit): '))

url = serviceurl + urllib.parse.urlencode({'q':item}) + '&p=1&format=xml'

uh = urllib.request.urlopen(url)

data = uh.read().decode()

print('Retrieved', len(data), 'characters')

tree3 = ET.fromstring(data)

2. This code will ask the user for input. You have to enter the name of a food item, for example, 'chicken tikka':

Figure 7.20: Demo of scraping from XML data

3. We get back data in XML format and read and decode it before creating an XML tree out of it:

data = uh.read().decode()

print('Retrieved', len(data), 'characters')

tree3 = ET.fromstring(data)

4. Now, we can use another useful method, called iter, which basically iterates over the nodes under an element. If we traverse the tree and extract the text, we get the following output:

for elem in tree3.iter():
    print(elem.text)

The output is as follows:

Figure 7.21: The output that's generated by using iter

5. We can use the find method to search for the appropriate attribute and extract its content. This is the reason it is important to scan through the XML data manually and check what attributes are used. Remember, this means scanning the raw string data, not the tree structure.

6. Print the raw string data:

Figure 7.22: The output showing the extracted href tags
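The command that produced this figure is not reproduced in the text; a minimal sketch, assuming data holds the decoded XML string from step 1, is simply:

# Inspect only the first part of the raw XML string to spot the tags in use
print(data[:2000])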

Now we know what tags to search for.

7. Print all the hyperlinks in the XML data:

for e in tree3.iter():
    h = e.find('href')
    t = e.find('title')
    if h != None and t != None:
        print("Recipe Link for:", t.text)
        print(h.text)
        print("-"*100)

Note the use of h != None and t != None. Such cases are difficult to anticipate when you first run this kind of code. You may get an error because some of the tags may return a None object, that is, they were empty for some reason in this XML data stream. This kind of situation is fairly common and cannot be anticipated beforehand. You have to use your Python knowledge and programming intuition to get around it if you receive such an error. Here, we are just checking the type of the object and, if it is not None, then we extract the text associated with it.

The final output is as follows:


Figure 7.23: The output showing the final output

Reading Data from an API

Fundamentally, an API or Application Programming Interface is some kind of interface to a computing resource (for example, an operating system or database table), which has a set of exposed methods (function calls) that allow a programmer to access particular data or internal features of that resource.

A web API is, as the name suggests, an API over the web. Note that it is not a specific technology or programming framework, but an architectural concept. Think of an API like a fast food restaurant's customer service center. Internally, there are many food items, raw materials, cooking resources, and recipe management systems, but all you see are fixed menu items on the board and you can only interact through those items. It is like a port that can be accessed using an HTTP protocol and is able to deliver data and services if used properly.

Web APIs are extremely popular these days for all kinds of data services. In the very first chapter, we talked about how UC San Diego's data science team pulls data from Twitter feeds to analyze the occurrence of forest fires. For this, they do not go to twitter.com and scrape the data by looking at HTML pages and text. Instead, they use the Twitter API, which sends this data continuously in a streaming format.

Therefore, it is very important for a data wrangling professional to understand the basics of data extraction from a web API, as you are extremely likely to find yourself in a situation where large quantities of data must be read through an API interface for processing and wrangling. These days, most APIs stream data out in JSON format. In this chapter, we will use a free API to read some information about various countries around the world in JSON format and process it.

We will use Python's built-in urllib module for this topic, along with pandas to make a DataFrame. So, we can import them now. We will also import Python's JSON module:

import urllib.request, urllib.parse

from urllib.error import HTTPError, URLError

import json

import pandas as pd

DEFINING THE BASE URL (OR API ENDPOINT)

First, we need to set the base URL. When we are dealing with API microservices, this is often called the API endpoint. Therefore, look for such a phrase in the web service portal you are interested in and use the endpoint URL they give you:

serviceurl = 'https://restcountries.eu/rest/v2/name/'

API-based microservices are extremely dynamic in nature in terms of what and how they offer their service and data. It can change at any time. At the time of planning this chapter, we found this particular API to be a nice choice for extracting data easily and without using authorization keys (login or special API keys).

For most APIs, however, you need to have your own API key. You get that by registering with their service. Basic usage (up to a fixed number of requests or a data flow limit) is often free, but after that you will be charged. To register for an API key, you often need to enter credit card information.

We wanted to avoid all that hassle to teach you the basics and that's why we chose this example, which does not require such authorization. But, depending on what kind of data you will encounter in your work, please be prepared to learn about using an API key.
EXERCISE 93: DEFINING AND TESTING A FUNCTION TO PULL COUNTRY DATA FROM AN API

This particular API serves basic information about countries around the world:

1. Define a function to pull out data when we pass the name of a country as an argument. The crux of the operation is contained in the following two lines of code:

url = serviceurl + country_name

uh = urllib.request.urlopen(url)

2. The first line of code appends the country name as a string to the base URL and the second line sends a GET request to the API endpoint. If all goes well, we get back the data, decode it, and read it as a JSON file. This whole exercise is coded in the following function, along with some error-handling code wrapped around the basic actions we talked about previously:

def get_country_data(country):
    """
    Function to get data about a country from the
    "https://restcountries.eu" API
    """
    country_name = str(country)
    url = serviceurl + country_name
    try:
        uh = urllib.request.urlopen(url)
    except HTTPError as e:
        print("Sorry! Could not retrieve anything on {}".format(country_name))
        return None
    except URLError as e:
        print('Failed to reach a server.')
        print('Reason: ', e.reason)
        return None
    else:
        data = uh.read().decode()
        print("Retrieved data on {}. Total {} characters read.".format(country_name, len(data)))
        return data

3. Test this function by passing some arguments. We pass a correct name and an erroneous name. The response is as follows:

Note

This is an example of rudimentary error handling. You have to think about various possibilities and put in such code to catch and gracefully respond to user input when you are building a real-life web or enterprise application.
Figure 7.24: Input arguments

USING THE BUILT-IN JSON LIBRARY TO READ AND EXAMINE DATA

As we have already mentioned, JSON looks a lot like a Python dictionary.

In this exercise, we will use Python's json module to read raw data in that format and see what we can process further:

x=json.loads(data)

y=x[0]

type(y)

The output will be as follows:

dict

So, we get a list back when we use the loads method from the json module. It reads the string datatype into a list of dictionaries. In this case, we get only one element in the list, so we extract that and check its type to make sure it is a dictionary.

We can quickly check the keys of the dictionary, that is, the JSON data (note that a full screenshot is not shown here). We can see the relevant country data, such as calling codes, population, area, time zones, borders, and so on:
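The exact command is not reproduced here; a minimal sketch is just asking the dictionary for its keys:

y.keys()  # Lists the top-level keys of the country record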

Figure 7.25: The output of dict_keys

PRINTING ALL THE DATA ELEMENTS

This task is extremely simple given that we have a dictionary at our disposal! All we have to do is iterate over the dictionary and print the key/value pairs one by one:

for k, v in y.items():
    print("{}: {}".format(k, v))

The output is as follows:
Figure 7.26: The output using dict

Note that the items in the dictionary are not of the same type, that is, they are not similar objects. Some are floating-point numbers, such as the area, many are simple strings, but some are lists or even lists of dictionaries!

This is fairly common with JSON data. The internal data structure of JSON can be arbitrarily complex and multilevel, that is, you can have a dictionary of lists of dictionaries of dictionaries of lists of lists… and so on.

Note

It is clear, therefore, that there is no universal method or processing function for the JSON data format, and you have to write custom loops and functions to extract data from such a dictionary object based on your particular needs.

Now, we will write a small loop to extract the languages spoken in Switzerland. First, let's examine the dictionary closely and see where the language data is:

Figure 7.27: The tags

So, the data is embedded inside a list of dictionaries, which is accessed by a particular key of the main dictionary.

We can write a simple two-line piece of code to extract this data:

for lang in y['languages']:
    print(lang['name'])

The output is as follows:

Figure 7.28: The output showing the languages


USING A FUNCTION THAT EXTRACTS A DATAFRAME CONTAINING KEY INFORMATION

Here, we are interested in writing a function that can take a list of countries and return a pandas DataFrame with some key information:

Capital

Region

Sub-region

Population

Latitude/longitude

Area

Gini index

Time zones

Currencies

Languages

Note

This is the kind of wrapper function you are generally expected to write in real-life data wrangling tasks, that is, a utility function that can take a user argument and output a useful data structure (or a mini database-type object) with key information extracted over the internet about the item the user is interested in.

We will show you the whole function first and then discuss some key points about it. It is a slightly complex and long piece of code. However, based on your Python-based data wrangling knowledge, you should be able to examine this function closely and understand what it is doing:

import pandas as pd

import json

def build_country_database(list_country):
    """
    Takes a list of country names.
    Output a DataFrame with key information about those countries.
    """
    # Define an empty dictionary with keys
    country_dict = {'Country': [], 'Capital': [], 'Region': [], 'Sub-region': [], 'Population': [],
                    'Lattitude': [], 'Longitude': [], 'Area': [], 'Gini': [], 'Timezones': [],
                    'Currencies': [], 'Languages': []}

Note

The code has been truncated here. Please find the entire code at the following GitHub link and code bundle folder link: https://github.com/TrainingByPackt/Data-Wrangling-with-Python/blob/master/Chapter07/Exercise93-94/Chapter%207%20Topic%203%20Exercises.ipynb.

Here are some of the key points about this function:

It starts by building an empty dictionary of lists. This is the chosen format for finally passing to the pandas DataFrame method, which can accept such a format and returns a nice DataFrame with column names set to the dictionary keys' names.

We use the previously defined get_country_data function to extract data for each country in the user-defined list. For this, we simply iterate over the list and call this function.

We check the output of the get_country_data function. If, for some reason, it returns a None object, we will know that the API reading was not successful, and we will print out a suitable message. Again, this is an example of an error-handling mechanism and you must have them in your code. Without such small error-checking code, your application won't be robust enough for the occasional incorrect input or API malfunction!

For many data types, we simply extract the data from the main JSON dictionary and append it to the corresponding list in our data dictionary.

However, for special data types, such as time zones, currencies, and languages, we write a special loop to extract the data without error.

We also take care of the fact that these special data types can have a variable length, that is, some countries may have multiple spoken languages, but most will have only one entry. So, we check whether the length of the list is greater than one and handle the data accordingly (a minimal sketch of this idea is shown after this list).
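The full function is in the linked notebook; the following is only a minimal sketch of how such a variable-length field could be handled. The names data_dict (one country's JSON record) and the comma-joining of several names are illustrative assumptions here, not the book's exact code:

# Sketch: collapse a list of language dictionaries into one display string
langs = data_dict.get('languages', [])
if len(langs) > 1:
    # Several entries: join all the names into a single comma-separated string
    country_dict['Languages'].append(', '.join(l['name'] for l in langs))
elif len(langs) == 1:
    country_dict['Languages'].append(langs[0]['name'])
else:
    country_dict['Languages'].append(None)  # Nothing reported for this country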

EXERCISE 94: TESTING THE FUNCTION BY BUILDING A SMALL DATABASE OF COUNTRIES' INFORMATION

Finally, we test this function by passing a list of country names:

1. To test its robustness, we pass in an erroneous name – such as 'Turmeric' in this case!

See the output… it detected that it did not get any data back for the incorrect entry and printed out a suitable message. The key is that, if you do not have the error checking and handling code in your function, then it will stop execution on that entry and will not return the expected mini database. To avoid this behavior, such error-handling code is invaluable:

Figure 7.29: The incorrect entry highlighted

2. Finally, the output is a pandas DataFrame, which is as follows:
Figure 7.30: The data extracted correctly

Fundamentals of Regular Expressions (RegEx)

Regular expressions or regex are used to identify whether a pattern exists in a given sequence of characters (a string) or not. They help in manipulating textual data, which is often a prerequisite for data science projects that involve text mining.
REGEX IN THE CONTEXT OF WEB SCRAPING

Web pages are often full of text and, while there are some methods in BeautifulSoup or the XML parser to extract raw text, there is no method for the intelligent analysis of that text. If, as a data wrangler, you are looking for a particular piece of data (for example, email IDs or phone numbers in a special format), you have to do a lot of string manipulation on a large corpus to extract email IDs or phone numbers. RegEx are very powerful and save a data wrangling professional a lot of time and effort with string manipulation because they can search for complex textual patterns with wildcards of an arbitrary length.

RegEx is like a mini-programming language in itself, and its common ideas are used not only in Python, but in all widely used web app languages like JavaScript, PHP, Perl, and so on. The RegEx module is built into Python, and you can import it by using the following code:

import re

EXERCISE 95: USING THE MATCH METHOD TO CHECK WHETHER A PATTERN MATCHES A STRING/SEQUENCE

One of the most common regex methods is match. This is used to check for an exact or partial match at the beginning of the string (by default):

1. Import the RegEx module:

import re

2. Define a string and a pattern:

string1 = 'Python'

pattern = r"Python"

3. Write a conditional expression to check for a match:

if re.match(pattern, string1):
    print("Matches!")
else:
    print("Doesn't match.")

The preceding code should give an affirmative answer, that is, "Matches!".

4. Test this with a string that only differs in the first letter by making it lowercase:

string2 = 'python'

if re.match(pattern, string2):
    print("Matches!")
else:
    print("Doesn't match.")

The output is as follows:

Doesn't match.

USING THE COMPILE METHOD TO CREATE A REGEX PROGRAM

In a program or module, if we are making heavy use of a particular pattern, then it is better to use the compile method to create a regex program and then call methods on this program.

Here is how you compile a regex program:

prog = re.compile(pattern)

prog.match(string1)

The output is as follows:

<_sre.SRE_Match object; span=(0, 6), match='Python'>

This code produced an SRE.Match object that has a span of (0, 6) and the matched string of 'Python'. The span here simply denotes the start and end indices of the pattern that was matched. These indices may come in handy in a text mining program where the subsequent code uses the indices for further search or decision-making purposes. We will see some examples of that later.
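For instance, a minimal sketch of putting those indices to work (reusing prog and string1 from above) is:

m = prog.match(string1)
start, end = m.span()       # (0, 6) for this match
print(string1[start:end])   # Slices the matched portion back out of the original string: 'Python'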

EXERCISE 96: COMPILING PROGRAMS TO MATCH OBJECTS

Compiled objects act like functions in that they return None if the pattern does not match. Here, we are going to check that by writing a simple conditional. This concept will come in handy later when we write a small utility function to check the type of the returned object from regex-compiled programs and act accordingly. We cannot be sure whether a pattern will match a given string or whether it will appear in a corpus of text (if we are searching for the pattern anywhere within the text). Depending on the situation, we may encounter Match objects or None as the returned value, and we have to handle this gracefully:

#string1 = 'Python'

#string2 = 'python'

#pattern = r"Python"
1. Use the compile function in RegEx:

prog = re.compile(pattern)

2. Match it with the first string:

if prog.match(string1) != None:
    print("Matches!")
else:
    print("Doesn't match.")

The output is as follows:

Matches!

3. Match it with the second string:

if prog.match(string2) != None:
    print("Matches!")
else:
    print("Doesn't match.")

The output is as follows:

Doesn't match.
EXERCISE 97: USING ADDITIONAL PARAMETERS IN MATCH TO CHECK FOR POSITIONAL MATCHING

By default, match looks for pattern matching at the beginning of the given string. But sometimes, we need to check matching at a specific location in the string:

1. Match 'y' at the second position:

prog = re.compile(r'y')

prog.match('Python', pos=1)

The output is as follows:

<_sre.SRE_Match object; span=(1, 2), match='y'>

2. Check for the pattern thon starting from pos=2, that is, the third character:

prog = re.compile(r'thon')

prog.match('Python', pos=2)

The output is as follows:

<_sre.SRE_Match object; span=(2, 6), match='thon'>

3. Find a match in a different string by using the following command:

prog.match('Marathon', pos=4)

The output is as follows:

<_sre.SRE_Match object; span=(4, 8), match='thon'>
FINDING THE NUMBER OF WORDS IN A LIST THAT END WITH "ING"

Suppose we want to find out if a given string has the last three letters 'ing'. This kind of query may come up in a text analytics/text mining program where somebody is interested in finding instances of present continuous tense words, which are highly likely to end with 'ing'. However, other nouns may also end with 'ing' (as we will see in this example):

prog = re.compile(r'ing')

words = ['Spring','Cycling','Ringtone']

Create a for loop to find words ending with 'ing':

for w in words:
    if prog.match(w, pos=len(w)-3) != None:
        print("{} has last three letters 'ing'".format(w))
    else:
        print("{} does not have last three letter as 'ing'".format(w))

The output is as follows:

Spring has last three letters 'ing'

Cycling has last three letters 'ing'

Ringtone does not have last three letter as 'ing'

Note

It looks plain and simple, and you may well wonder what the purpose of using a special regex module for this is. A simple string method should have been sufficient. Yes, it would have been OK for this particular example, but the whole point of using regex is to be able to use very complex string patterns that are not at all obvious when it comes to how they are written using simple string methods. We will see the real power of regex compared to string methods shortly. But before that, let's explore another of the most commonly used methods, called search.

EXERCISE 98: THE SEARCH METHOD IN REGEX

search and match are related concepts and they both return the same Match object. The real difference between them is that match works for only the first match (either at the beginning of the string or at a specified position, as we saw in the previous exercises), whereas search looks for the pattern anywhere in the string and returns the appropriate position if it finds a match:

1. Use the compile method to find matching strings:

prog = re.compile('ing')

if prog.match('Spring') == None:
    print("None")

2. The output is as follows:

None

3. Search the string by using the following command:

prog.search('Spring')

<_sre.SRE_Match object; span=(3, 6), match='ing'>

prog.search('Ringtone')

<_sre.SRE_Match object; span=(1, 4), match='ing'>

As you can see, the match method returns None for the input 'Spring', and we had to write code to print that out explicitly (because in a Jupyter notebook, nothing will show up for a None object). But search returns a Match object with span=(3, 6), as it finds the ing pattern spanning those positions.

Similarly, for the Ringtone string, it finds the correct position of the match and returns span=(1, 4).

EXERCISE 99: USING THE SPAN METHOD OF THE MATCH OBJECT TO LOCATE THE POSITION OF THE MATCHED PATTERN

As you will understand by now, the span contained in the Match object is useful for locating the exact position of the pattern as it appears in the string.

1. Initialize prog with the pattern ing:

prog = re.compile(r'ing')

words = ['Spring', 'Cycling', 'Ringtone']

2. Loop over the words and use span to get the start and end positions of each match:

for w in words:
    mt = prog.search(w)
    # Span returns a tuple of start and end positions of the match
    start_pos = mt.span()[0]  # Starting position of the match
    end_pos = mt.span()[1]    # Ending position of the match

3. Still inside the loop, print each word along with the start and end positions of its 'ing' match:

    print("The word '{}' contains 'ing' in the position {}-{}".format(w, start_pos, end_pos))

The output is as follows:

The word 'Spring' contains 'ing' in the position 3-6

The word 'Cycling' contains 'ing' in the position 4-7

The word 'Ringtone' contains 'ing' in the position 1-4
EXERCISE 100: EXAMPLES OF SINGLE CHARACTER PATTERN MATCHING WITH SEARCH

Now, we will start getting into the real usage of regex with examples of various useful pattern matching. First, we will explore single-character matching. We will also use the group method, which essentially returns the matched pattern in a string format so that we can print and process it easily:

1. Dot (.) matches any single character except a newline character:

prog = re.compile(r'py.')

print(prog.search('pygmy').group())

print(prog.search('Jupyter').group())

The output is as follows:

pyg

pyt

2. \w (lowercase w) matches any single letter, digit, or underscore:

prog = re.compile(r'c\wm')

print(prog.search('comedy').group())

print(prog.search('camera').group())

print(prog.search('pac_man').group())

print(prog.search('pac2man').group())

The output is as follows:

com

cam

c_m

c2m
3. \W (uppercase W) matches anything not covered by \w:

prog = re.compile(r'4\W1')

print(prog.search('4/1 was a wonderful day!').group())

print(prog.search('4-1 was a wonderful day!').group())

print(prog.search('4.1 was a wonderful day!').group())

print(prog.search('Remember the wonderful day 04/1?').group())

The output is as follows:

4/1

4-1

4.1

4/1

4. \s (lowercase s) matches a single whitespace character, such as a space, newline, tab, or return:

prog = re.compile(r'Data\swrangling')

print(prog.search("Data wrangling is cool").group())

print("-"*80)

print("Data\twrangling is the full string")

print(prog.search("Data\twrangling is the full string").group())

print("-"*80)

print("Data\nwrangling is the full string")

print(prog.search("Data\nwrangling").group())

The output is as follows:

Data wrangling

--------------------------------------------------------------------------------

Data    wrangling is the full string

Data    wrangling

--------------------------------------------------------------------------------

Data
wrangling is the full string

Data
wrangling

5. \d matches numerical digits 0–9:

prog = re.compile(r"score was \d\d")

print(prog.search("My score was 67").group())

print(prog.search("Your score was 73").group())

The output is as follows:

score was 67

score was 73

EXERCISE 101: EXAMPLES OF PATTERN MATCHING AT THE START OR END OF A STRING

In this exercise, we will match patterns with strings. The focus is to find out whether the pattern is present at the start or the end of the string:

1. Write a function to handle cases where a match is not found, that is, to handle None objects as returns:

def print_match(s):
    if prog.search(s) == None:
        print("No match")
    else:
        print(prog.search(s).group())

2. Use ^ (caret) to match a pattern at the start of the string:

prog = re.compile(r'^India')

print_match("Russia implemented this law")

print_match("India implemented that law")

print_match("This law was implemented by India")

The output is as follows:

No match

India

No match

3. Use $ (dollar sign) to match a pattern at the end of the string:

prog = re.compile(r'Apple$')

print_match("Patent no 123456 belongs to Apple")

print_match("Patent no 345672 belongs to Samsung")

print_match("Patent no 987654 belongs to Apple")

The output is as follows:

Apple

No match

Apple

EXERCISE 102: EXAMPLES OF PATTERN MATCHING WITH MULTIPLE CHARACTERS

Now, we will turn to more exciting and useful pattern matching with examples of multiple-character matching. You should start seeing and appreciating the real power of regex by now.

Note:

For these examples and exercises, also try to think how you would implement them without regex, that is, by using simple string methods and any other logic that you can think of. Then, compare that solution to the one implemented with regex for brevity and efficiency.

1. Use * to match 0 or more repetitions of the preceding RE:

prog = re.compile(r'ab*')

print_match("a")

print_match("ab")

print_match("abbb")

print_match("b")

print_match("bbab")

print_match("something_abb_something")

The output is as follows:

a

ab

abbb

No match

ab

abb

2. Using + causes the resulting RE to match 1 or more repetitions of the preceding RE:

prog = re.compile(r'ab+')

print_match("a")

print_match("ab")

print_match("abbb")

print_match("b")

print_match("bbab")

print_match("something_abb_something")

The output is as follows:

No match

ab

abbb

No match

ab

abb

3. ? causes the resulting RE to match precisely 0 or 1 repetitions of the preceding RE:

prog = re.compile(r'ab?')

print_match("a")

print_match("ab")

print_match("abbb")

print_match("b")

print_match("bbab")

print_match("something_abb_something")

The output is as follows:

a

ab

ab

No match

ab

ab

EXERCISE 103: GREEDY VERSUS NON-GREEDY MATCHING

The standard (default) mode of pattern matching in regex is greedy, that is, the program tries to match as much as it can. Sometimes, this behavior is natural, but, in some cases, you may want to match minimally:

1. The greedy way of matching a string is as follows:

prog = re.compile(r'<.*>')

print_match('<a> b <c>')

The output is as follows:

<a> b <c>

2. So, the preceding regex found both tags with the < > pattern, but what if we wanted to match the first tag only and stop there? We can use ? by inserting it after any regex expression to make it non-greedy:

prog = re.compile(r'<.*?>')

print_match('<a> b <c>')

The output is as follows:

<a>

EXERCISE 104: CONTROLLING REPETITIONS TO MATCH

In many situations, we want to have precise control over how many repetitions of the pattern we want to match in a text. This can be done in a few ways, which we will show examples of here:
1. {m} specifies exactly m copies of RE to match. Fewer matches cause a non-match and return None:

prog = re.compile(r'A{3}')

print_match("ccAAAdd")

print_match("ccAAAAdd")

print_match("ccAAdd")

The output is as follows:

AAA

AAA

No match

2. {m,n} specifies exactly m to n copies of RE to match:

prog = re.compile(r'A{2,4}B')

print_match("ccAAABdd")

print_match("ccABdd")

print_match("ccAABBBdd")

print_match("ccAAAAAAABdd")

The output is as follows:

AAAB

No match

AAB

AAAAB

3. Omitting m specifies a lower bound of zero:

prog = re.compile(r'A{,3}B')

print_match("ccAAABdd")

print_match("ccABdd")

print_match("ccAABBBdd")

print_match("ccAAAAAAABdd")

The output is as follows:

AAAB

AB

AAB

AAAB

4. Omitting n specifies an infinite upper bound:

prog = re.compile(r'A{3,}B')

print_match("ccAAABdd")

print_match("ccABdd")

print_match("ccAABBBdd")

print_match("ccAAAAAAABdd")

The output is as follows:

AAAB

No match

No match

AAAAAAAB

5. {m,n}? specifies m to n copies of RE to match in a non-greedy fashion:

prog = re.compile(r'A{2,4}')

print_match("AAAAAAA")

prog = re.compile(r'A{2,4}?')

print_match("AAAAAAA")

The output is as follows:

AAAA

AA

EXERCISE 105: SETS OF MATCHING CHARACTERS

To match an arbitrarily complex pattern, we need to be able to include a logical combination of characters together as a bunch. Regex gives us that kind of capability:

1. The following examples demonstrate such uses of regex. [x,y,z] matches x, y, or z:

prog = re.compile(r'[A,B]')

print_match("ccAd")

print_match("ccABd")

print_match("ccXdB")

print_match("ccXdZ")

The output will be as follows:

A

A

B

No match

A range of characters can be matched inside the set using -. This is one of the most widely used regex techniques!

2. Suppose we want to pick out an email address from a text. Email addresses are generally of the form <some name>@<some domain name>.<some domain identifier>:

prog = re.compile(r'[a-zA-Z]+@+[a-zA-Z]+\.com')

print_match("My email is coolguy@xyz.com")

print_match("My email is coolguy12@xyz.com")

The output is as follows:

coolguy@xyz.com

No match
Look at the regex pattern inside the [ … ]. It is 'a-zA-Z'. This covers all alphabets, including lowercase and uppercase! With this one simple regex, you are able to match any (pure) alphabetical string for that part of the email. Now, the next pattern is '@', which is added to the previous regex by a '+' character. This is the way to build up a complex regex: by adding/stacking up individual regex patterns. We also use the same [a-zA-Z] for the email domain name and add a '.com' at the end to complete the pattern as a valid email address. Why \.? Because, by itself, DOT (.) is used as a special modifier in regex, but here we want to use DOT (.) just as DOT (.), not as a modifier. So, we need to precede it with a '\'.

3. So, with this regex, we could extract the first email address perfectly but got 'No match' with the second one.

4. What happened with the second email ID?

5. The regex could not capture it because it had the number '12' in the name! That pattern is not captured by the expression [a-zA-Z].

6. Let's change that and add the digits as well:

prog = re.compile(r'[a-zA-Z0-9]+@+[a-zA-Z]+\.com')

print_match("My email is coolguy12@xyz.com")

print_match("My email is coolguy12@xyz.org")

The output is as follows:

coolguy12@xyz.com

No match

Now, we catch the first email ID perfectly. But what's going on with the second one? Again, we got a mismatch. The reason is that we changed the .com to .org in that email, and in our regex expression, that portion was hardcoded as .com, so it did not find a match.

7. Let's try to address this in the following regex:

prog = re.compile(r'[a-zA-Z0-9]+@+[a-zA-Z]+\.+[a-zA-Z]{2,3}')

print_match("My email is coolguy12@xyz.org")

print_match("My email is coolguy12[AT]xyz[DOT]org")

The output is as follows:

coolguy12@xyz.org

No match

8. In this regex, we used the fact that most domain identifiers have 2 or 3 characters, so we used [a-zA-Z]{2,3} to capture that.

What happened with the second email ID? This is an example of the small tweaks that you can make to stay ahead of telemarketers who want to scrape online forums or any other corpus of text and extract your email ID. If you do not want your email to be found, you can change @ to [AT] and . to [DOT], and hopefully that can beat some regex techniques (but not all)!
EXERCISE 106: THE USE OF OR IN REGEX USING THE OR OPERATOR

Because regex patterns are like complex and compact logical constructors themselves, it makes perfect sense that we want to combine them to construct even more complex programs when needed. We can do that by using the | operator:

1. The following example demonstrates the use of the OR operator:

prog = re.compile(r'[0-9]{10}')

print_match("3124567897")

print_match("312-456-7897")

The output is as follows:

3124567897

No match

So, here, we are trying to extract patterns of 10-digit numbers that could be phone numbers. Note the use of {10} to denote exactly 10-digit numbers in the pattern. But the second number could not be matched for obvious reasons – it had '-' symbols inserted in between groups of numbers.

2. Use multiple smaller regexes and logically combine them by using the following command:

prog = re.compile(r'[0-9]{10}|[0-9]{3}-[0-9]{3}-[0-9]{4}')

print_match("3124567897")

print_match("312-456-7897")

The output is as follows:

3124567897

312-456-7897

Phone numbers are written in a myriad of ways and, if you search on the web, you will see examples of very complex regexes (written not only in Python but in other widely used languages for web apps, such as JavaScript, C++, PHP, Perl, and so on) for capturing phone numbers.

3. Create four pattern strings, combine them with |, and execute print_match on several phone number formats:

p1 = r'[0-9]{10}'

p2 = r'[0-9]{3}-[0-9]{3}-[0-9]{4}'

p3 = r'\([0-9]{3}\)[0-9]{3}-[0-9]{4}'

p4 = r'[0-9]{3}\.[0-9]{3}\.[0-9]{4}'

pattern = p1 + '|' + p2 + '|' + p3 + '|' + p4

prog = re.compile(pattern)

print_match("3124567897")

print_match("312-456-7897")

print_match("(312)456-7897")

print_match("312.456.7897")

The output is as follows:

3124567897

312-456-7897

(312)456-7897

312.456.7897

THE FINDALL METHOD

The last regex method that we will learn in this chapter is findall. Essentially, it is a search-and-aggregate method, that is, it puts all the instances that match the regex pattern in a given text into a list and returns them. This is extremely useful, as we can just count the length of the returned list to count the number of occurrences, or pick and use the returned pattern-matched words one by one as we see fit.

Note that, although we are giving short examples of single sentences in this chapter, you will often deal with a large corpus of text when using a RegEx.

In those cases, you are likely to get many matches from a single regex pattern search. For all of those cases, the findall method is going to be the most useful:

ph_numbers = """Here are some phone numbers.
Pick out the numbers with 312 area code:
312-423-3456, 456-334-6721, 312-5478-9999,
312-Not-a-Number,777.345.2317, 312.331.6789"""

print(ph_numbers)

re.findall('312+[-\.][0-9-\.]+', ph_numbers)

The output is as follows:

Here are some phone numbers.

Pick out the numbers with 312 area code:

312-423-3456, 456-334-6721, 312-5478-9999,

312-Not-a-Number,777.345.2317,
312.331.6789

['312-423-3456', '312-5478-9999', '312.331.6789']
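Since findall returns a plain list, counting the occurrences is just a matter of taking its length; a minimal sketch, reusing the pattern and text above, is:

matches = re.findall('312+[-\.][0-9-\.]+', ph_numbers)
print(len(matches))  # Number of 312-area-code numbers found; 3 for the text above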

ACTIVITY 9: EXTRACTING THE TOP 100 EBOOKS FROM GUTENBERG

Project Gutenberg encourages the creation and distribution of eBooks by encouraging volunteer efforts to digitize and archive cultural works. This activity aims to scrape the URL of Project Gutenberg's Top 100 eBooks to identify the eBooks' links. It uses BeautifulSoup4 to parse the HTML and regular expression code to identify the Top 100 eBook file numbers.

You can use those book ID numbers to download the books onto your local drive if you want.

Head over to the supplied Jupyter notebook (in the GitHub repository) to work on this activity.

These are the steps that will help you solve this activity:

1. Import the necessary libraries, including regex and beautifulsoup.

2. Check the SSL certificate.

3. Read the HTML from the URL.

4. Write a small function to check the status of the web request.

5. Decode the response and pass this on to BeautifulSoup for HTML parsing.

6. Find all the href tags and store them in the list of links. Check what the list looks like – print the first 30 elements.

7. Use a regular expression to find the numeric digits in these links. These are the file numbers for the top 100 eBooks.

8. Initialize an empty list to hold the file numbers, loop over an appropriate range, and use regex to find the numeric digits in the link href strings. Use the findall method.

9. What does the soup object's text look like? Use the .text method and print only the first 2,000 characters (do not print the whole thing, as it is too long).

10. Search in the extracted text (using a regular expression) from the soup object to find the names of the top 100 eBooks (yesterday's ranking).

11. Create a starting index. It should point at the text Top 100 Ebooks yesterday. Use the splitlines method of soup.text. It splits the lines of text of the soup object.

12. Loop 1-100 to add the strings of the next 100 lines to a temporary list. Hint: use the splitlines method.

13. Use a regular expression to extract only the text from the name strings and append it to an empty list. Use match and span to find the indices and use them.

Note

The solution for this activity can be found on page 315.

ACTIVITY 10: BUILDING YOUR OWN MOVIE DATABASE BY READING AN API

In this activity, you will build a complete movie database by communicating and interfacing with a free API. You will learn about obtaining a unique user key that must be used when your program tries to access the API. This activity will teach you general lessons about working with an API, which are fairly common for other highly popular API services such as Google or Twitter. Therefore, after doing this exercise, you will be confident about writing more complex programs to scrape data from such services.

The aims of this activity are as follows:

To retrieve and print basic data about a movie (the title is entered by the user) from the web (OMDb database)

If a poster of the movie can be found, it downloads the file and saves it at a user-specified location

These are the steps that will help you solve this activity:

1. Import urllib.request, urllib.parse, urllib.error, and json.

2. Load the secret API key (you have to get one from the OMDb website and use that; it has a daily limit of 1,000) from a JSON file stored in the same folder into a variable, by using json.loads.

3. Obtain a key and store it in JSON as APIkeys.json.

4. Open the APIkeys.json file.

5. Assign the OMDb portal (http://www.omdbapi.com/?) as a string to a variable.

6. Create a variable called apikey with the last portion of the URL (&apikey=secretapikey), where secretapikey is your own API key.

7. Write a utility function called print_json to print the movie data from a JSON file (which we will get from the portal).

8. Write a utility function to download a poster of the movie based on the information from the JSON dataset and save it in your local folder. Use the os module. The poster data is stored in the JSON key Poster. Use the Python command to open a file and write the poster data. Close the file after you're done. This function will save the poster data as an image file.

9. Write a utility function called search_movie to search for a movie by its name, print the downloaded JSON data, and save the movie poster in the local folder. Use a try-except loop for this. Use the previously created serviceurl and apikey variables. You have to pass a dictionary with a key, t, and the movie name as the corresponding value to the urllib.parse.urlencode() function and then add the serviceurl and apikey to the output of the function to construct the full URL. This URL will be used to access the data. The JSON data has a key called Response. If it is True, that means the read was successful. Check this before processing the data. If it's not successful, then print the JSON key Error, which will contain the appropriate error message returned by the movie database.

10. Test the search_movie function by entering Titanic.

11. Test the search_movie function by entering "Random_error" (obviously, this will not be found, and you should be able to check whether your error-catching code is working properly).

Note:

The solution for this activity can be found on page 320.
Summary

In this chapter, we went through several important concepts and learning modules related to advanced data gathering and web scraping. We started by reading data from web pages using two of the most popular Python libraries – requests and BeautifulSoup. In this task, we utilized the previous chapter's knowledge about the general structure of HTML pages and their interaction with Python code. We extracted meaningful data from the Wikipedia home page during this process.

Then, we learned how to read data from XML and JSON files, two of the most widely used data streaming/exchange formats on the web. For the XML part, we showed you how to traverse the tree-structured data string efficiently to extract key information. For the JSON part, we mixed it with reading data from the web using an API (Application Programming Interface). The API we consumed was RESTful, which is one of the major standards in web APIs.

At the end of this chapter, we went through a detailed exercise of using regex techniques in tricky string-matching problems to scrape useful information from a large and messy text corpus, parsed from HTML. This chapter should come in extremely handy for string and text processing tasks in your data wrangling career.

In the next chapter, we will learn about databases with Python.
Chapter 8
RDBMS and SQL
Learning Objectives
By the end of this chapter, you will be able to:

Apply the basics of RDBMS to query databases using Python

Convert data from SQL into a pandas DataFrame

This chapter explains the concepts of databases, including their creation, manipulation, and control, and transforming tables into pandas DataFrames.

Introduction
This chapter of our data journey is focused on RDBMS (Relational Database Management Systems) and SQL (Structured Query Language). In the previous chapter, we stored and read data from a file. In this chapter, we will read structured data, design access to the data, and create query interfaces for databases.

Data has been stored in RDBMS format for years. The reasons behind this are as follows:

RDBMS is one of the safest ways to store, manage, and retrieve data.

They are backed by a solid mathematical foundation (relational algebra and calculus) and they expose an efficient and intuitive declarative language – SQL – for easy interaction.

Almost every language has a rich set of libraries to interact with different RDBMS, and the tricks and methods of using them are well tested and well understood.

Scaling an RDBMS is a pretty well-understood task and there are a bunch of well-trained, experienced professionals to do this job (DBAs or database administrators).

As we can see in the following chart, the market for DBMS is big. This chart was produced based on market research that was done by Gartner, Inc. in 2016:
Figure 8.1 Commercial database market share in 2016

We will learn and play around with some basic and fundamental concepts of databases and database management systems in this chapter.

Refresher of RDBMS and SQL

An RDBMS is a piece of software that manages data (represented for the end user in a tabular form) on physical hard disks and is built using Codd's relational model. Most of the databases that we encounter today are RDBMS. In recent years, there has been a huge industry shift toward a newer kind of database management system, called NoSQL (MongoDB, CouchDB, Riak, and so on). These systems, although in some aspects they follow some of the rules of RDBMS, in most cases reject or modify them.

HOW IS AN RDBMS STRUCTURED?

The RDBMS structure consists of three main elements, namely the storage engine, query engine, and log management. Here is a diagram that shows the structure of an RDBMS:

Figure 8.2 RDBMS structure

The following are the main concepts of any RDBMS structure:

Storage engine: This is the part of the RDBMS that is responsible for storing the data in an efficient way and also for giving it back, when asked for, in an efficient way. As an end user of the RDBMS system (an application developer is considered an end user of an RDBMS), we will never need to interact with this layer directly.

Query engine: This is the part of the RDBMS that allows us to create data objects (tables, views, and so on), manipulate them (create and delete columns, create/delete/update rows, and so on), and query them (read rows) using a simple yet powerful language.

Log management: This part of the RDBMS is responsible for creating and maintaining the logs. If you are wondering why the log is such an important thing, then you should look into how replication and partitions are handled in a modern RDBMS (such as PostgreSQL) using something called the Write Ahead Log (or WAL for short).

We will focus on the query engine in this chapter.

SQL
Structured Query Language, or SQL (pronounced sequel), as it is commonly known, is a domain-specific language that was originally designed based on E.F. Codd's relational model and is widely used in today's databases to define, insert, manipulate, and retrieve data from them. It can be further sub-divided into four smaller sub-languages, namely DDL (Data Definition Language), DML (Data Manipulation Language), DQL (Data Query Language), and DCL (Data Control Language). There are several advantages of using SQL, with some of them being as follows:

It is based on a solid mathematical framework and thus it is easy to understand.

It is a declarative language, which means that we actually never tell it how to do its job. We almost always tell it what to do. This frees us from a big burden of writing custom code for data management. We can be more focused on the actual query problem we are trying to solve instead of bothering about how to create and maintain a data store.

It gives you a fast and readable way to deal with data.

SQL gives you out-of-the-box ways to get multiple pieces of data with a single query.

The main areas of focus for the following topic will be DDL, DML, and DQL. The DCL part is more for database administrators.

DDL: This is how we define our data structure in SQL. As RDBMS is mainly designed and built with structured data in mind, we have to tell an RDBMS engine beforehand what our data is going to look like. We can update this definition at a later point in time, but an initial one is a must. This is where we will write statements such as CREATE TABLE, DROP TABLE, or ALTER TABLE.

Note

Notice the use of uppercase letters. It is not a specification and you can use lowercase letters, but it is a widely followed convention and we will use that in this book.

DML: DML is the part of SQL that lets us insert, delete, or update a certain data point (a row) in a previously defined data object (a table). This is the part of SQL that contains statements such as INSERT INTO, DELETE FROM, or UPDATE.

DQL: With DQL, we enable ourselves to query the data stored in an RDBMS, which was defined by DDL and inserted using DML. It gives us enormous power and flexibility to not only query data out of a single object (table), but also to extract relevant data from all the related objects using queries. The most frequently used query for retrieving data is the SELECT command. We will also see and use the concepts of the primary key, foreign key, index, joins, and so on (a minimal end-to-end sketch of these three sub-languages, driven from Python, follows this list).
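To make the three sub-languages concrete, here is a minimal sketch using Python's built-in sqlite3 module; the table name, columns, and database file name are illustrative assumptions, not part of the chapter's later exercises:

import sqlite3

with sqlite3.connect("refresher_demo.db") as conn:
    cursor = conn.cursor()
    # DDL: define the structure of the table
    cursor.execute("CREATE TABLE IF NOT EXISTS user (email TEXT, name TEXT, age INTEGER)")
    # DML: insert a row into the table
    cursor.execute("INSERT INTO user VALUES ('bob@example.com', 'Bob', 24)")
    # DQL: query the rows back
    cursor.execute("SELECT * FROM user")
    print(cursor.fetchall())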

Once you define and insert data in a database, it can be represented as follows:
Figure 8.3 Table displaying sample data

Another thing to remember about RDBMS is relations. Generally, in a table, we have one or more columns that will have unique values for each row in the table. We call them primary keys for the table. We should be aware that we will encounter unique values across the rows which are not primary keys. The main difference between them and primary keys is the fact that a primary key cannot be null.

By using the primary key of one table and mentioning it as a foreign key in another table, we can establish relations between two tables. A certain table can be related to any finite number of tables. The relations can be 1:1, which means that each row of the second table is uniquely related to one row of the first table, or 1:N, N:1, or N:M. An example of relations is as follows:
Figure 8.4 Diagram showing relations
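In SQL terms, such a relation could be sketched like this; the table and column names are purely illustrative and not taken from the figure:

import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database for the sketch
cursor = conn.cursor()
# One row per user; email acts as the primary key
cursor.execute("CREATE TABLE user (email TEXT PRIMARY KEY, name TEXT)")
# Each comment row points back to exactly one user row (a 1:N relation)
cursor.execute("""CREATE TABLE comment (
                      email TEXT,
                      text TEXT,
                      FOREIGN KEY (email) REFERENCES user (email))""")
conn.close()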

With this brief refresher, we are now ready to jump into hands-on exercises and write some SQL to store and retrieve data.

Using an RDBMS (MySQL/PostgreSQL/SQLite)

In this topic, we will focus on how to write some basic SQL commands, as well as how to connect to a database from Python and use it effectively within Python. The database we will choose here is SQLite. There are other databases, such as Oracle, MySQL, Postgresql, and DB2. The main tricks that you are going to learn here will not change based on what database you are using. But for different databases, you will need to install different third-party Python libraries (such as Psycopg2 for Postgresql, and so on). The reason they all behave the same way (apart from some small details) is the fact that they all adhere to PEP 249 (commonly known as Python DB API 2).

This is a good standardization and saves us a lot of headaches while porting from one RDBMS to another.
Note
Most of the industry-standard projects that are written in Python and use some kind of RDBMS as the data store most often rely on an ORM, or Object Relational Mapper. An ORM is a high-level library in Python that makes many tasks, while dealing with an RDBMS, easier. It also exposes a more Pythonic API than writing raw SQL inside Python code.
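Because every PEP 249-compliant driver exposes the same connect/cursor/execute/commit pattern, switching databases mostly means changing the connect() call. The following is a minimal sketch of that shared pattern, not part of the exercises that follow; the PostgreSQL credentials shown in the comments are placeholders and assume you have psycopg2 installed and a server running:

# The DB API 2.0 (PEP 249) pattern: connect -> cursor -> execute -> commit
import sqlite3

conn = sqlite3.connect("chapter.db")   # embedded file database, no server needed
cursor = conn.cursor()
cursor.execute("SELECT 1")             # any SQL statement goes through execute()
print(cursor.fetchone())
conn.commit()
conn.close()

# With PostgreSQL, only the connect call would change (hypothetical credentials):
# import psycopg2
# conn = psycopg2.connect(dbname="chapter", user="me", password="secret")
# ...the cursor/execute/commit calls stay the same.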

EXERCISE 107: CONNECTING TO A DATABASE IN SQLITE
In this exercise, we will look into the first step toward using an RDBMS in Python code. All we are going to do is connect to a database and then close the connection. We will also learn about the best way to do this:

1. Import the sqlite3 library of Python by using the following command:

import sqlite3

2. Use the connect function to connect to a database. If you already have some experience with databases, then you will notice that we are not using any server address, user name, password, or other credentials to connect to a database. This is because these fields are not mandatory in sqlite3, unlike in PostgreSQL or MySQL. The main database engine of SQLite is embedded:

conn = sqlite3.connect("chapter.db")

3. Close the connection, as follows:

conn.close()

This conn object is the main connection object, and we will need it to get a second type of object later, once we want to interact with the database. We need to be careful about closing any open connection to our database.

4. Use the same with statement from Python, just like we did for files, and connect to the database, as follows:

with sqlite3.connect("chapter.db") as conn:
    pass

In this exercise, we have connected to a database using Python.

EXERCISE 108: DDL AND DML COMMANDS IN SQLITE
In this exercise, we will look at how we can create a table, and we will also insert data into it.

As the name suggests, DDL (Data Definition Language) is the way to communicate to the database engine in advance to define what the data will look like. The database engine creates a table object based on the definition provided and prepares it.

To create a table in SQL, use the CREATE TABLE SQL clause. This will need the table name and the table definition. The table name is a unique identifier for the database engine to find and use the table for all future transactions. It can be anything (any alphanumeric string), as long as it is unique. We add the table definition in the form of (column_name_1 data_type, column_name_2 data_type, …). For our purpose, we will use the text and integer data types, but usually a standard database engine supports many more data types, such as float, double, date time, Boolean, and so on. We will also need to specify a primary key. A primary key is a unique, non-null identifier that's used to uniquely identify a row in a table. In our case, we use email as the primary key. A primary key can be an integer or text.

The last thing you need to know is that unless you call a commit on the series of operations you just performed (together, we formally call them a transaction), nothing will actually be performed and reflected in the database. This property is called atomicity. In fact, for a database to be industry standard (to be usable in real life), it needs to follow the ACID (Atomicity, Consistency, Isolation, Durability) properties:

1. Use SQLite's connect function to connect to the chapter.db database, as follows:

with sqlite3.connect("chapter.db") as conn:

Note

This code will work once you add the snippet from step 3.

2. Create a cursor object by calling conn.cursor(). The cursor object acts as a medium to communicate with the database. Create a table in Python, as follows:

cursor = conn.cursor()

cursor.execute("CREATE TABLE IF NOT EXISTS user (email text, first_name text, last_name text, address text, age integer, PRIMARY KEY (email))")

3. Insert rows into the table that you created, as follows:

cursor.execute("INSERT INTO user VALUES ('bob@example.com', 'Bob', 'Codd', '123 Fantasy lane, Fantasy City', 31)")

cursor.execute("INSERT INTO user VALUES ('tom@web.com', 'Tom', 'Fake', '456 Fantasy lane, Fantasy City', 39)")

4. Commit to the database:

conn.commit()

This will create the table and write two rows of data to it.
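Since nothing is persisted until commit is called, a half-finished transaction can also be discarded. The following is a small illustrative sketch (not part of the exercise) that assumes the user table created above already exists:

import sqlite3

conn = sqlite3.connect("chapter.db")
cursor = conn.cursor()
# Insert a throw-away row inside the current transaction
cursor.execute("INSERT INTO user VALUES ('temp@example.com', 'Temp', 'User', 'Nowhere', 0)")
# Change our mind: roll back instead of committing
conn.rollback()
# The row was never persisted, so this prints an empty list
print(cursor.execute("SELECT * FROM user WHERE email='temp@example.com'").fetchall())
conn.close()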
READING DATA FROM A DATABASE IN SQLITE
In the preceding exercise, we created a table and stored data in it. Now, we will learn how to read the data that's stored in this database.

The SELECT clause is immensely powerful, and it is really important for a data practitioner to master SELECT and everything related to it (such as conditions, joins, group-by, and so on).

The * after SELECT tells the engine to select all of the columns from the table. It is a useful shorthand. We have not mentioned any condition for the selection (such as above a certain age, first name starting with a certain sequence of letters, and so on). We are practically telling the database engine to select all the rows and all the columns from the table. This is time-consuming and less effective if we have a huge table. Hence, we would want to use the LIMIT clause to limit the number of rows we want.

You can use the SELECT clause in SQL to retrieve data, as follows:

with sqlite3.connect("chapter.db") as conn:
    cursor = conn.cursor()
    rows = cursor.execute('SELECT * FROM user')
    for row in rows:
        print(row)

The output is as follows:

Figure 8.5: Output of the SELECT clause

The syntax to use the SELECT clause with a LIMIT is as follows:

SELECT * FROM <table_name> LIMIT 50;

Note
This syntax is sample code and will not work in a Jupyter notebook.

This will select all the columns, but only the first 50 rows from the table.

EXERCISE 109: SORTING VALUES THAT ARE PRESENT IN THE DATABASE
In this exercise, we will use the ORDER BY clause to sort the rows of the user table with respect to age:

1. Sort the data in chapter.db by age in descending order, as follows:

with sqlite3.connect("chapter.db") as conn:
    cursor = conn.cursor()
    rows = cursor.execute('SELECT * FROM user ORDER BY age DESC')
    for row in rows:
        print(row)

The output is as follows:

Figure 8.6: Output of data displaying age in descending order

2. Sort the data in chapter.db by age in ascending order, as follows:

with sqlite3.connect("chapter.db") as conn:
    cursor = conn.cursor()
    rows = cursor.execute('SELECT * FROM user ORDER BY age')
    for row in rows:
        print(row)

The output is as follows:

Figure 8.7: Output of data displaying age in ascending order

Notice that we don't need to specify the order as ASC to sort in ascending order.
EXERCISE 110: ALTERING THE STRUCTURE OF A TABLE AND UPDATING THE NEW FIELDS
In this exercise, we are going to add a column using ALTER and then UPDATE the values in the newly added column.

The UPDATE command is used to edit/update any row after it has been inserted. Be careful when using it, because using UPDATE without selective clauses (such as WHERE) affects the entire table:

1. Establish the connection with the database by using the following command:

with sqlite3.connect("chapter.db") as conn:
    cursor = conn.cursor()

2. Add another column to the user table and fill it with null values by using the following command:

cursor.execute("ALTER TABLE user ADD COLUMN gender text")

3. Update all of the values of gender so that they are M by using the following command:

cursor.execute("UPDATE user SET gender='M'")
conn.commit()

4. To check the altered table, execute the following command:

rows = cursor.execute('SELECT * FROM user')
for row in rows:
    print(row)

Figure 8.8: Output after altering the table

We have updated the entire table by setting the gender of all the users to M, where M stands for male.

EXERCISE 111: GROUPING VALUES IN TABLES
In this exercise, we will learn about a concept that we have already encountered in pandas: the GROUP BY clause. The GROUP BY clause is a technique that's used to retrieve distinct values from the database and place them in individual buckets.

The following diagram explains how the GROUP BY clause works:

Figure 8.9: Illustration of the GROUP BY clause on a table

In the preceding diagram, we can see that the Col3 column has only two unique values across all rows, A and B.

The command that's used to check the total number of rows belonging to each group is as follows:

SELECT count(*), col3 FROM table1 GROUP BY col3

Add a female user to the table and then group the users by gender:

1. Add a female user to the table:

cursor.execute("INSERT INTO user VALUES ('shelly@www.com', 'Shelly', 'Milar', '123, Ocean View Lane', 39, 'F')")

2. Run the following code to see the count for each gender:

rows = cursor.execute("SELECT COUNT(*), gender FROM user GROUP BY gender")
for row in rows:
    print(row)

The output is as follows:

Figure 8.10: Output of the GROUP BY clause

RELATION MAPPING IN DATABASES
We have been working with a single table, altering it as well as reading back the data. However, the real power of an RDBMS comes from the handling of relationships among different objects (tables). In this section, we are going to create a new table called comments and link it with the user table in a 1:N relationship. This means that one user can have multiple comments. The way we are going to do this is by adding the user table's primary key as a foreign key in the comments table. This will create a 1:N relationship.

When we link two tables, we need to specify to the database engine what should be done if the parent row, which has many children in the other table, is deleted. As we can see in the following diagram, we are asking what happens at the place of the question marks when we delete row 1 of the user table:

Figure 8.11: Illustration of relations

In a non-RDBMS situation, this can quickly become difficult and messy to manage and maintain. However, with an RDBMS, all we have to do is tell the database engine, in a very precise way, what to do when a situation like this occurs. The database engine will do the rest for us. We use ON DELETE to tell the engine what to do with all the rows of a table when the parent row gets deleted. The following code illustrates these concepts:

with sqlite3.connect("chapter.db") as conn:
    cursor = conn.cursor()
    cursor.execute("PRAGMA foreign_keys = 1")
    sql = """
    CREATE TABLE comments (
        user_id text,
        comments text,
        FOREIGN KEY (user_id) REFERENCES user (email)
        ON DELETE CASCADE ON UPDATE NO ACTION
    )
    """
    cursor.execute(sql)
    conn.commit()

The ON DELETE CASCADE line informs the database engine that we want to delete all the child rows when the parent gets deleted. We can also define actions for UPDATE. In this case, there is nothing to do on UPDATE.

The FOREIGN KEY modifier modifies a column definition (user_id, in this case) and marks it as a foreign key, which is related to the primary key (email, in this case) of another table.

You may notice the strange-looking cursor.execute("PRAGMA foreign_keys = 1") line in the code. It is there because SQLite does not enable the normal foreign key features by default. It is this line that enables that feature. It is specific to SQLite and we won't need it for any other databases.
ADDING ROWS IN THE COMMENTS TABLE
We have created a table called comments. In this section, we will dynamically generate an insert query, as follows:

with sqlite3.connect("chapter.db") as conn:
    cursor = conn.cursor()
    cursor.execute("PRAGMA foreign_keys = 1")
    sql = "INSERT INTO comments VALUES ('{}', '{}')"
    rows = cursor.execute('SELECT * FROM user ORDER BY age')
    for row in rows:
        email = row[0]
        print("Going to create rows for {}".format(email))
        name = row[1] + " " + row[2]
        for i in range(10):
            comment = "This is comment {} by {}".format(i, name)
            conn.cursor().execute(sql.format(email, comment))
    conn.commit()

Pay attention to how we dynamically generate the insert query so that we can insert 10 comments for each user.
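As a side note (not part of the preceding code), building SQL strings with format() works here because we control the data ourselves, but the DB API also supports parameterized queries, which are safer when values come from outside sources. A minimal sketch of the same insert using SQLite's ? placeholders, intended as a drop-in replacement for the two lines inside the inner loop above:

sql = "INSERT INTO comments VALUES (?, ?)"
# the driver substitutes and escapes the values for us
conn.cursor().execute(sql, (email, comment))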

JOINS
In this section, we will learn how to exploit the relationship we just built. This means that if we have the primary key from one table, we can recover all the data needed from that table and also all the linked rows from the child table. To achieve this, we will use something called a join.

A join is basically a way to retrieve linked rows from two tables using any kind of primary key-foreign key relation that they have. There are many types of join, such as INNER, LEFT OUTER, RIGHT OUTER, FULL OUTER, and CROSS. They are used in different situations. However, most of the time, in simple 1:N relations, we end up using an INNER join. In Chapter 1, Introduction to Data Wrangling with Python, we learned about sets; we can view an INNER JOIN as an intersection of two sets. The following diagram illustrates the concept:

Figure 8.12: Intersection Join

Here, A represents one table and B represents another. The meaning of having common members is to have a relationship between them. The join takes all of the rows of A and compares them with all of the rows of B to find the matching rows that satisfy the join predicate. This can quickly become a complex and time-consuming operation. Joins can be very expensive operations. Usually, we use some kind of WHERE clause, after we specify the join, to shorten the scope of rows that are fetched from table A or B to perform the matching.

In our case, our first table, user, has three entries, with the primary key being the email. We can make use of this in our query to get comments just from Bob:
with sqlite3.connect("chapter.db") as conn:
    cursor = conn.cursor()
    cursor.execute("PRAGMA foreign_keys = 1")
    sql = """
    SELECT * FROM comments
    JOIN user ON comments.user_id = user.email
    WHERE user.email='bob@example.com'
    """
    rows = cursor.execute(sql)
    for row in rows:
        print(row)

The output is as follows:

('bob@example.com', 'This is comment 0 by Bob Codd', 'bob@example.com', 'Bob', 'Codd', '123 Fantasy lane, Fantasy City', 31, None)
('bob@example.com', 'This is comment 1 by Bob Codd', 'bob@example.com', 'Bob', 'Codd', '123 Fantasy lane, Fantasy City', 31, None)
('bob@example.com', 'This is comment 2 by Bob Codd', 'bob@example.com', 'Bob', 'Codd', '123 Fantasy lane, Fantasy City', 31, None)
('bob@example.com', 'This is comment 3 by Bob Codd', 'bob@example.com', 'Bob', 'Codd', '123 Fantasy lane, Fantasy City', 31, None)
('bob@example.com', 'This is comment 4 by Bob Codd', 'bob@example.com', 'Bob', 'Codd', '123 Fantasy lane, Fantasy City', 31, None)
('bob@example.com', 'This is comment 5 by Bob Codd', 'bob@example.com', 'Bob', 'Codd', '123 Fantasy lane, Fantasy City', 31, None)
('bob@example.com', 'This is comment 6 by Bob Codd', 'bob@example.com', 'Bob', 'Codd', '123 Fantasy lane, Fantasy City', 31, None)
('bob@example.com', 'This is comment 7 by Bob Codd', 'bob@example.com', 'Bob', 'Codd', '123 Fantasy lane, Fantasy City', 31, None)
('bob@example.com', 'This is comment 8 by Bob Codd', 'bob@example.com', 'Bob', 'Codd', '123 Fantasy lane, Fantasy City', 31, None)
('bob@example.com', 'This is comment 9 by Bob Codd', 'bob@example.com', 'Bob', 'Codd', '123 Fantasy lane, Fantasy City', 31, None)

Figure 8.13: Output of the Join query
RETRIEVING SPECIFIC COLUMNS FROM A JOIN QUERY
In the previous section, we saw that we can use a JOIN to fetch the related rows from two tables. However, if we look at the results, we will see that it returned all the columns, thus combining both tables. This is not very concise. What if we only want to see the emails and the related comments, and not all the data?

There is some nice shorthand that lets us do this:

with sqlite3.connect("chapter.db") as conn:
    cursor = conn.cursor()
    cursor.execute("PRAGMA foreign_keys = 1")
    sql = """
    SELECT comments.* FROM comments
    JOIN user ON comments.user_id = user.email
    WHERE user.email='bob@example.com'
    """
    rows = cursor.execute(sql)
    for row in rows:
        print(row)

Just by changing the SELECT statement, we made our final result look as follows:

('bob@example.com', 'This is comment 0 by Bob Codd')
('bob@example.com', 'This is comment 1 by Bob Codd')
('bob@example.com', 'This is comment 2 by Bob Codd')
('bob@example.com', 'This is comment 3 by Bob Codd')
('bob@example.com', 'This is comment 4 by Bob Codd')
('bob@example.com', 'This is comment 5 by Bob Codd')
('bob@example.com', 'This is comment 6 by Bob Codd')
('bob@example.com', 'This is comment 7 by Bob Codd')
('bob@example.com', 'This is comment 8 by Bob Codd')
('bob@example.com', 'This is comment 9 by Bob Codd')
EXERCISE 112: DELETING ROWS
In this exercise, we are going to delete a row from the user table and observe the effects it will have on the comments table. Be very careful when running this command, as it can have a destructive effect on the data. Please keep in mind that it should almost always be run with a WHERE clause, so that we delete just a part of the data and not everything:

1. To delete a row from a table, we use the DELETE clause in SQL. To run delete on the user table, we are going to use the following code:

with sqlite3.connect("chapter.db") as conn:
    cursor = conn.cursor()
    cursor.execute("PRAGMA foreign_keys = 1")
    cursor.execute("DELETE FROM user WHERE email='bob@example.com'")
    conn.commit()

2. Perform the SELECT operation on the user table:

with sqlite3.connect("chapter.db") as conn:
    cursor = conn.cursor()
    cursor.execute("PRAGMA foreign_keys = 1")
    rows = cursor.execute("SELECT * FROM user")
    for row in rows:
        print(row)

Observe that the user Bob has been deleted.

Now, moving on to the comments table, we have to remember that we mentioned ON DELETE CASCADE while creating the table. The database engine knows that if a row is deleted from the parent table (user), all the related rows from the child tables (comments) will have to be deleted.

3. Perform a select operation on the comments table by using the following command:

with sqlite3.connect("chapter.db") as conn:
    cursor = conn.cursor()
    cursor.execute("PRAGMA foreign_keys = 1")
    rows = cursor.execute("SELECT * FROM comments")
    for row in rows:
        print(row)

The output is as follows:

('tom@web.com', 'This is comment 0 by Tom Fake')
('tom@web.com', 'This is comment 1 by Tom Fake')
('tom@web.com', 'This is comment 2 by Tom Fake')
('tom@web.com', 'This is comment 3 by Tom Fake')
('tom@web.com', 'This is comment 4 by Tom Fake')
('tom@web.com', 'This is comment 5 by Tom Fake')
('tom@web.com', 'This is comment 6 by Tom Fake')
('tom@web.com', 'This is comment 7 by Tom Fake')
('tom@web.com', 'This is comment 8 by Tom Fake')
('tom@web.com', 'This is comment 9 by Tom Fake')

We can see that all of the rows related to Bob have been deleted.

UPDATING SPECIFIC VALUES IN A TABLE
In this section, we will see how we can update rows in a table. We have already looked at this in the past but, as we mentioned, only at the table level. Without WHERE, updating is often a bad idea.

Combine UPDATE with WHERE to selectively update the first name of the user with the email address tom@web.com:

with sqlite3.connect("chapter.db") as conn:
    cursor = conn.cursor()
    cursor.execute("PRAGMA foreign_keys = 1")
    cursor.execute("UPDATE user SET first_name='Chris' WHERE email='tom@web.com'")
    conn.commit()
    rows = cursor.execute("SELECT * FROM user")
    for row in rows:
        print(row)

The output is as follows:

Figure 8.14: Output of the update query


EXERCISE 113: RDBMS AND DATAFRAMES
We have looked into many fundamental aspects of storing and querying data from a database, but as data wrangling experts, we need our data to be packed and presented as a DataFrame so that we can perform quick and convenient operations on it:

1. Import pandas using the following code:

import pandas as pd

2. Create a columns list with email, first name, last name, age, gender, and comments as column names. Also, create an empty data list:

columns = ["Email", "First Name", "Last Name", "Age", "Gender", "Comments"]
data = []

3. Connect to chapter.db using SQLite and obtain a cursor, as follows:

with sqlite3.connect("chapter.db") as conn:
    cursor = conn.cursor()

Use the execute method of the cursor to set "PRAGMA foreign_keys = 1":

cursor.execute("PRAGMA foreign_keys = 1")

4. Create a sql variable that will contain the SELECT command and use the JOIN command to join the tables:

sql = """
SELECT user.email, user.first_name, user.last_name, user.age, user.gender, comments.comments FROM comments
JOIN user ON comments.user_id = user.email
WHERE user.email = 'tom@web.com'
"""

5. Use the execute method of the cursor to execute the sql command:

rows = cursor.execute(sql)

6. Append the rows to the data list:

for row in rows:
    data.append(row)

7. Create a DataFrame using the data list:

df = pd.DataFrame(data, columns=columns)

8. We have created the DataFrame using the data list. You can print the values in the DataFrame using df.head().
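As an aside (not part of the exercise), pandas can also build the DataFrame directly from a query with its read_sql_query function, which saves the manual row-appending loop. A minimal sketch, assuming the same chapter.db and the same query as above:

import sqlite3
import pandas as pd

sql = """
SELECT user.email, user.first_name, user.last_name, user.age, user.gender, comments.comments
FROM comments
JOIN user ON comments.user_id = user.email
WHERE user.email = 'tom@web.com'
"""

with sqlite3.connect("chapter.db") as conn:
    # pandas runs the query and packs the result set straight into a DataFrame
    df = pd.read_sql_query(sql, conn)

print(df.head())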

ACTIVITY 11: RETRIEVING DATA CORRECTLY FROM DATABASES
In this activity, we have the persons table:

Figure 8.15: The persons table

We have the pets table:

Figure 8.16: The pets table

As we can see, the id column in the persons table (which is an integer) serves as the primary key for that table and as a foreign key for the pets table, which is linked via the owner_id column.

The persons table has the following columns:

first_name: The first name of the person
last_name: The last name of the person (can be "null")
age: The age of the person
city: The city he/she is from
zip_code: The zip code of the city

The pets table has the following columns:

pet_name: The name of the pet.
pet_type: What type of pet it is, for example, cat, dog, and so on. Due to a lack of further information, we do not know which number represents what, but it is an integer and can be null.
treatment_done: It is also an integer column, and 0 here represents "No", whereas 1 represents "Yes".

The name of the SQLite DB is petsdb and it is supplied along with the Activity notebook.

These steps will help you complete this activity:

1. Connect to petsdb and check whether the connection has been successful.
2. Find the different age groups in the persons database.
3. Find the age group that has the maximum number of people.
4. Find the people who do not have a last name.
5. Find out how many people have more than one pet.
6. Find out how many pets have received treatment.
7. Find out how many pets have received treatment and the type of pet is known.
8. Find out how many pets are from the city called east port.
9. Find out how many pets are from the city called east port and received a treatment.

Note

The solution for this activity can be found on page 324.

Summary
We have come to the end of the database chapter. We have learned how to connect to SQLite using Python. We have brushed up on the basics of relational databases and learned how to open and close a database. We then learned how to export this relational database into Python DataFrames.

In the next chapter, we will be performing data wrangling on real-world datasets.
Chapter 9
Application of Data Wrangling in Real Life
Learning Objectives
By the end of this chapter, you will be able to:

Perform data wrangling on multiple full-fledged datasets from renowned sources

Create a unified dataset that can be passed on to a data science team for machine learning and predictive analytics

Relate data wrangling to version control, containerization, cloud services for data analytics, and big data technologies such as Apache Spark and Hadoop

In this chapter, you will apply your gathered knowledge to real-life datasets and investigate various aspects of them.

Introduction
We learned about databases in the previous chapter, so now it is time to combine the knowledge of data wrangling and Python with a real-world scenario. In the real world, data from one source is often inadequate to perform analysis. Generally, a data wrangler has to distinguish between relevant and non-relevant data and combine data from different sources.

The primary job of a data wrangling expert is to pull data from multiple sources, format and clean it (impute the data if it is missing), and finally combine it in a coherent manner to prepare a dataset for further analysis by data scientists or machine learning engineers.

In this topic, we will try to mimic such a typical task flow by downloading and using two different datasets from reputed web portals. Each of the datasets contains partial data pertaining to the key question that is being asked. Let's examine it more closely.

Applying Your Knowledge to a Real-life Data Wrangling Task
Suppose you are asked this question: In India, did the enrollment in primary/secondary/tertiary education increase with the improvement of per capita GDP in the past 15 years? The actual modeling and analysis will be done by a senior data scientist, who will use machine learning and data visualization for the analysis. As a data wrangling expert, your job will be to acquire and provide a clean dataset that contains educational enrollment and GDP data side by side.

Suppose you have a link for a dataset from the United Nations and you can download the dataset of education (for all the nations around the world). But this dataset has some missing values and, moreover, it does not have any GDP information. Someone has also given you another separate CSV file (downloaded from the World Bank site) which contains GDP data, but in a messy format.

In this activity, we will examine how to handle these two separate sources and clean the data to prepare a simple final dataset with the required data and save it to the local drive as a SQL database file:

Figure 9.1: Pictorial representation of the merging of education and economic data

You are encouraged to follow along with the code and results in the notebook and try to understand and internalize the nature of the data wrangling flow. You are also encouraged to try extracting various data from these files and answering your own questions about nations' socio-economic factors and their inter-relationships.

Note
Coming up with interesting questions about social, economic, technological, and geo-political topics and then answering them using freely available data and a little bit of programming knowledge is one of the most fun ways to learn about any data science topic. You will get a flavor of that process in this chapter.

Data Imputation

Clearly, we are missing some data. Let's say we decide to impute these data points by simple linear interpolation between the available data points. We could take out a pen and paper or a calculator and compute those values and manually create a dataset. But being data wranglers, we will of course take advantage of Python programming and use pandas imputation methods for this task.

But to do that, we first need to create a DataFrame with missing values in it; that is, we need to append another DataFrame with missing values to the current DataFrame.
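The following is a minimal sketch of that idea (with made-up years and enrollment numbers, not the actual UN data): append rows containing NaN for the missing years, sort, and let pandas fill them in by linear interpolation:

import numpy as np
import pandas as pd

# Hypothetical data: enrollment is known for 2003 and 2006, missing in between
df = pd.DataFrame({"Year": [2003, 2006], "Enrollment": [100.0, 130.0]})
missing = pd.DataFrame({"Year": [2004, 2005], "Enrollment": [np.nan, np.nan]})

df = pd.concat([df, missing]).sort_values("Year").reset_index(drop=True)
df["Enrollment"] = df["Enrollment"].interpolate()  # linear interpolation by default
print(df)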

Activity 12: Data Wrangling Task – Fixing UN Data
Suppose the agenda of the data analysis is to find out whether the enrollment in primary, secondary, or tertiary education has increased with the improvement of per capita GDP in the past 15 years. For this task, we will first need to clean or wrangle the two datasets, that is, the Education Enrollment and GDP data.

The UN data is available at https://github.com/TrainingByPackt/Data-Wrangling-with-Python/blob/master/Chapter09/Activity12-15/SYB61_T07_Education.csv.

Note
If you download the CSV file and open it using Excel, then you will see that the Footnotes column sometimes contains useful notes. We may not want to drop it in the beginning. If we are interested in a particular country's data (like we are in this task), then it may well turn out that Footnotes will be NaN, that is, blank. In that case, we can drop it at the end. But for some countries or regions, it may contain information.

These steps will guide you to find the solution:

1. Download the UN dataset from GitHub from the following link: https://github.com/TrainingByPackt/Data-Wrangling-with-Python/blob/master/Chapter09/Activity13/India_World_Bank_Info.csv.

The UN data has missing values. Clean the data to prepare a simple final dataset with the required data and save it to the local drive as a SQL database file.

2. Use the pd.read_csv method of pandas to create a DataFrame.
3. Since the first row does not contain useful information, skip it using the skiprows parameter.
4. Drop the region/country/area and source columns.
5. Assign the following names as columns of the DataFrame: Region/County/Area, Year, Data, Value, and Footnotes.
6. Check how many unique values are present in the Footnotes column.
7. Check the type of the Value column.
8. Create a function to convert the Value column into a floating-point number.
9. Use the apply method to apply this function to the Value column.
10. Print the unique values in the Data column.

Note:

The solution for this activity can be found on page 338.
Activity 13: Data Wrangling Task – Cleaning GDP Data
The GDP data is available on https://data.worldbank.org/ and it is also available on GitHub at https://github.com/TrainingByPackt/Data-Wrangling-with-Python/blob/master/Chapter09/Activity12-15/India_World_Bank_Info.csv.

In this activity, we will clean the GDP data:

1. Create three DataFrames from the original DataFrame using filtering. Create the df_primary, df_secondary, and df_tertiary DataFrames for students enrolled in primary education, secondary education, and tertiary education in thousands, respectively.
2. Plot bar charts of the enrollment of primary students in a low-income country like India and a higher-income country like the USA.
3. Since there is missing data, use pandas imputation methods to impute these data points by simple linear interpolation between data points. To do that, create a DataFrame with missing values inserted and append a new DataFrame with missing values to the current DataFrame.
4. (For India) Append the rows corresponding to the missing years – 2004 – 2009 and 2011 – 2013.
5. Create a dictionary of values with np.nan. Note that there are 9 missing data points, so we need to create a list with identical values repeated 9 times.
6. Create a DataFrame of missing values (from the preceding dictionary) that we can append.
7. Append the DataFrames together.
8. Sort by Year and reset the indices using reset_index. Use inplace=True to execute the changes on the DataFrame itself.
9. Use the interpolate method for linear interpolation. It fills all the NaNs with linearly interpolated values. See the following link for more details about this method: http://pandas.pydata.org/pandas-docs/version/0.17/generated/pandas.DataFrame.interpolate.html.
10. Repeat the same steps for the USA (or other countries).
11. If there are values that are unfilled, use the limit and limit_direction parameters with the interpolate method to fill them in.
12. Plot the final graph using the new data.
13. Read the GDP data using the pandas read_csv method. It will generally throw an error.
14. To avoid errors, try the error_bad_lines = False option.
15. Since there is no delimiter in the file, add the \t delimiter.
16. Use the skiprows parameter to remove rows that are not useful.
17. Examine the dataset. Filter the dataset for information that states that it is similar to the previous education dataset.
18. Reset the index for this new dataset.
19. Drop the rows that are not useful and re-index the dataset.
20. Rename the columns properly. This is necessary for merging the two datasets.
21. We will concentrate only on the data from 2003 to 2016. Eliminate the remaining data.
22. Create a new DataFrame called df_gdp with rows 43 to 56.

Note

The solution for this activity can be found on page 338.

Activity 14: Data Wrangling Task – Merging UN Data and GDP Data
The steps to merge the two databases are as follows:

1. Reset the indexes for merging.
2. Merge the two DataFrames, primary_enrollment_india and df_gdp, on the Year column.
3. Drop the data, footnotes, and region/county/area columns.
4. Rearrange the columns for proper viewing and presentation.

Note

The solution for this activity can be found on page 345.

Activity 15: Data Wrangling Task – Connecting the New Data to the Database
The steps to connect the data to the database are as follows:

1. Import the sqlite3 module of Python and use the connect function to connect to the database. The main database engine is embedded. But for a different database, like PostgreSQL or MySQL, we would need to connect to it using the appropriate credentials. We designate Year as the PRIMARY KEY of this table.
2. Then, run a loop over the dataset rows one by one to insert them into the table.
3. If we look at the current folder, we should see a file called Education_GDP.db, and if we examine it using a database viewer program, we can see that the data has been transferred there.

Note

The solution for this activity can be found on page 347.

In this notebook, we examined a complete data wrangling flow, including reading data from the web and the local drive, filtering, cleaning, quick visualization, imputation, indexing, merging, and writing back to a database table. We also wrote custom functions to transform some of the data and saw how to handle situations where we may get errors when reading the file.

An Extension to Data Wrangling
This is the concluding chapter of our book, where we want to give you a broad overview of some of the exciting technologies and frameworks that you may need to learn beyond data wrangling to work as a full-stack data scientist. Data wrangling is an essential part of the whole data science and analytics pipeline, but it is not the whole enterprise. You have learned invaluable skills and techniques in this book, but it is always good to broaden your horizons and look beyond to see what other tools out there can give you an edge in this competitive and ever-changing world.

ADDITIONAL SKILLS REQUIRED TO BECOME A DATA SCIENTIST
To practice as a fully qualified data scientist/analyst, you should have some basic skills in your repertoire, irrespective of the particular programming language you choose to focus on. These skills and know-hows are language agnostic and can be utilized with any framework that you have to embrace, depending on your organization and business needs. We describe them in brief here:

Git and version control: Git is to version control what an RDBMS is to data storage and querying. It simply means that there is a huge gap between the pre- and post-Git eras of version controlling your code. As you may have noticed, all the notebooks for this book are hosted on GitHub, and this was done to take advantage of the powerful Git VCS. It gives you, out of the box, version control, history, branching facilities for different code, merging of different code branches, and advanced operations like cherry-picking, diff, and so on. It is a very essential tool to master, as you can be almost sure that you will face it at some point in your journey. Packt has a very good book on it; you can check that out for more information.

Linux command line: People coming from a Windows background (or even Mac, if you have not done any development before) are usually not very familiar with the command line. The polished UIs of those OSes hide the low-level details of interacting with the OS using a command line. However, as a data professional, it is important that you know the command line well. The number of operations you can perform simply from the command line is astonishing.

SQL and basic relational database concepts: We dedicated an entire chapter to SQL and RDBMS. However, as we already mentioned there, it was really not enough. This is a vast subject and needs years of study to master. Try to read more about it (including theory and practice) from books and online sources. Do not forget that, despite all the other sources of data being used nowadays, we still have enormous amounts of structured data stored in legacy database systems. You can be sure to come across one, sooner or later.

Docker and containerization: Since its first release in 2013, Docker has changed the way we distribute and deploy software in server-based applications. It gives you a clean and lightweight abstraction over the underlying OS and lets you iterate fast on development without the headache of creating and maintaining a proper environment. It is very useful in both the development and production phases. With virtually no competitor present, it is becoming the industry default very fast. We strongly advise you to explore it in great detail.

BASIC FAMILIARITY WITH BIG DATA AND CLOUD TECHNOLOGIES
Big data and cloud platforms are the latest trend. We will introduce them here with one or two short sentences each, and we encourage you to go ahead and learn about them as much as you can. If you are planning to grow as a data professional, then you can be sure that without these necessary skills it will be hard for you to transition to the next level:

Fundamental characteristics of big data: Big data is simply data that is very big in size. The term size is a bit ambiguous here. It can mean one static chunk of data (like the detailed census data of a big country like India or the US) or data that is dynamically generated as time passes, and each time it is huge. To give an example of the second category, we can think of how much data is generated by Facebook per day: about 500+ terabytes per day. You can easily imagine that we will need specialized tools to deal with that amount of data. There are three different categories of big data, that is, structured, unstructured, and semi-structured. The main features that define big data are Volume, Variety, Velocity, and Variability.

Hadoop ecosystem: Apache Hadoop (and the related ecosystem) is a software framework that aims to use the Map-Reduce programming model to simplify the storage and processing of big data. It has since become one of the backbones of big data processing in the industry. The modules in Hadoop are designed keeping in mind that hardware failures are common occurrences and should be automatically handled by the framework. The four base modules of Hadoop are Common, HDFS, YARN, and MapReduce. The Hadoop ecosystem consists of Apache Pig, Apache Hive, Apache Impala, Apache Zookeeper, Apache HBase, and more. They are very important bricks in many high-demand and cutting-edge data pipelines. We encourage you to study more about them. They are essential in any industry that aims to leverage data.

Apache Spark: Apache Spark is a general-purpose cluster computing framework that was initially developed at the University of California, Berkeley, and released in 2014. It gives you an interface to program an entire cluster of computers with built-in data parallelism and fault tolerance. It contains Spark Core, Spark SQL, Spark Streaming, MLlib (for machine learning), and GraphX. It is now one of the main frameworks used in the industry to process huge amounts of data in real time based on streaming data. We encourage you to read about and master it if you want to go toward real-time data engineering.

Amazon Web Services (AWS): Amazon Web Services (often abbreviated as AWS) is a collection of managed services offered by Amazon, ranging from Infrastructure-as-a-Service, Database-as-a-Service, and Machine-Learning-as-a-Service to caches, load balancers, NoSQL databases, message queues, and several other types. They are very useful for all sorts of applications, from a simple web app to a multi-cluster data pipeline. Many famous companies run their entire infrastructure on AWS (such as Netflix). AWS gives us on-demand provisioning, easy scaling, a managed environment, a slick UI to control everything, and also a very powerful command-line client. It also exposes a rich set of APIs, and we can find an AWS API client in virtually any programming language. The Python one is called Boto3. If you are planning to become a data professional, then it can be said with near certainty that you will end up using many of these services at one point or another.

WHAT GOES WITH DATA WRANGLING?
We learned in Chapter 1, Introduction to Data Wrangling with Python, that the process of data wrangling lies in between data gathering and advanced analytics, including visualization and machine learning. However, the boundaries between these processes may not always be strict and rigid. It depends largely on the organizational culture and team composition.

Therefore, we need to be aware of not only data wrangling but also the other components of the data science platform, in order to wrangle data effectively. Even if you are performing pure data wrangling tasks, having a good grasp of how data is sourced and utilized will give you an edge in coming up with unique and efficient solutions to complex data wrangling problems, and it will enhance the value of those solutions to the machine learning scientist or the business domain expert:

Figure 9.2: Process of data wrangling

Now, we have, in fact, already laid out a solid groundwork in this book for the data platform part, assuming that it is an integral part of the data wrangling workflow. For example, we have covered web scraping, working with RESTful APIs, and database access and manipulation using Python libraries in detail.
We have also touched on basic visualization techniques and plotting functions in Python using matplotlib. However, there are other advanced statistical plotting libraries, such as Seaborn, that you can master for more sophisticated visualization for data science tasks.
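For a taste of what Seaborn adds on top of matplotlib, here is a minimal, illustrative sketch with a small made-up DataFrame (not data from this book):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical data: per-country enrollment figures over two years
df = pd.DataFrame({"country": ["India", "USA", "India", "USA"],
                   "year": [2003, 2003, 2004, 2004],
                   "enrollment": [100, 25, 110, 26]})

# One call produces a grouped bar chart with sensible default styling
sns.barplot(x="year", y="enrollment", hue="country", data=df)
plt.show()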

Business logic and domain expertise is the most varied topic, and it can only be learned on the job; it will come eventually with experience. If you have an academic background and/or work experience in any domain, such as finance, medicine and healthcare, or engineering, that knowledge will come in handy in your data science career.

The fruit of the hard work of data wrangling is realized fully in the domain of machine learning. It is the science and engineering of making machines learn patterns and insights from data for predictive analytics and intelligent, automated decision-making with a deluge of data, which cannot be analyzed efficiently by humans. Machine learning has become one of the most sought-after skills in the modern technology landscape. It has truly become one of the most exciting and promising intellectual fields, with applications ranging from e-commerce to healthcare and virtually everything in between. Data wrangling is intrinsically linked with machine learning, as it prepares the data so that it's suitable for intelligent algorithms to process. Even if you start your career in data wrangling, it could be a natural progression to move on to machine learning.

Packt has published numerous books on this topic that you should explore. In the next section, we will touch upon some approaches to adopt and Python libraries to check out to give you a boost in your learning.

TIPS AND TRICKS FOR MASTERING MACHINE LEARNING
Machine learning is difficult to start with. We have listed some structured approaches and incredible free resources that are available so that you can begin your journey:

Understand the definitions of, and the differences between, the buzzwords — artificial intelligence, machine learning, deep learning, and data science. Cultivate the habit of reading great posts or listening to expert talks on these topics, and understand their true reach and applicability to a given business problem.

Stay updated on recent trends by watching videos, reading books like The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World, reading articles, and following influential blogs such as KDnuggets, Brandon Rohrer's blog, OpenAI's blog about their research, the Towards Data Science publication on Medium, and so on.

As you learn new algorithms or concepts, pause and analyze how you can apply these machine learning concepts or algorithms in your daily work. This is the best method for learning and expanding your knowledge base.

If you choose Python as your preferred language for machine learning tasks, you have a great ML library in scikit-learn. It is the most widely used general machine learning package in the Python ecosystem. scikit-learn has a wide variety of supervised and unsupervised learning algorithms, which are exposed via a stable, consistent interface. Moreover, it is specifically designed to interface seamlessly with other popular data wrangling and numerical libraries, such as NumPy and pandas. A minimal sketch of that interface is shown after this list.

Another hot skill in today's job market is deep learning. Packt has many books on this topic, and there are excellent MOOCs available where you can study deep learning. For Python libraries, you can learn and practice with TensorFlow, Keras, or PyTorch for deep learning.
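The following is a minimal, illustrative sketch of scikit-learn's fit/predict interface, using made-up numbers rather than any dataset from this book:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: X holds per capita GDP values, y holds enrollment figures
X = np.array([[1000], [1500], [2000], [2500]])
y = np.array([100, 115, 128, 140])

model = LinearRegression()
model.fit(X, y)                    # every scikit-learn estimator exposes fit()
print(model.predict([[3000]]))     # ...and predict() for new observations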

Summary
Data is everywhere and it is all around us. In these nine chapters, we have learned how data of different types and from different sources can be cleaned, corrected, and combined. Using the power of Python and the knowledge of data wrangling, and applying the tricks and tips that you have studied in this book, you are ready to be a data wrangler.
Appendix
About
This section is included to assist the students to perform the activities in the book. It includes detailed steps that are to be performed by the students to achieve the objectives of the activities.

SOLUTION OF ACTIVITY 1: HANDLING LISTS
These are the steps to complete this activity:

1. Import the random library:

import random

2. Set the maximum number of random numbers:

LIMIT = 100

3. Use the randint function from the random library to create 100 random numbers. Tip: try getting a list with the least number of duplicates:

random_number_list = [random.randint(0, LIMIT) for x in range(0, LIMIT)]

4. Print random_number_list:

random_number_list

The sample output is as follows:

Figure 1.16: Section of output for random_number_list

5. Create a list_with_divisible_by_3 list from random_number_list, which will contain only numbers that are divisible by 3:

list_with_divisible_by_3 = [a for a in random_number_list if a % 3 == 0]
list_with_divisible_by_3

The sample output is as follows:

Figure 1.17: Section of output for random_number_list divisible by 3

6. Use the len function to measure the length of the first list and the second list, and store them in two different variables, length_of_random_list and length_of_3_divisible_list. Calculate the difference in length in a variable called difference:

length_of_random_list = len(random_number_list)
length_of_3_divisible_list = len(list_with_divisible_by_3)
difference = length_of_random_list - length_of_3_divisible_list
difference

The sample output is as follows:

62

7. Combine the tasks we have performed so far and add a loop to them. Run the loop 10 times and add the values of the difference variable to a list:

NUMBER_OF_EXPERIMENTS = 10
difference_list = []
for i in range(0, NUMBER_OF_EXPERIMENTS):
    random_number_list = [random.randint(0, LIMIT) for x in range(0, LIMIT)]
    list_with_divisible_by_3 = [a for a in random_number_list if a % 3 == 0]
    length_of_random_list = len(random_number_list)
    length_of_3_divisible_list = len(list_with_divisible_by_3)
    difference = length_of_random_list - length_of_3_divisible_list
    difference_list.append(difference)
difference_list

The sample output is as follows:

[64, 61, 67, 60, 73, 66, 66, 75, 70, 61]

8. Then, calculate the arithmetic mean (common average) for the differences in the lengths that you have:

avg_diff = sum(difference_list) / float(len(difference_list))
avg_diff

The sample output is as follows:

66.3

SOLUTION OF ACTIVITY 2: ANALYZE A MULTILINE STRING AND GENERATE THE UNIQUE WORD COUNT
These are the steps to complete this activity:

1. Create a string called multiline_text and copy the text present in the first chapter of Pride and Prejudice into it. Use Ctrl + A to select the entire text, Ctrl + C to copy it, and paste the copied text into the string:

Figure 1.18: Initializing the multiline_text string

2. Find the type of the string using the type function:

type(multiline_text)

The output is as follows:

str

3. Now, find the length of the string using the len function:

len(multiline_text)

The output is as follows:

4475

4. Use string methods to get rid of all the newlines (\n or \r) and symbols. Remove all newlines by replacing them with an empty string:

multiline_text = multiline_text.replace('\n', "")

Then, we will print and check the output:

multiline_text

The output is as follows:

Figure 1.19: The multiline_text string after removing the newlines

5. Remove the special characters and punctuation:

# remove special chars, punctuation etc.
cleaned_multiline_text = ""
for char in multiline_text:
    if char == " ":
        cleaned_multiline_text += char
    elif char.isalnum():  # using the isalnum() method of strings
        cleaned_multiline_text += char
    else:
        cleaned_multiline_text += " "

6. Check the content of cleaned_multiline_text:

cleaned_multiline_text

The output is as follows:

Figure 1.20: The cleaned_multiline_text string

7. Generate a list of all the words from the cleaned string using the following command:

list_of_words = cleaned_multiline_text.split()
list_of_words

The output is as follows:

Figure 1.21: The section of output displaying the list_of_words

8. Find the number of words:

len(list_of_words)

The output is 852.

9. Create a list from the list you just created, which includes only unique words:

unique_words_as_dict = dict.fromkeys(list_of_words)
len(list(unique_words_as_dict.keys()))

The output is 340.

10. Count the number of times each of the unique words appeared in the cleaned text:

for word in list_of_words:
    if unique_words_as_dict[word] is None:
        unique_words_as_dict[word] = 1
    else:
        unique_words_as_dict[word] += 1
unique_words_as_dict

The output is as follows:

Figure 1.22: Section of output showing unique_words_as_dict

You just created, step by step, a unique word counter using all the neat tricks that you just learned.
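As an aside (not part of the original activity), the standard library's collections.Counter implements this same frequency-counting pattern in a single call; a minimal sketch would be:

from collections import Counter

word_counts = Counter(list_of_words)   # dict-like mapping of word -> count
word_counts.most_common(25)            # the 25 most frequent words, equivalent to the sorted() call in the next step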

11. Find the top 25 words from unique_words_as_dict:

top_words = sorted(unique_words_as_dict.items(), key=lambda key_val_tuple: key_val_tuple[1], reverse=True)
top_words[:25]

The output is as follows:

Figure 1.23: Top 25 unique words from multiline_text

SOLUTION OF ACTIVITY 3: PERMUTATION, ITERATOR, LAMBDA, LIST
These are the steps to solve this activity:

1. Look up the definitions of permutations and dropwhile from itertools. There is a way to look up the definition of a function inside Jupyter itself. Just type the function name, followed by ?, and press Shift + Enter:

from itertools import permutations, dropwhile
permutations?
dropwhile?

You will see a long list of definitions after each ?. We will skip it here.

2. Write an expression to generate all the possible three-digit numbers using 0, 1, and 2:

permutations(range(3))

The output is as follows:

<itertools.permutations at 0x7f6c6c077af0>

3. Loop over the iterator expression you generated before. Use print to print each element returned by the iterator. Use assert and isinstance to make sure that the elements are tuples:

for number_tuple in permutations(range(3)):
    print(number_tuple)
    assert isinstance(number_tuple, tuple)

The output is as follows:

(0, 1, 2)
(0, 2, 1)
(1, 0, 2)
(1, 2, 0)
(2, 0, 1)
(2, 1, 0)

4. Write the loop again. But this time, use dropwhile with a lambda expression to drop any leading zeros from the tuples. As an example, (0, 1, 2) will become [1, 2]. Also, cast the output of dropwhile to a list. An extra task can be to check the actual type that dropwhile returns without casting:

for number_tuple in permutations(range(3)):
    print(list(dropwhile(lambda x: x <= 0, number_tuple)))

The output is as follows:

[1, 2]
[2, 1]
[1, 0, 2]
[1, 2, 0]
[2, 0, 1]
[2, 1, 0]

5. Write all the logic you wrote before, but this time write a separate function where you will be passing the list generated from dropwhile, and the function will return the whole number contained in the list. As an example, if you pass [1, 2] to the function, it will return 12. Make sure that the return type is indeed a number and not a string. Although this task can be achieved using other tricks, we require that you treat the incoming list as a stack in the function and generate the number there:

import math

def convert_to_number(number_stack):
    final_number = 0
    for i in range(0, len(number_stack)):
        final_number += (number_stack.pop() * (math.pow(10, i)))
    return final_number

for number_tuple in permutations(range(3)):
    number_stack = list(dropwhile(lambda x: x <= 0, number_tuple))
    print(convert_to_number(number_stack))

The output is as follows:

12.0
21.0
102.0
120.0
201.0
210.0
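The numbers print as 12.0, 21.0, and so on because math.pow always returns a float. If you would rather have plain integers, a minimal alternative sketch (not the book's solution) replaces math.pow with integer arithmetic:

def convert_to_int(number_stack):
    final_number = 0
    for i in range(len(number_stack)):
        final_number += number_stack.pop() * (10 ** i)   # integer power keeps the result an int
    return final_number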

SOLUTION OF ACTIVITY 4: DESIGN YOUR OWN CSV PARSER
These are the steps to complete this activity:

1. Import zip_longest from itertools:

from itertools import zip_longest

2. Define the return_dict_from_csv_line function so that it zips header and line (with fillvalue set to None) and packs them into a dict:

def return_dict_from_csv_line(header, line):
    # Zip them
    zipped_line = zip_longest(header, line, fillvalue=None)
    # Use dict comprehension to generate the final dict
    ret_dict = {kv[0]: kv[1] for kv in zipped_line}
    return ret_dict

3. Open the accompanying sales_record.csv file using r mode inside a with block, with open("sales_record.csv", "r") as fd. First, check that it is opened, read the first line, and use string methods to generate a list of all the column names. When you read each line, pass that line to a function along with the list of the headers. The work of the function is to construct a dict out of these two and fill up the key:values. Keep in mind that a missing value should result in None:

with open("sales_record.csv", "r") as fd:
    first_line = fd.readline()
    header = first_line.replace("\n", "").split(",")
    for i, line in enumerate(fd):
        line = line.replace("\n", "").split(",")
        d = return_dict_from_csv_line(header, line)
        print(d)
        if i > 10:
            break

The output is as follows:

Figure 2.10: Section of output

SOLUTION OF ACTIVITY 5: GENERATING STATISTICS FROM A CSV FILE
These are the steps to complete this activity:

1. Load the necessary libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

2. Read in the Boston housing dataset (given as a .csv file) from the local directory:

# Hint: The Pandas function for reading a CSV file is 'read_csv'.
# Don't forget that all functions in Pandas can be accessed by syntax like pd.{function_name}
df=pd.read_csv("Boston_housing.csv")

3. Check the first 10 records:

df.head(10)

The output is as follows:

Figure 3.23: Output displaying the first 10 records

4. Find the total number of records:

df.shape

The output is as follows:

(506, 14)

5. Create a smaller DataFrame with columns that do not include CHAS, NOX, B, and LSTAT:

df1=df[['CRIM','ZN','INDUS','RM','AGE','DIS','RAD','TAX','PTRATIO','PRICE']]

6. Check the last 7 records of the new DataFrame you just created:

df1.tail(7)

The output is as follows:

Figure 3.24: Last seven records of the DataFrame

7. Plot histograms of all the variables (columns) in the new DataFrame by using a for loop:

for c in df1.columns:
    plt.title("Plot of "+c,fontsize=15)
    plt.hist(df1[c],bins=20)
    plt.show()

The output is as follows:

Figure 3.25: Plot of all variables using a for loop

8. Crime rate could be an indicator of house price (people don't want to live in high-crime areas). Create a scatter plot of crime rate versus price:

plt.scatter(df1['CRIM'],df1['PRICE'])
plt.show()

The output is as follows:

Figure 3.26: Scatter plot of crime rate versus price

We can understand the relationship better if we plot log10(crime) versus price.

9. Create that plot of log10(crime) versus price:

plt.scatter(np.log10(df1['CRIM']),df1['PRICE'],c='red')
plt.title("Crime rate (Log) vs. Price plot", fontsize=18)
plt.xlabel("Log of Crime rate",fontsize=15)
plt.ylabel("Price",fontsize=15)
plt.grid(True)
plt.show()

The output is as follows:

Figure 3.27: Scatter plot of crime rate (Log) versus price

10. Calculate the mean rooms per dwelling:

df1['RM'].mean()

The output is 6.284634387351788.

11. Calculate the median age:

df1['AGE'].median()

The output is 77.5.

12. Calculate the average (mean) distances to five Boston employment centers:

df1['DIS'].mean()

The output is 3.795042687747034.

13. Calculate the percentage of houses with a low price (< $20,000):

# Create a Pandas series and directly compare it with 20
# You can do this because a Pandas series is basically a NumPy array and you have seen how to filter a NumPy array
low_price=df1['PRICE']<20
# This creates a Boolean array of True, False
print(low_price)
# True = 1, False = 0, so now if you take an average of this NumPy array, you will know how many 1's are there.
# That many houses are priced below 20,000. So that is the answer.
# You can convert that into a percentage by multiplying by 100
pcnt=low_price.mean()*100
print("\nPercentage of house with <20,000 price is: ",pcnt)

The output is as follows:

0      False
1      False
2      False
3      False
4      False
5      False
6      False
7      False
8       True
9       True
10      True
...
500     True
501    False
502    False
503    False
504    False
505     True
Name: PRICE, Length: 506, dtype: bool

Percentage of house with <20,000 price is:  41.50197628458498

SOLUTION OF ACTIVITY 6: WORKING WITH THE ADULT INCOME DATASET (UCI)
These are the steps to complete this activity:

1. Load the necessary libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

2. Read in the adult income dataset (given as a .csv file) from the local directory and check the first 5 records:

df = pd.read_csv("adult_income_data.csv")
df.head()

The output is as follows:

Figure 4.61: DataFrame displaying the first five records from the .csv file

3. Create a script that will read a text file line by line and extract the first line, which is the header of the .csv file:

names = []
with open('adult_income_names.txt','r') as f:
    for line in f:
        f.readline()
        var=line.split(":")[0]
        names.append(var)
names

The output is as follows:

Figure 4.62: Names of the columns in the database

4. Add a name of Income for the response variable (last column) to the dataset by using the append command:

names.append('Income')

5. Read the new file again using the following command:

df = pd.read_csv("adult_income_data.csv",names=names)
df.head()

The output is as follows:

Figure 4.63: DataFrame with the income column added

6. Use the describe command to get the statistical summary of the dataset:

df.describe()

The output is as follows:

Figure 4.64: Statistical summary of the dataset

Note that only a small number of columns are included. Many variables in the dataset have multiple factors or classes.

7. Make a list of all the variables with classes by using the following command:

# Make a list of all variables with classes
vars_class = ['workclass','education','marital-status',
              'occupation','relationship','sex','native-country']

8. Create a loop to count and print them by using the following command:

for v in vars_class:
    classes=df[v].unique()
    num_classes = df[v].nunique()
    print("There are {} classes in the \"{}\" column. They are: {}".format(num_classes,v,classes))
    print("-"*100)

The output is as follows:

Figure 4.65: Output of different factors or classes

9. Find the missing values by using the following command:

df.isnull().sum()

The output is as follows:

Figure 4.66: Finding the missing values

10. Create a DataFrame with only age, education, and occupation by using subsetting:

df_subset = df[['age','education','occupation']]
df_subset.head()

The output is as follows:

Figure 4.67: Subset DataFrame

11. Plot a histogram of age with a bin size of 20:

df_subset['age'].hist(bins=20)

The output is as follows:

<matplotlib.axes._subplots.AxesSubplot at 0x19dea8d0>

Figure 4.68: Histogram of age with a bin size of 20

12. Plot boxplots for age grouped by education (use a long figure size of 25x10 and make the x-tick font size 15):

df_subset.boxplot(column='age',by='education',figsize=(25,10))
plt.xticks(fontsize=15)
plt.xlabel("Education",fontsize=20)
plt.show()

The output is as follows:

Figure 4.69: Boxplot of age grouped by education

Before doing any further operations, we need to use the apply method we learned in this chapter. It turns out that when reading the dataset from the CSV file, all the strings came with a whitespace character in front. So, we need to remove that whitespace from all the strings.

13. Create a function to strip the whitespace characters:

def strip_whitespace(s):
    return s.strip()

14. Use the apply method to apply this function to all the columns with string values, create a new column, copy the values from this new column to the old column, and drop the new column. This is the preferred method so that you don't accidentally delete valuable data. Most of the time, you will need to create a new column with a desired operation and then copy it back to the old column if necessary. Ignore any warning messages that are printed:

# Education column
df_subset['education_stripped']=df['education'].apply(strip_whitespace)
df_subset['education']=df_subset['education_stripped']
df_subset.drop(labels=['education_stripped'],axis=1,inplace=True)

# Occupation column
df_subset['occupation_stripped']=df['occupation'].apply(strip_whitespace)
df_subset['occupation']=df_subset['occupation_stripped']
df_subset.drop(labels=['occupation_stripped'],axis=1,inplace=True)

This is the sample warning message, which you should ignore:

Figure 4.70: Warning message to be ignored
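As an aside (not part of the book's solution), pandas also exposes vectorized string methods, so the same cleanup can be done without a custom function; a minimal sketch would be:

# Equivalent cleanup using pandas' built-in vectorized string accessor
df_subset['education'] = df['education'].str.strip()
df_subset['occupation'] = df['occupation'].str.strip()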

15. Find the number of people who are aged between 30 and 50 (inclusive) by using the following command:

# Conditional clauses joined by & (AND)
df_filtered=df_subset[(df_subset['age']>=30) & (df_subset['age']<=50)]

Check the contents of the new dataset:

df_filtered.head()

The output is as follows:

Figure 4.71: Contents of new DataFrame

16. Find the shape of the filtered DataFrame and specify the index of the tuple as 0 to return the first element:

answer_1=df_filtered.shape[0]
answer_1

The output is as follows:

1630

17. Print the number of people aged between 30 and 50 using the following command:

print("There are {} people of age between 30 and 50 in this dataset.".format(answer_1))

The output is as follows:

There are 1630 people of age between 30 and 50 in this dataset.

18. Group the records based on occupation to find how the mean age is distributed:

df_subset.groupby('occupation').describe()['age']

The output is as follows:

Figure 4.72: DataFrame with data grouped by age and education

The code returns 79 rows × 1 columns.

19. Group by occupation and show the summary statistics of age. Find which profession has the oldest workers on average and which profession has the largest share of its workforce above the 75th percentile:

df_subset.groupby('occupation').describe()['age']

The output is as follows:

Figure 4.73: DataFrame showing summary statistics of age

Is there a particular occupation group that has very low representation? Perhaps we should remove those pieces of data because, with very little data, the group won't be useful in analysis. Actually, just by looking at the preceding table, you should be able to see that the Armed-Forces group has a count of only 9, that is, 9 data points. But how can we detect this? By plotting the count column in a bar chart. Note how the first argument to the barh function is the index of the DataFrame, which is the summary stats of the occupation groups. We can see that the Armed-Forces group has almost no data. This exercise teaches you that, sometimes, the outlier is not just a value, but can be a whole group. The data of this group is fine, but it is too small to be useful for any analysis. So, it can be treated as an outlier in this case. But always use your business knowledge and engineering judgement for such outlier detection and for deciding how to process it.

20. Use subset and groupby to find the outliers:

occupation_stats= df_subset.groupby('occupation').describe()['age']

21. Plot the values on a bar chart:

plt.figure(figsize=(15,8))
plt.barh(y=occupation_stats.index, width=occupation_stats['count'])
plt.yticks(fontsize=13)
plt.show()

The output is as follows:

Figure 4.74: Bar chart displaying occupation statistics

22. Practice merging by common keys. Suppose you are given two datasets where the common key is occupation. First, create two such disjoint datasets by taking random samples from the full dataset and then try merging. Include at least two other columns, along with the common key column, for each dataset. Notice how the resulting dataset, after merging, may have more data points than either of the two starting datasets if your common key is not unique:

df_1 = df[['age', 'workclass', 'occupation']].sample(5,random_state=101)
df_1.head()

The output is as follows:

Figure 4.75: Output after merging the common keys

The second dataset is as follows:

df_2 = df[['education', 'occupation']].sample(5,random_state=101)
df_2.head()

The output is as follows:

Figure 4.76: Output after merging the common keys

Merging the two datasets together:

df_merged = pd.merge(df_1,df_2, on='occupation', how='inner').drop_duplicates()
df_merged

The output is as follows:

Figure 4.77: Output of distinct occupation values

SOLUTION OF ACTIVITY 7: READING TABULAR DATA FROM A WEB PAGE AND CREATING DATAFRAMES
These are the steps to complete this activity:

1. Import BeautifulSoup and pandas by using the following command:

from bs4 import BeautifulSoup
import pandas as pd

2. Open the Wikipedia file by using the following command:

fd = open("List of countries by GDP (nominal) - Wikipedia.htm", "r")
soup = BeautifulSoup(fd)
fd.close()

3. Count the tables by using the following command:

all_tables = soup.find_all("table")
print("Total number of tables are {} ".format(len(all_tables)))

There are 9 tables in total.

4. Find the right table using the class attribute by using the following command:

data_table = soup.find("table", {"class": '"wikitable"|}'})
print(type(data_table))

The output is as follows:

<class 'bs4.element.Tag'>

5. Separate the sources and the actual data by using the following command:

sources = data_table.tbody.findAll('tr', recursive=False)[0]
sources_list = [td for td in sources.findAll('td')]
print(len(sources_list))

The output is as follows:

3

6. Use the findAll function to find the data from the data_table's body tag, using the following command:

data = data_table.tbody.findAll('tr', recursive=False)[1].findAll('td', recursive=False)

7. Use the findAll function to find the data from the data_table td tags by using the following command:

data_tables = []
for td in data:
    data_tables.append(td.findAll('table'))

8. Find the length of data_tables by using the following command:

len(data_tables)

The output is as follows:

3

9. Check how to get the source names by using the following command:

source_names = [source.findAll('a')[0].getText() for source in sources_list]
print(source_names)

The output is as follows:

['International Monetary Fund', 'World Bank', 'United Nations']

10. Separate the header and data for the first source:

header1 = [th.getText().strip() for th in data_tables[0][0].findAll('thead')[0].findAll('th')]
header1

The output is as follows:

['Rank', 'Country', 'GDP(US$MM)']

11. Find the rows from data_tables using findAll:

rows1 = data_tables[0][0].findAll('tbody')[0].findAll('tr')[1:]

12. Find the data from rows1 using the strip function for each td tag:

data_rows1 = [[td.get_text().strip() for td in tr.findAll('td')] for tr in rows1]

13. Create the DataFrame:

df1 = pd.DataFrame(data_rows1, columns=header1)
df1.head()

The output is as follows:

Figure 5.35: DataFrame created from the web page

14. Do the same for the other two sources by using the following command:

header2 = [th.getText().strip() for th in data_tables[1][0].findAll('thead')[0].findAll('th')]
header2

The output is as follows:

['Rank', 'Country', 'GDP(US$MM)']

15. Find the rows from data_tables using findAll by using the following command:

rows2 = data_tables[1][0].findAll('tbody')[0].findAll('tr')[1:]

16. Define find_right_text using the strip function by using the following command:

def find_right_text(i, td):
    if i == 0:
        return td.getText().strip()
    elif i == 1:
        return td.getText().strip()
    else:
        index = td.text.find("♠")
        return td.text[index+1:].strip()

17. Find the rows from data_rows using find_right_text by using the following command:

data_rows2 = [[find_right_text(i, td) for i, td in enumerate(tr.findAll('td'))] for tr in rows2]

18. Calculate the df2 DataFrame by using the following command:

df2 = pd.DataFrame(data_rows2, columns=header2)
df2.head()

The output is as follows:

Figure 5.36: Output of the DataFrame

19. Now, perform the same operations for the third DataFrame by using the following command:

header3 = [th.getText().strip() for th in data_tables[2][0].findAll('thead')[0].findAll('th')]
header3

The output is as follows:

['Rank', 'Country', 'GDP(US$MM)']

20. Find the rows from data_tables using findAll by using the following command:

rows3 = data_tables[2][0].findAll('tbody')[0].findAll('tr')[1:]

21. Find the rows from data_rows3 by using find_right_text:

data_rows3 = [[find_right_text(i, td) for i, td in enumerate(tr.findAll('td'))] for tr in rows3]

22. Calculate the df3 DataFrame by using the following command:

df3 = pd.DataFrame(data_rows3, columns=header3)
df3.head()

The output is as follows:

Figure 5.37: The third DataFrame

SOLUTION OF ACTIVITY 8: HANDLING OUTLIERS AND MISSING DATA
These are the steps to complete this activity:

1. Load the necessary libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

2. Read the .csv file:

df = pd.read_csv("visit_data.csv")

3. Print the data from the DataFrame:

df.head()

The output is as follows:

Figure 6.10: The contents of the CSV file

As we can see, there is data where some values are missing, and if we examine this, we will see some outliers.

4. Check for duplicates by using the following command:

print("First name is duplicated - {}".format(any(df.first_name.duplicated())))
print("Last name is duplicated - {}".format(any(df.last_name.duplicated())))
print("Email is duplicated - {}".format(any(df.email.duplicated())))

The output is as follows:

First name is duplicated - True
Last name is duplicated - True
Email is duplicated - False

There are duplicates in both the first and last names, which is normal. However, as we can see, there are no duplicates in email. That's good.

5. Check if any essential column contains NaN:

# Notice that we have different ways to format boolean values for the % operator
print("The column Email contains NaN - %r " % df.email.isnull().values.any())
print("The column IP Address contains NaN - %s " % df.ip_address.isnull().values.any())
print("The column Visit contains NaN - %s " % df.visit.isnull().values.any())

The output is as follows:

The column Email contains NaN - False
The column IP Address contains NaN - False
The column Visit contains NaN - True

The column visit contains some None values. Given that the final task at hand will probably be predicting the number of visits, we cannot do anything with rows that do not have that information. They are a type of outlier. Let's get rid of them.

6. Get rid of the outliers:

# There are various ways to do this. This is just one way. We encourage you to explore other ways.
# But before that we need to store the previous size of the dataset and we will compare it with the new size
size_prev = df.shape
df = df[np.isfinite(df['visit'])]  # After this operation, the original DataFrame is lost.
size_after = df.shape
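As an aside (not part of the book's solution), the same rows could also be dropped with pandas' own missing-data helper; a minimal sketch would be:

# Equivalent filtering using dropna on the 'visit' column
df = df.dropna(subset=['visit'])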

7. Report the size difference:

# Notice how a parameterized format is used and the indexing is done inside the quote marks
print("The size of previous data was - {prev[0]} rows and the size of the new one is - {after[0]} rows".
      format(prev=size_prev, after=size_after))

The output is as follows:

The size of previous data was - 1000 rows and the size of the new one is - 974 rows

8. Plot a boxplot to find out if the data has outliers:

plt.boxplot(df.visit, notch=True)

The output is as follows:

{'whiskers': [<matplotlib.lines.Line2D at 0x7fa04cc08668>,
  <matplotlib.lines.Line2D at 0x7fa04cc08b00>],
 'caps': [<matplotlib.lines.Line2D at 0x7fa04cc08f28>,
  <matplotlib.lines.Line2D at 0x7fa04cc11390>],
 'boxes': [<matplotlib.lines.Line2D at 0x7fa04cc08518>],
 'medians': [<matplotlib.lines.Line2D at 0x7fa04cc117b8>],
 'fliers': [<matplotlib.lines.Line2D at 0x7fa04cc11be0>],
 'means': []}

The boxplot is as follows:

Figure 6.43: Boxplot using the data

As we can see, we have data in this column in the interval (0, 3000). However, the main concentration of the data is between ~700 and ~2300.

9. Get rid of values beyond 2900 and below 100 – these are outliers for us. We need to get rid of them:

df1 = df[(df['visit'] <= 2900) & (df['visit'] >= 100)]  # Notice the powerful & operator

# Here we abuse the fact that the number of variables can be greater than the number of replacement targets
print("After getting rid of outliers the new size of the data is - {}".format(*df1.shape))

After getting rid of the outliers, the new size of the data is 923.

This is the end of the activity for this chapter.

SOLUTION OF ACTIVITY 9: EXTRACTING THE TOP 100 EBOOKS FROM GUTENBERG
These are the steps to complete this activity:

1. Import the necessary libraries, including regex and beautifulsoup:

import urllib.request, urllib.parse, urllib.error
import requests
from bs4 import BeautifulSoup
import ssl
import re

2. Check the SSL certificate:

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

3. Read the HTML from the URL:

# Read the HTML from the URL and pass on to BeautifulSoup
top100url = 'https://www.gutenberg.org/browse/scores/top'
response = requests.get(top100url)

4. Write a small function to check the status of the web request:

def status_check(r):
    if r.status_code==200:
        print("Success!")
        return 1
    else:
        print("Failed!")
        return -1

5. Check the status of response:

status_check(response)

The output is as follows:

Success!

6. Decode the response and pass it on to BeautifulSoup for HTML parsing:

contents = response.content.decode(response.encoding)
soup = BeautifulSoup(contents, 'html.parser')

7. Find all the href tags and store them in the list of links. Check what the list looks like – print the first 30 elements:

# Empty list to hold all the http links in the HTML page
lst_links=[]

# Find all the href tags and store them in the list of links
for link in soup.find_all('a'):
    #print(link.get('href'))
    lst_links.append(link.get('href'))

8. Print the links by using the following command:

lst_links[:30]

The output is as follows:

['/wiki/Main_Page', '/catalog/', '/ebooks/', '/browse/recent/last1', '/browse/scores/top',
 '/wiki/Gutenberg:Offline_Catalogs', '/catalog/world/mybookmarks', '/wiki/Main_Page',
 'https://www.paypal.com/xclick/business=donate%40gutenberg.org&item_name=Donation+to+Project+Gutenberg',
 '/wiki/Gutenberg:Project_Gutenberg_Needs_Your_Donation', 'http://www.ibiblio.org', 'http://www.pgdp.net/',
 'pretty-pictures', '#books-last1', '#authors-last1', '#books-last7', '#authors-last7',
 '#books-last30', '#authors-last30', '/ebooks/1342', '/ebooks/84', '/ebooks/1080', '/ebooks/46',
 '/ebooks/219', '/ebooks/2542', '/ebooks/98', '/ebooks/345', '/ebooks/2701', '/ebooks/844', '/ebooks/11']

9. Use a regular expression to find the numeric digits in these links. These are the file numbers for the top 100 books. Initialize the empty list to hold the file numbers:

booknum=[]

10. Numbers 19 to 118 in the original list of links have the top 100 eBooks' numbers. Loop over the appropriate range and use a regex to find the numeric digits in the link (href) string. Use the findall() method:

for i in range(19,119):
    link=lst_links[i]
    link=link.strip()
    # Regular expression to find the numeric digits in the link (href) string
    n=re.findall('[0-9]+',link)
    if len(n)==1:
        # Append the file number cast as an integer
        booknum.append(int(n[0]))

11. Print the file numbers:

print ("\nThe file numbers for the top 100 ebooks on Gutenberg are shown below\n"+"-"*70)
print(booknum)

The output is as follows:

The file numbers for the top 100 ebooks on Gutenberg are shown below
----------------------------------------------------------------------

[1342, 84, 1080, 46, 219, 2542, 98, 345, 2701, 844, 11, 5200, 43, 16328, 76, 74, 1952, 6130, 2591,
1661, 41, 174, 23, 1260, 1497, 408, 3207, 1400, 30254, 58271, 1232, 25344, 58269, 158, 44881, 1322,
205, 2554, 1184, 2600, 120, 16, 58276, 5740, 34901, 28054, 829, 33, 2814, 4300, 100, 55, 160, 1404,
786, 58267, 3600, 19942, 8800, 514, 244, 2500, 2852, 135, 768, 58263, 1251, 3825, 779, 58262, 203,
730, 20203, 35, 1250, 45, 161, 30360, 7370, 58274, 209, 27827, 58256, 33283, 4363, 375, 996, 58270,
521, 58268, 36, 815, 1934, 3296, 58279, 105, 2148, 932, 1064, 13415]

12. What does the soup object's text look like? Use the .text method and print only the first 2,000 characters (do not print the whole thing, as it is too long).

You will notice a lot of empty spaces/blanks here and there. Ignore them. They are part of the HTML page's markup and its whimsical nature:

print(soup.text[:2000])

The output is as follows:

if (top != self) {
    top.location.replace (http://www.gutenberg.org);
    alert ('Project Gutenberg is a FREE service with NO membership required. If you paid somebody else to get here, make them give you your money back!');

Top 100 - Project Gutenberg
Online Book Catalog
Book Search
-- Recent Books
-- Top 100
-- Offline Catalogs
-- My Bookmarks
Main Page
Pretty Pictures
Top 100 EBooks yesterday —
Top 100 Authors yesterday —
Top 100 EBooks last 7 days —
Top 100 Authors last 7 days —
Top 100 EBooks last 30 days —
Top 100 Authors last 30 days
Top 100 EBooks yesterday
Pride and Prejudice by Jane Austen (1826)
Frankenstein; Or, The Modern Prometheus by Mary Wollstonecraft Shelley (1367)
A Modest Proposal by Jonathan Swift (1020)
A Christmas Carol in Prose; Being a Ghost Story of Christmas by Charles Dickens (953)
Heart of Darkness by Joseph Conrad (887)
Et dukkehjem. English by Henrik Ibsen (761)
A Tale of Two Cities by Charles Dickens (741)
Dracula by Bram Stoker (732)
Moby Dick; Or, The Whale by Herman Melville (651)
The Importance of Being Earnest: A Trivial Comedy for Serious People by Oscar Wilde (646)
Alice's Adventures in Wonderland by Lewis Carrol

13. Search the extracted text (using a regular expression) from the soup object to find the names of the top 100 eBooks (yesterday's rank):

# Temp empty list of Ebook names
lst_titles_temp=[]

14. Create a starting index. It should point at the text Top 100 EBooks yesterday. Use the splitlines method of soup.text. It splits the lines of the text of the soup object:

start_idx=soup.text.splitlines().index('Top 100 EBooks yesterday')

15. Loop 1-100 to add the strings of the next 100 lines to this temporary list. Hint: use the splitlines method:

for i in range(100):
    lst_titles_temp.append(soup.text.splitlines()[start_idx+2+i])

16. Use a regular expression to extract only text from the name strings and append them to an empty list. Use match and span to find the indices and use them:

lst_titles=[]
for i in range(100):
    id1,id2=re.match('^[a-zA-Z ]*',lst_titles_temp[i]).span()
    lst_titles.append(lst_titles_temp[i][id1:id2])

17. Print the list of titles:

for l in lst_titles:
    print(l)

The output is as follows:

Pride and Prejudice by Jane Austen
Frankenstein
A Modest Proposal by Jonathan Swift
A Christmas Carol in Prose
Heart of Darkness by Joseph Conrad
Et dukkehjem
A Tale of Two Cities by Charles Dickens
Dracula by Bram Stoker
Moby Dick
The Importance of Being Earnest
Alice
Metamorphosis by Franz Kafka
The Strange Case of Dr
Beowulf
The Russian Army and the Japanese War
Calculus Made Easy by Silvanus P
Beyond Good and Evil by Friedrich Wilhelm Nietzsche
An Occurrence at Owl Creek Bridge by Ambrose Bierce
Don Quixote by Miguel de Cervantes Saavedra
Blue Jackets by Edward Greey
The Life and Adventures of Robinson Crusoe by Daniel Defoe
The Waterloo Campaign
The War of the Worlds by H
Democracy in America
Songs of Innocence
The Confessions of St
Modern French Masters by Marie Van Vorst
Persuasion by Jane Austen
The Works of Edgar Allan Poe
The Fall of the House of Usher by Edgar Allan Poe
The Masque of the Red Death by Edgar Allan Poe
The Lady with the Dog and Other Stories by Anton Pavlovich Chekhov

SOLUTION OF ACTIVITY 10: EXTRACTING THE TOP 100 EBOOKS FROM GUTENBERG.ORG
These are the steps to complete this activity:

1. Import urllib.request, urllib.parse, urllib.error, and json:

import urllib.request, urllib.parse, urllib.error
import json

2. Load the secret API key (you have to get one from the OMDB website and use that; it has a 1,000 daily limit) from a JSON file, stored in the same folder, into a variable, by using json.loads():

Note

The following cell will not be executed in the solution notebook because the author cannot give out their private API key.

3. The students/users/instructors will need to obtain a key and store it in a JSON file. We are calling this file APIkeys.json.
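As an illustrative sketch (the key value shown is only a placeholder, and the key name OMDBapi matches the one read in the next step), the APIkeys.json file could be created once like this:

# Run once to create APIkeys.json with your own key (the value below is a placeholder)
import json

with open('APIkeys.json', 'w') as f:
    json.dump({'OMDBapi': 'your_secret_api_key_here'}, f)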

4. Open the APIkeys.json file by using the following command:

with open('APIkeys.json') as f:
    keys = json.load(f)
    omdbapi = keys['OMDBapi']

The final URL to be passed should look like this: http://www.omdbapi.com/?t=movie_name&apikey=secretapikey.

5. Assign the OMDB portal (http://www.omdbapi.com/?) as a string to a variable called serviceurl by using the following command:

serviceurl = 'http://www.omdbapi.com/?'

6. Create a variable called apikey with the last portion of the URL (&apikey=secretapikey), where secretapikey is your own API key. The movie name portion is t=movie_name, and will be addressed later:

apikey = '&apikey='+omdbapi

7. Write a utility function called print_json to print the movie data from a JSON file (which we will get from the portal). Here are the keys of a JSON file: 'Title', 'Year', 'Rated', 'Released', 'Runtime', 'Genre', 'Director', 'Writer', 'Actors', 'Plot', 'Language', 'Country', 'Awards', 'Ratings', 'Metascore', 'imdbRating', 'imdbVotes', and 'imdbID':

def print_json(json_data):
    list_keys=['Title', 'Year', 'Rated', 'Released', 'Runtime', 'Genre', 'Director', 'Writer',
               'Actors', 'Plot', 'Language', 'Country', 'Awards', 'Ratings', 'Metascore',
               'imdbRating', 'imdbVotes', 'imdbID']
    print("-"*50)
    for k in list_keys:
        if k in list(json_data.keys()):
            print(f"{k}: {json_data[k]}")
    print("-"*50)

8. Write a utility function to download a poster of the movie based on the information from the JSON dataset and save it in your local folder. Use the os module. The poster data is stored in the JSON key Poster. You may want to split the name of the Poster file and extract the file extension only. Let's say that the extension is jpg. We would later join this extension to the movie name and create a filename such as movie.jpg. Use the Python command open to open a file and write the poster data. Close the file after you're done. This function may not return anything. It just saves the poster data as an image file:

def save_poster(json_data):
    import os
    title = json_data['Title']
    poster_url = json_data['Poster']
    # Splits the poster url by '.' and picks up the last string as file extension
    poster_file_extension=poster_url.split('.')[-1]
    # Reads the image file from web
    poster_data = urllib.request.urlopen(poster_url).read()
    savelocation=os.getcwd()+'\\'+'Posters'+'\\'
    # Creates new directory if the directory does not exist. Otherwise, just use the existing path.
    if not os.path.isdir(savelocation):
        os.mkdir(savelocation)
    filename=savelocation+str(title)+'.'+poster_file_extension
    f=open(filename,'wb')
    f.write(poster_data)
    f.close()

9. Write a utility function called search_movie to search for a movie by its name, print the downloaded JSON data (use the print_json function for this), and save the movie poster in the local folder (use the save_poster function for this). Use a try-except block for this, that is, try to connect to the web portal. If successful, proceed, but if not (that is, if an exception is raised), then just print an error message. Use the previously created variables serviceurl and apikey. You have to pass on a dictionary with a key, t, and the movie name as the corresponding value to the urllib.parse.urlencode function and then add the serviceurl and apikey to the output of the function to construct the full URL. This URL will be used for accessing the data. The JSON data has a key called Response. If it is True, that means that the read was successful. Check this before processing the data. If it was not successful, then print the JSON key Error, which will contain the appropriate error message that's returned by the movie database:

def search_movie(title):
    try:
        url = serviceurl + urllib.parse.urlencode({'t': str(title)})+apikey
        print(f'Retrieving the data of "{title}" now... ')
        print(url)
        uh = urllib.request.urlopen(url)
        data = uh.read()
        json_data=json.loads(data)
        if json_data['Response']=='True':
            print_json(json_data)
            # Downloads the poster of the movie if one is available
            if json_data['Poster']!='N/A':
                save_poster(json_data)
        else:
            print("Error encountered: ",json_data['Error'])
    except urllib.error.URLError as e:
        print(f"ERROR: {e.reason}")

10. Test the search_movie function by entering Titanic:

search_movie("Titanic")

The following is the retrieved data for Titanic:

http://www.omdbapi.com/?t=Titanic&apikey=17cdc959
--------------------------------------------------
Title: Titanic
Year: 1997
Rated: PG-13
Released: 19 Dec 1997
Runtime: 194 min
Genre: Drama, Romance
Director: James Cameron
Writer: James Cameron
Actors: Leonardo DiCaprio, Kate Winslet, Billy Zane, Kathy Bates
Plot: A seventeen-year-old aristocrat falls in love with a kind but poor artist aboard the luxurious, ill-fated R.M.S. Titanic.
Language: English, Swedish
Country: USA
Awards: Won 11 Oscars. Another 111 wins & 77 nominations.
Ratings: [{'Source': 'Internet Movie Database', 'Value': '7.8/10'}, {'Source': 'Rotten Tomatoes', 'Value': '89%'}, {'Source': 'Metacritic', 'Value': '75/100'}]
Metascore: 75
imdbRating: 7.8
imdbVotes: 913,780
imdbID: tt0120338
--------------------------------------------------

11. Test the search_movie function by entering "Random_error" (obviously, this will not be found, and you should be able to check whether your error-catching code is working properly):

search_movie("Random_error")

The retrieved data for "Random_error" is as follows:

http://www.omdbapi.com/?t=Random_error&apikey=17cdc959
Error encountered:  Movie not found!

Look for a folder called Posters in the same directory you are working in. It should contain a file called Titanic.jpg. Check the file.

SOLUTION OF ACTIVITY 11: RETRIEVING DATA CORRECTLY FROM DATABASES
These are the steps to complete this activity:

1. Connect to the supplied petsDB database:

import sqlite3
conn = sqlite3.connect("petsdb")

2. Write a function to check whether the connection has been successful:

# a tiny function to make sure the connection is successful
def is_opened(conn):
    try:
        conn.execute("SELECT * FROM persons LIMIT 1")
        return True
    except sqlite3.ProgrammingError as e:
        print("Connection closed {}".format(e))
        return False

print(is_opened(conn))

The output is as follows:

True

3. Close the connection:

conn.close()

4. Check whether the connection is open or closed:

print(is_opened(conn))

The output is as follows:

False

5. Find out the different age groups in the persons database. Connect to the supplied petsDB database:

conn = sqlite3.connect("petsdb")
c = conn.cursor()

6. Execute the following command:

for ppl, age in c.execute("SELECT count(*), age FROM persons GROUP BY age"):
    print("We have {} people aged {}".format(ppl, age))

The output is as follows:

Figure 8.17: Section of output grouped by age

7. To find out which age group has the highest number of people, execute the following command:

for ppl, age in c.execute(
    "SELECT count(*), age FROM persons GROUP BY age ORDER BY count(*) DESC"):
    print("Highest number of people is {} and came from {} age group".format(ppl, age))
    break

The output is as follows:

Highest number of people is 5 and came from 73 age group

8. To find out how many people do not have a full name (the last name is blank/null), execute the following command:

res = c.execute("SELECT count(*) FROM persons WHERE last_name IS null")
for row in res:
    print(row)

The output is as follows:

(60,)

9. To find out how many people have more than one pet, execute the following command:

res = c.execute("SELECT count(*) FROM (SELECT count(owner_id) FROM pets GROUP BY owner_id HAVING count(owner_id) >1)")
for row in res:
    print("{} people have more than one pet".format(row[0]))

The output is as follows:

43 people have more than one pet

10. To find out how many pets have received treatment, execute the following command:

res = c.execute("SELECT count(*) FROM pets WHERE treatment_done=1")
for row in res:
    print(row)

The output is as follows:

(36,)

11. To find out how many pets have received treatment and whose pet type is known, execute the following command:

res = c.execute("SELECT count(*) FROM pets WHERE treatment_done=1 AND pet_type IS NOT null")
for row in res:
    print(row)

The output is as follows:

(16,)

12. To find out how many pets are from the city called "east port", execute the following command:

res = c.execute("SELECT count(*) FROM pets JOIN persons ON pets.owner_id = persons.id WHERE persons.city='east port'")
for row in res:
    print(row)

The output is as follows:

(49,)

13. To find out how many pets are from the city called "east port" and received treatment, execute the following command:

res = c.execute("SELECT count(*) FROM pets JOIN persons ON pets.owner_id = persons.id WHERE persons.city='east port' AND pets.treatment_done=1")
for row in res:
    print(row)

The output is as follows:

(11,)

SOLUTION OF ACTIVITY 12: DATA WRANGLING TASK – FIXING UN DATA
These are the steps to complete this activity:

1. Import the required libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

2. Save the URL of the dataset and use the pandas read_csv method to directly pass this link and create a DataFrame:

education_data_link="http://data.un.org/_Docs/SYB/CSV/SYB61_T07_Education.csv"
df1 = pd.read_csv(education_data_link)

3. Print the data in the DataFrame:

df1.head()

The output is as follows:

Figure 9.3: DataFrame from the UN data

4. As the first row does not contain useful information, use the skiprows parameter to remove the first row:

df1 = pd.read_csv(education_data_link,skiprows=1)

5. Print the data in the DataFrame:

df1.head()

The output is as follows:

Figure 9.4: DataFrame after removing the first row

6. Drop the columns Region/Country/Area and Source, as they will not be very helpful:

df2 = df1.drop(['Region/Country/Area','Source'],axis=1)

7. Assign the following names as the columns of the DataFrame: ['Region/Country/Area', 'Year', 'Data', 'Enrollments (Thousands)', 'Footnotes']:

df2.columns=['Region/Country/Area','Year','Data','Enrollments (Thousands)','Footnotes']

8. Print the data in the DataFrame:

df2.head()

The output is as follows:

Figure 9.5: DataFrame after dropping the Region/Country/Area and Source columns

9. Check how many unique values the Footnotes column contains:

df2['Footnotes'].unique()

The output is as follows:

Figure 9.6: Unique values of the Footnotes column

10. Check the type of the Enrollments (Thousands) column's data; it will need to be converted into a numeric type for further processing:

type(df2['Enrollments (Thousands)'][0])

The output is as follows:

str

11. Create a utility function to convert the strings in the Enrollments (Thousands) column into floating-point numbers:

def to_numeric(val):
    """
    Converts a given string (with one or more commas) to a numeric value
    """
    if ',' not in str(val):
        result = float(val)
    else:
        val=str(val)
        val=''.join(str(val).split(','))
        result=float(val)
    return result
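As an aside (not part of the book's solution), pandas can perform the same comma-stripping conversion in a vectorized way; a minimal sketch would be:

# Equivalent vectorized conversion: remove the commas, then cast to numeric
df2['Enrollments (Thousands)'] = pd.to_numeric(
    df2['Enrollments (Thousands)'].astype(str).str.replace(',', ''))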

1 2 . Use th e apply m eth od to


apply th is fu nction to th e
Value colu m n data:

df2['Enrollments
(Thousands)']=df2['Enr
ollments
(Thousands)'].apply(to
_numeric)
1 3 . Pr int th e u niqu e ty pes of
data in th e Data colu m n:

df2['Data'].unique()

Th e ou tpu t is as follow s:

Figure 9.7:Unique values in a column

1 4 . Cr eate th r ee DataFr am es by
filter ing and selecting th em
fr om th e or iginal
DataFr am e:
1 . df_primary :
Only stu dents
enr olled in
pr im ar y
edu cation
(th ou sands)

2 . df_secondary :
Only stu dents
enr olled in
secondar y
edu cation
(th ou sands)

3 . df_t ert iary :


Only stu dents
enr olled in
ter tiar y
edu cation
(th ou sands):

df_primary =
df2[df2['Data
']=='Students
enrolled in
primary
education
(thousands)']
df_secondary
=
df2[df2['Data
']=='Students
enrolled in
secondary
education
(thousands)']

df_tertiary =
df2[df2['Data
']=='Students
enrolled in
tertiary
education
(thousands)']

1 5. Com par e th em u sing bar


ch ar ts of th e pr im ar y
stu dents' enr ollm ent of a low -
incom e cou ntr y and a h igh -
incom e cou ntr y :

primary_enrollment_ind
ia =
df_primary[df_primary[
'Region/Country/Area']
=='India']

primary_enrollment_USA
=
df_primary[df_primary[
'Region/Country/Area']
=='United States of
America']

1 6 . Pr int th e
primary_enrollment_india
data:

primary_enrollment_ind
ia

Th e ou tpu t is as follow s:
Figure 9.8: Data for the enrollment in
primary education in India

1 7 . Pr int th e
primary_enrollment_USA
data:

primary_enrollment_USA

Th e ou tpu t is as follow s:

Figure 9.9: Data for the enrollment in


primary education in USA

1 8. Plot th e data for India:

plt.figure(figsize=
(8,4))

plt.bar(primary_enroll
ment_india['Year'],pri
mary_enrollment_india[
'Enrollments
(Thousands)'])

plt.title("Enrollment
in primary
education\nin India
(in
thousands)",fontsize=1
6)

plt.grid(True)

plt.xticks(fontsize=14
)
plt.yticks(fontsize=14
)

plt.xlabel("Year",
fontsize=15)

plt.show()

Th e ou tpu t is as follow s:

Figure 9.10: Bar plot for the enrollment in


primary education in India

1 9 . Plot th e data for th e USA :

plt.figure(figsize=
(8,4))

plt.bar(primary_enroll
ment_USA['Year'],prima
ry_enrollment_USA['Enr
ollments
(Thousands)'])

plt.title("Enrollment
in primary
education\nin the
United States of
America (in
thousands)",fontsize=1
6)

plt.grid(True)

plt.xticks(fontsize=14
)
plt.yticks(fontsize=14
)

plt.xlabel("Year",
fontsize=15)

plt.show()

Th e ou tpu t is as follow s:

Figure 9.11: Bar plot for the enrollment in


primary education in the USA

Data im pu tation: Clear ly ,


w e ar e m issing som e data.
Let's say w e decide to im pu te
th ese data points by sim ple
linear inter polation betw een
th e av ailable data points. We
can take ou t a pen and paper
or a calcu lator and com pu te
th ose v alu es and m anu ally
cr eate a dataset som eh ow .
Bu t being a data w r angler ,
w e w ill of cou r se take
adv antage of Py th on
pr ogr am m ing, and u se
pandas im pu tation m eth ods
for th is task. Bu t to do th at,
w e fir st need to cr eate a
DataFr am e w ith m issing
v alu es inser ted – th at is, w e
need to append anoth er
DataFr am e w ith m issing
v alu es to th e cu r r ent
DataFr am e.

(For India) Append t he


rows corresponding t o
missing t he y ears – 2004
- 2009, 2011 – 2013.

2 0. Find th e m issing y ear s:

missing_years = [y for
y in
range(2004,2010)]+[y
for y in
range(2011,2014)]

2 1 . Pr int th e v alu e in th e
missing_years variable:

missing_years

Th e ou tpu t is as follow s:

[2004, 2005, 2006,


2007, 2008, 2009,
2011, 2012, 2013]

2 2 . Cr eate a dictionar y of v alu es


w ith np.nan. Note th at th er e
ar e 9 m issing data points, so
w e need to cr eate a list w ith
identical v alu es r epeated 9
tim es:

dict_missing =
{'Region/Country/Area'
:
['India']*9,'Year':mis
sing_years,

'Data':'Students
enrolled in primary
education
(thousands)'*9,

'Enrollments
(Thousands)':
[np.nan]*9,'Footnotes'
:[np.nan]*9}

2 3 . Cr eate a DataFr am e of
m issing v alu es (fr om th e
pr eceding dictionar y ) th at
w e can append:
df_missing =
pd.DataFrame(data=dict
_missing)

2 4 . A ppend th e new DataFr am es


to pr ev iou sly existing ones:

primary_enrollment_ind
ia=primary_enrollment_
india.append(df_missin
g,ignore_index=True,so
rt=True)

2 5. Pr int th e data in
primary_enrollment_india
:

primary_enrollment_ind
ia

Th e ou tpu t is as follow s:

Figure 9.12: Data for the enrollment in


primary education in India a er
appending the data

2 6 . Sor t by year and r eset th e


indices u sing reset_index.
Use inplace=True to execu te
th e ch anges on th e
DataFr am e itself:

primary_enrollment_ind
ia.sort_values(by='Yea
r',inplace=True)

primary_enrollment_ind
ia.reset_index(inplace
=True,drop=True)

2 7 . Pr int th e data in
primary_enrollment_india
:

primary_enrollment_ind
ia

Th e ou tpu t is as follow s:

Figure 9.13: Data for the enrollment in


primary education in India a er sorting
the data

2 8. Use th e interpolate m eth od


for linear inter polation. It
fills all th e NaN by linear ly
inter polated v alu es. Ch eck
ou t th is link for m or e details
abou t th is m eth od:
h ttp://pandas.py data.or g/p
andas-
docs/v er sion/0.1 7 /gener ate
d/pandas.DataFr am e.inter p
olate.h tm l:

primary_enrollment_ind
ia.interpolate(inplace
=True)

2 9 . Pr int th e data in
primary_enrollment_india
:

primary_enrollment_ind
ia

Th e ou tpu t is as follow s:

Figure 9.14: Data for the enrollment in


primary education in India a er
interpolating the data

3 0. Plot th e data:

plt.figure(figsize=
(8,4))
plt.bar(primary_enroll
ment_india['Year'],pri
mary_enrollment_india[
'Enrollments
(Thousands)'])

plt.title("Enrollment
in primary
education\nin India
(in
thousands)",fontsize=1
6)

plt.grid(True)

plt.xticks(fontsize=14
)

plt.yticks(fontsize=14
)

plt.xlabel("Year",
fontsize=15)

plt.show()

Th e ou tpu t is as follow s:

Figure 9.15: Bar plot for the enrollment in


primary education in India

3 1 . Repeat th e sam e steps for th e


USA :

missing_years =
[2004]+[y for y in
range(2006,2010)]+[y
for y in
range(2011,2014)]+
[2016]

3 2 . Pr int th e v alu e in
missing_years.

missing_years

Th e ou tpu t is as follow s:

[2004, 2006, 2007,


2008, 2009, 2011,
2012, 2013, 2016]

3 3 . Cr eate dict_missing, as
follow s:

dict_missing =
{'Region/Country/Area'
:['United States of
America']*9,'Year':mis
sing_years,
'Data':'Students
enrolled in primary
education
(thousands)'*9,
'Value':
[np.nan]*9,'Footnotes'
:[np.nan]*9}

3 4 . Cr eate th e DataFr am e fpr


df_missing, as follow s:

df_missing =
pd.DataFrame(data=dict
_missing)

3 5. A ppend th is to th e
primary_enrollment_USA
v ar iable, as follow s:

primary_enrollment_USA
=primary_enrollment_US
A.append(df_missing,ig
nore_index=True,sort=T
rue)

3 6 . Sor t th e v alu es in th e
primary_enrollment_USA
v ar iable, as follow s:

primary_enrollment_USA
.sort_values(by='Year'
,inplace=True)
3 7 . Reset th e index of th e
primary_enrollment_USA
v ar iable, as follow s:

primary_enrollment_USA
.reset_index(inplace=T
rue,drop=True)

3 8. Inter polate th e
primary_enrollment_USA
v ar iable, as follow s:

primary_enrollment_USA
.interpolate(inplace=T
rue)

3 9 . Pr int th e
primary_enrollment_USA
v ar iable:

primary_enrollment_USA

Th e ou tpu t is as follow s:

Figure 9.16: Data for the enrollment in


primary education in USA a er all
operations have been completed

40. Still, the first value is unfilled. We can use the limit and limit_direction parameters of the interpolate method to fill it. How did we know this? By searching on Google and finding the relevant StackOverflow page. Always search for the solution to your problem, look at what has already been done, and try to implement it (a small standalone sketch follows this step's output):

primary_enrollment_USA.interpolate(method='linear', limit_direction='backward', limit=1)

The output is as follows:

Figure 9.17: Data for the enrollment in primary education in the USA after limiting the data
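
As a standalone illustration of what limit_direction='backward' with limit=1 does, consider this hypothetical Series, with values invented for the example:

import numpy as np
import pandas as pd

# The first entry is missing, just like the 2003 value in the USA data
s = pd.Series([np.nan, 24000.0, np.nan, 26000.0])

# limit_direction='backward' lets the gap at the start be filled from the
# value that follows it; limit=1 fills at most one consecutive NaN per gap
print(s.interpolate(method='linear', limit_direction='backward', limit=1))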

41. Print the data in primary_enrollment_USA:

primary_enrollment_USA

The output is as follows:

Figure 9.18: Data for the enrollment in primary education in the USA

42. Plot the data:

plt.figure(figsize=(8,4))

plt.bar(primary_enrollment_USA['Year'], primary_enrollment_USA['Enrollments (Thousands)'])

plt.title("Enrollment in primary education\nin the United States of America (in thousands)", fontsize=16)

plt.grid(True)

plt.xticks(fontsize=14)

plt.yticks(fontsize=14)

plt.xlabel("Year", fontsize=15)

plt.show()

The output is as follows:

Figure 9.19: Bar plot for the enrollment in primary education in the USA

ACTIVITY 13: DATA WRANGLING TASK – CLEANING GDP DATA

These are the steps to complete this activity:

1. GDP data for India: We will try to read the GDP data for India from a CSV file that was found on a World Bank portal. It is given to you and is also hosted on the Packt GitHub repository. However, the pandas read_csv method will throw an error if we try to read it normally. Let's look at a step-by-step guide on how we can read useful information from it:

df3=pd.read_csv("India_World_Bank_Info.csv")

The output is as follows:

---------------------------------------------------------------------------
ParserError Traceback (most recent call last)
<ipython-input-45-9239cae67df7> in <module>()
…..
ParserError: Error tokenizing data. C error: Expected 1 fields in line 6, saw 3

We can try and use the error_bad_lines=False option in this kind of situation.

2. Read the India World Bank Information .csv file:

df3=pd.read_csv("India_World_Bank_Info.csv", error_bad_lines=False)

df3.head(10)

The output is as follows:

Figure 9.20: DataFrame from the India World Bank Information

Note:

At times, the output may not be as expected, because some lines contain three fields instead of the expected one and are therefore skipped.
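
If you are on a recent pandas release (1.3 or later), note that error_bad_lines is deprecated and has since been removed in favor of the on_bad_lines parameter; an equivalent call on those versions would look like this:

import pandas as pd

# Equivalent on pandas >= 1.3: skip lines that have too many fields
df3 = pd.read_csv("India_World_Bank_Info.csv", on_bad_lines='skip')

df3.head(10)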

3. Clearly, the delimiter in this file is a tab (\t):

df3=pd.read_csv("India_World_Bank_Info.csv", error_bad_lines=False, delimiter='\t')

df3.head(10)

The output is as follows:

Figure 9.21: DataFrame from the India World Bank Information after using a delimiter

4. Use the skiprows parameter to skip the first 4 rows:

df3=pd.read_csv("India_World_Bank_Info.csv", error_bad_lines=False, delimiter='\t', skiprows=4)

df3.head(10)

The output is as follows:

Figure 9.22: DataFrame from the India World Bank Information after using skiprows

5. Closely examine the dataset. In this file, the columns are the yearly data and the rows are the various types of information. Upon examining the file with Excel, we find that the column Indicator Name holds the name of each particular data type. We filter the dataset down to the information we are interested in and also transpose it (the rows and columns are interchanged) to give it a format similar to our previous education dataset:

df4=df3[df3['Indicator Name']=='GDP per capita (current US$)'].T

df4.head(10)

The output is as follows:

Figure 9.23: DataFrame focusing on GDP per capita

6. There is no index, so let's use reset_index again:

df4.reset_index(inplace=True)

df4.head(10)

The output is as follows:

Figure 9.24: DataFrame from the India World Bank Information using reset_index

7. The first 3 rows aren't useful. We can redefine the DataFrame without them and then re-index again:

df4.drop([0,1,2], inplace=True)

df4.reset_index(inplace=True, drop=True)

df4.head(10)

The output is as follows:

Figure 9.25: DataFrame from the India World Bank Information after dropping and resetting the index

8. Let's rename the columns properly (this is necessary for merging, which we will look at shortly):

df4.columns=['Year','GDP']

df4.head(10)

The output is as follows:

Figure 9.26: DataFrame focusing on Year and GDP

9. It looks like we have GDP data from 1960 onward, but we are interested in 2003-2016. Let's examine the last 20 rows:

df4.tail(20)

The output is as follows:

Figure 9.27: DataFrame from the India World Bank Information

10. So, we should be good with rows 43-56. Let's create a DataFrame called df_gdp:

df_gdp=df4.iloc[[i for i in range(43,57)]]

df_gdp

The output is as follows:

Figure 9.28: DataFrame from the India World Bank Information
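
As an aside, the list comprehension inside iloc in the preceding step is not strictly needed; a plain positional slice selects the same rows. This is an optional simplification, not part of the original solution:

# Rows 43 to 56 inclusive (the stop index, 57, is exclusive);
# .copy() avoids a SettingWithCopyWarning when df_gdp is modified later
df_gdp = df4.iloc[43:57].copy()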

11. We need to reset the index again (for merging):

df_gdp.reset_index(inplace=True, drop=True)

df_gdp

The output is as follows:

Figure 9.29: DataFrame from the India World Bank Information

12. The Year column in this DataFrame is not of the int type, so it will cause problems when merging with the education DataFrame:

df_gdp['Year']

The output is as follows:

Figure 9.30: DataFrame focusing on Year

13. Use the apply method with Python's built-in int function. Ignore any warnings that are thrown:

df_gdp['Year']=df_gdp['Year'].apply(int)
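
An equivalent, arguably more idiomatic, conversion uses astype or pd.to_numeric instead of apply. This is an optional alternative to the call above:

import pandas as pd

# Vectorized conversion of the whole column to integers
df_gdp['Year'] = df_gdp['Year'].astype(int)

# Or, if some values might not parse cleanly, coerce them to NaN first
df_gdp['Year'] = pd.to_numeric(df_gdp['Year'], errors='coerce').astype('Int64')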

SOLUTION OF ACTIVITY 14: DATA WRANGLING TASK – MERGING UN DATA AND GDP DATA

These are the steps to complete this activity:

1. Now, merge the two DataFrames, that is, primary_enrollment_india and df_gdp, on the Year column:

primary_enrollment_with_gdp=primary_enrollment_india.merge(df_gdp, on='Year')

primary_enrollment_with_gdp

The output is as follows:

Figure 9.31: Merged data
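
By default, merge performs an inner join, so only the years that appear in both DataFrames survive. If you wanted to keep every year from the enrollment data even when no GDP figure is available, a left join would do that; this is an optional variation, not used in the rest of this solution:

# Keep all enrollment years; GDP becomes NaN where no matching year exists
primary_enrollment_with_gdp = primary_enrollment_india.merge(df_gdp, on='Year', how='left')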


2. Now, we can drop the Data, Footnotes, and Region/Country/Area columns:

primary_enrollment_with_gdp.drop(['Data','Footnotes','Region/Country/Area'], axis=1, inplace=True)

primary_enrollment_with_gdp

The output is as follows:

Figure 9.32: Merged data after dropping the Data, Footnotes, and Region/Country/Area columns

3. Rearrange the columns for proper viewing and presentation to a data scientist:

primary_enrollment_with_gdp = primary_enrollment_with_gdp[['Year','Enrollments (Thousands)','GDP']]

primary_enrollment_with_gdp

The output is as follows:

Figure 9.33: Merged data after rearranging the columns

4. Plot the data:

plt.figure(figsize=(8,5))

plt.title("India's GDP per capita vs primary education enrollment", fontsize=16)

plt.scatter(primary_enrollment_with_gdp['GDP'],
            primary_enrollment_with_gdp['Enrollments (Thousands)'],
            edgecolor='k', color='orange', s=200)

plt.xlabel("GDP per capita (US $)", fontsize=15)

plt.ylabel("Primary enrollment (thousands)", fontsize=15)

plt.xticks(fontsize=14)

plt.yticks(fontsize=14)

plt.grid(True)

plt.show()

The output is as follows:

Figure 9.34: Scatter plot of merged data

ACTIVITY 15: DATA WRANGLING TASK – CONNECTING THE NEW DATA TO A DATABASE

These are the steps to complete this activity:

1. Connect to a database and write values to it. We start by importing the sqlite3 module of Python and then use the connect function to connect to a database. Designate Year as the PRIMARY KEY of this table:

import sqlite3

with sqlite3.connect("Education_GDP.db") as conn:
    cursor = conn.cursor()
    cursor.execute("CREATE TABLE IF NOT EXISTS \
                    education_gdp(Year INT, Enrollment FLOAT, GDP FLOAT, PRIMARY KEY (Year))")
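
If you want to confirm that the table was actually created, a quick query against SQLite's internal sqlite_master catalog lists the tables in the file (an optional check, not part of the original steps):

import sqlite3

with sqlite3.connect("Education_GDP.db") as conn:
    cursor = conn.cursor()
    # sqlite_master holds one row per table, index, and view in the database
    cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
    print(cursor.fetchall())   # should include ('education_gdp',)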

2. Run a loop over the dataset rows one by one to insert them into the table:

with sqlite3.connect("Education_GDP.db") as conn:
    cursor = conn.cursor()
    for i in range(14):
        year = int(primary_enrollment_with_gdp.iloc[i]['Year'])
        enrollment = primary_enrollment_with_gdp.iloc[i]['Enrollments (Thousands)']
        gdp = primary_enrollment_with_gdp.iloc[i]['GDP']
        #print(year,enrollment,gdp)
        cursor.execute("INSERT INTO education_gdp (Year,Enrollment,GDP) VALUES(?,?,?)",
                       (year,enrollment,gdp))

If we look at the current folder, we should see a file called Education_GDP.db, and if we examine it using a database viewer program, we can see that the data has been transferred there. An optional bulk-insert alternative is sketched below.
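
The row-by-row loop can also be replaced by a single bulk write using pandas' to_sql method. This is only an optional sketch: note that if_exists='replace' drops and recreates the table, so the PRIMARY KEY constraint defined in step 1 is lost.

# Bulk-write the DataFrame in one call instead of looping row by row
with sqlite3.connect("Education_GDP.db") as conn:
    primary_enrollment_with_gdp.rename(
        columns={'Enrollments (Thousands)': 'Enrollment'}
    ).to_sql('education_gdp', conn, if_exists='replace', index=False)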

In these activities, we have examined a complete data wrangling flow, including reading data from the web and a local drive, filtering, cleaning, quick visualization, imputation, indexing, merging, and writing back to a database table. We also wrote custom functions to transform some of the data and saw how to handle situations where we may get errors upon reading the file.
