
Contents

1. Preface
   1. About the Book
      1. About the Authors
   2. Learning Objectives
   3. Approach
   4. Audience
   5. Minimum Hardware Requirements
   6. Software Requirements
   7. Conventions
   8. Installation and Setup
   9. Installing the Code Bundle
   10. Additional Resources
2. Chapter 1
3. Introduction to Data Wrangling with Python
   1. Introduction
      1. Importance of Data Wrangling
   2. Python for Data Wrangling
   3. Lists, Sets, Strings, Tuples, and Dictionaries
      1. Lists
      2. Exercise 1: Accessing the List Members
      3. Exercise 2: Generating a List
      4. Exercise 3: Iterating over a List and Checking Membership
      5. Exercise 4: Sorting a List
      6. Exercise 5: Generating a Random List
      7. Activity 1: Handling Lists
      8. Sets
      9. Introduction to Sets
      10. Union and Intersection of Sets
      11. Creating Null Sets
      12. Dictionary
      13. Exercise 6: Accessing and Setting Values in a Dictionary
      14. Exercise 7: Iterating Over a Dictionary
      15. Exercise 8: Revisiting the Unique Valued List Problem
      16. Exercise 9: Deleting Value from Dict
      17. Exercise 10: Dictionary Comprehension
      18. Tuples
      19. Creating a Tuple with Different Cardinalities
      20. Unpacking a Tuple
      21. Exercise 11: Handling Tuples
      22. Strings
      23. Exercise 12: Accessing Strings
      24. Exercise 13: String Slices
      25. String Functions
      26. Exercise 14: Split and Join
      27. Activity 2: Analyze a Multiline String and Generate the Unique Word Count
   4. Summary
4. Chapter 2
5. Advanced Data Structures and File Handling
   1. Introduction
   2. Advanced Data Structures
      1. Iterator
      2. Exercise 15: Introduction to the Iterator
      3. Stacks
      4. Exercise 16: Implementing a Stack in Python
      5. Exercise 17: Implementing a Stack Using User-Defined Methods
      6. Exercise 18: Lambda Expression
      7. Exercise 19: Lambda Expression for Sorting
      8. Exercise 20: Multi-Element Membership Checking
      9. Queue
      10. Exercise 21: Implementing a Queue in Python
      11. Activity 3: Permutation, Iterator, Lambda, List
   3. Basic File Operations in Python
      1. Exercise 22: File Operations
      2. File Handling
      3. Exercise 23: Opening and Closing a File
      4. The with Statement
      5. Opening a File Using the with Statement
      6. Exercise 24: Reading a File Line by Line
      7. Exercise 25: Write to a File
      8. Activity 4: Design Your Own CSV Parser
   4. Summary
6. Chapter 3
7. Introduction to NumPy, Pandas, and Matplotlib
   1. Introduction
   2. NumPy Arrays
      1. NumPy Array and Features
      2. Exercise 26: Creating a NumPy Array (from a List)
      3. Exercise 27: Adding Two NumPy Arrays
      4. Exercise 28: Mathematical Operations on NumPy Arrays
      5. Exercise 29: Advanced Mathematical Operations on NumPy Arrays
      6. Exercise 30: Generating Arrays Using arange and linspace
      7. Exercise 31: Creating Multi-Dimensional Arrays
      8. Exercise 32: The Dimension, Shape, Size, and Data Type of the Two-dimensional Array
      9. Exercise 33: Zeros, Ones, Random, Identity Matrices, and Vectors
      10. Exercise 34: Reshaping, Ravel, Min, Max, and Sorting
      11. Exercise 35: Indexing and Slicing
      12. Conditional Subsetting
      13. Exercise 36: Array Operations (array-array, array-scalar, and universal functions)
      14. Stacking Arrays
   3. Pandas DataFrames
      1. Exercise 37: Creating a Pandas Series
      2. Exercise 38: Pandas Series and Data Handling
      3. Exercise 39: Creating Pandas DataFrames
      4. Exercise 40: Viewing a DataFrame Partially
      5. Indexing and Slicing Columns
      6. Indexing and Slicing Rows
      7. Exercise 41: Creating and Deleting a New Column or Row
   4. Statistics and Visualization with NumPy and Pandas
      1. Refresher of Basic Descriptive Statistics (and the Matplotlib Library for Visualization)
      2. Exercise 42: Introduction to Matplotlib Through a Scatter Plot
      3. Definition of Statistical Measures – Central Tendency and Spread
      4. Random Variables and Probability Distribution
      5. What Is a Probability Distribution?
      6. Discrete Distributions
      7. Continuous Distributions
      8. Data Wrangling in Statistics and Visualization
      9. Using NumPy and Pandas to Calculate Basic Descriptive Statistics on the DataFrame
      10. Random Number Generation Using NumPy
      11. Exercise 43: Generating Random Numbers from a Uniform Distribution
      12. Exercise 44: Generating Random Numbers from a Binomial Distribution and Bar Plot
      13. Exercise 45: Generating Random Numbers from Normal Distribution and Histograms
      14. Exercise 46: Calculation of Descriptive Statistics from a DataFrame
      15. Exercise 47: Built-in Plotting Utilities
      16. Activity 5: Generating Statistics from a CSV File
   5. Summary
8. Chapter 4
9. A Deep Dive into Data Wrangling with Python
   1. Introduction
   2. Subsetting, Filtering, and Grouping
      1. Exercise 48: Loading and Examining a Superstore's Sales Data from an Excel File
      2. Subsetting the DataFrame
      3. An Example Use Case: Determining Statistics on Sales and Profit
      4. Exercise 49: The unique Function
      5. Conditional Selection and Boolean Filtering
      6. Exercise 50: Setting and Resetting the Index
      7. Exercise 51: The GroupBy Method
   3. Detecting Outliers and Handling Missing Values
      1. Missing Values in Pandas
      2. Exercise 52: Filling in the Missing Values with fillna
      3. Exercise 53: Dropping Missing Values with dropna
      4. Outlier Detection Using a Simple Statistical Test
   4. Concatenating, Merging, and Joining
      1. Exercise 54: Concatenation
      2. Exercise 55: Merging by a Common Key
      3. Exercise 56: The join Method
   5. Useful Methods of Pandas
      1. Exercise 57: Randomized Sampling
      2. The value_counts Method
      3. Pivot Table Functionality
      4. Exercise 58: Sorting by Column Values – the sort_values Method
      5. Exercise 59: Flexibility for User-Defined Functions with the apply Method
      6. Activity 6: Working with the Adult Income Dataset (UCI)
   6. Summary
10. Chapter 5
11. Getting Comfortable with Different Kinds of Data Sources
   1. Introduction
   2. Reading Data from Different Text-Based (and Non-Text-Based) Sources
      1. Data Files Provided with This Chapter
      2. Libraries to Install for This Chapter
      3. Exercise 60: Reading Data from a CSV File Where Headers Are Missing
      4. Exercise 61: Reading from a CSV File where Delimiters are not Commas
      5. Exercise 62: Bypassing the Headers of a CSV File
      6. Exercise 63: Skipping Initial Rows and Footers when Reading a CSV File
      7. Reading Only the First N Rows (Especially Useful for Large Files)
      8. Exercise 64: Combining Skiprows and Nrows to Read Data in Small Chunks
      9. Setting the skip_blank_lines Option
      10. Read CSV from a Zip file
      11. Reading from an Excel File Using sheet_name and Handling a Distinct sheet_name
      12. Exercise 65: Reading a General Delimited Text File
      13. Reading HTML Tables Directly from a URL
      14. Exercise 66: Further Wrangling to Get the Desired Data
      15. Exercise 67: Reading from a JSON File
      16. Reading a Stata File
      17. Exercise 68: Reading Tabular Data from a PDF File
   3. Introduction to Beautiful Soup 4 and Web Page Parsing
      1. Structure of HTML
      2. Exercise 69: Reading an HTML file and Extracting its Contents Using BeautifulSoup
      3. Exercise 70: DataFrames and BeautifulSoup
      4. Exercise 71: Exporting a DataFrame as an Excel File
      5. Exercise 72: Stacking URLs from a Document using bs4
      6. Activity 7: Reading Tabular Data from a Web Page and Creating DataFrames
   4. Summary
12. Chapter 6
13. Learning the Hidden Secrets of Data Wrangling
   1. Introduction
      1. Additional Software Required for This Section
   2. Advanced List Comprehension and the zip Function
      1. Introduction to Generator Expressions
      2. Exercise 73: Generator Expressions
      3. Exercise 74: One-Liner Generator Expression
      4. Exercise 75: Extracting a List with Single Words
      5. Exercise 76: The zip Function
      6. Exercise 77: Handling Messy Data
   3. Data Formatting
      1. The % operator
      2. Using the format Function
      3. Exercise 78: Data Representation Using {}
   4. Identify and Clean Outliers
      1. Exercise 79: Outliers in Numerical Data
      2. Z-score
      3. Exercise 80: The Z-Score Value to Remove Outliers
      4. Exercise 81: Fuzzy Matching of Strings
   5. Activity 8: Handling Outliers and Missing Data
   6. Summary
14. Chapter 7
15. Advanced Web Scraping and Data Gathering
   1. Introduction
   2. The Basics of Web Scraping and the Beautiful Soup Library
      1. Libraries in Python
      2. Exercise 81: Using the Requests Library to Get a Response from the Wikipedia Home Page
      3. Exercise 82: Checking the Status of the Web Request
      4. Checking the Encoding of the Web Page
      5. Exercise 83: Creating a Function to Decode the Contents of the Response and Check its Length
      6. Exercise 84: Extracting Human-Readable Text From a BeautifulSoup Object
      7. Extracting Text from a Section
      8. Extracting Important Historical Events that Happened on Today's Date
      9. Exercise 85: Using Advanced BS4 Techniques to Extract Relevant Text
      10. Exercise 86: Creating a Compact Function to Extract the "On this Day" Text from the Wikipedia Home Page
   3. Reading Data from XML
      1. Exercise 87: Creating an XML File and Reading XML Element Objects
      2. Exercise 88: Finding Various Elements of Data within a Tree (Element)
      3. Reading from a Local XML File into an ElementTree Object
      4. Exercise 89: Traversing the Tree, Finding the Root, and Exploring all Child Nodes and their Tags and Attributes
      5. Exercise 90: Using the text Method to Extract Meaningful Data
      6. Extracting and Printing the GDP/Per Capita Information Using a Loop
      7. Exercise 91: Finding All the Neighboring Countries for each Country and Printing Them
      8. Exercise 92: A Simple Demo of Using XML Data Obtained by Web Scraping
   4. Reading Data from an API
      1. Defining the Base URL (or API Endpoint)
      2. Exercise 93: Defining and Testing a Function to Pull Country Data from an API
      3. Using the Built-In JSON Library to Read and Examine Data
      4. Printing All the Data Elements
      5. Using a Function that Extracts a DataFrame Containing Key Information
      6. Exercise 94: Testing the Function by Building a Small Database of Countries' Information
   5. Fundamentals of Regular Expressions (RegEx)
      1. Regex in the Context of Web Scraping
      2. Exercise 95: Using the match Method to Check Whether a Pattern matches a String/Sequence
      3. Using the Compile Method to Create a Regex Program
      4. Exercise 96: Compiling Programs to Match Objects
      5. Exercise 97: Using Additional Parameters in Match to Check for Positional Matching
      6. Finding the Number of Words in a List That End with "ing"
      7. Exercise 98: The search Method in Regex
      8. Exercise 99: Using the span Method of the Match Object to Locate the Position of the Matched Pattern
      9. Exercise 100: Examples of Single Character Pattern Matching with search
      10. Exercise 101: Examples of Pattern Matching at the Start or End of a String
      11. Exercise 102: Examples of Pattern Matching with Multiple Characters
      12. Exercise 103: Greedy versus Non-Greedy Matching
      13. Exercise 104: Controlling Repetitions to Match
      14. Exercise 105: Sets of Matching Characters
      15. Exercise 106: The use of OR in Regex using the OR Operator
      16. The findall Method
      17. Activity 9: Extracting the Top 100 eBooks from Gutenberg
      18. Activity 10: Building Your Own Movie Database by Reading an API
   6. Summary
16. Chapter 8
17. RDBMS and SQL
   1. Introduction
   2. Refresher of RDBMS and SQL
      1. How is an RDBMS Structured?
      2. SQL
   3. Using an RDBMS (MySQL/PostgreSQL/SQLite)
      1. Exercise 107: Connecting to Database in SQLite
      2. Exercise 108: DDL and DML Commands in SQLite
      3. Reading Data from a Database in SQLite
      4. Exercise 109: Sorting Values that are Present in the Database
      5. Exercise 110: Altering the Structure of a Table and Updating the New Fields
      6. Exercise 111: Grouping Values in Tables
      7. Relation Mapping in Databases
      8. Adding Rows in the comments Table
      9. Joins
      10. Retrieving Specific Columns from a JOIN query
      11. Exercise 112: Deleting Rows
      12. Updating Specific Values in a Table
      13. Exercise 113: RDBMS and DataFrames
      14. Activity 11: Retrieving Data Correctly From Databases
   4. Summary
18. Chapter 9
19. Application of Data Wrangling in Real Life
   1. Introduction
   2. Applying Your Knowledge to a Real-life Data Wrangling Task
   3. Activity 12: Data Wrangling Task – Fixing UN Data
   4. Activity 13: Data Wrangling Task – Cleaning GDP Data
   5. Activity 14: Data Wrangling Task – Merging UN Data and GDP Data
   6. Activity 15: Data Wrangling Task – Connecting the New Data to the Database
   7. An Extension to Data Wrangling
      1. Additional Skills Required to Become a Data Scientist
      2. Basic Familiarity with Big Data and Cloud Technologies
      3. What Goes with Data Wrangling?
      4. Tips and Tricks for Mastering Machine Learning
   8. Summary
20. Appendix
   1. Solution of Activity 1: Handling Lists
      1. Solution of Activity 2: Analyze a Multiline String and Generate the Unique Word Count
      2. Solution of Activity 3: Permutation, Iterator, Lambda, List
      3. Solution of Activity 4: Design Your Own CSV Parser
      4. Solution of Activity 5: Generating Statistics from a CSV File
      5. Solution of Activity 6: Working with the Adult Income Dataset (UCI)
      6. Solution of Activity 7: Reading Tabular Data from a Web Page and Creating DataFrames
      7. Solution of Activity 8: Handling Outliers and Missing Data
      8. Solution of Activity 9: Extracting the Top 100 eBooks from Gutenberg
      9. Solution of Activity 10: Extracting the top 100 eBooks from Gutenberg.org
      10. Solution of Activity 11: Retrieving Data Correctly from Databases
      11. Solution of Activity 12: Data Wrangling Task – Fixing UN Data
      12. Activity 13: Data Wrangling Task – Cleaning GDP Data
      13. Solution of Activity 14: Data Wrangling Task – Merging UN Data and GDP Data
      14. Activity 15: Data Wrangling Task – Connecting the New Data to a Database
DATA WRANGLING WITH PYTHON

Copyright © 2019 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Authors: Dr. Tirthajyoti Sarkar and Shubhadeep Roychowdhury

Managing Editor: Steffi Monteiro

Acquisitions Editor: Kunal Sawant

Production Editor: Nitesh Thakur

Editorial Board: David Barnes, Ewan Buckingham, Shivangi Chatterji, Simon Cox, Manasa Kumar, Alex Mazonowicz, Douglas Paterson, Dominic Pereira, Shiny Poojary, Saman Siddiqui, Erol Staveley, Ankita Thakur, and Mohita Vyas.

First Published: February 2019

Production Reference: 1280219

ISBN: 978-1-78980-011-1

Published by Packt Publishing Ltd.

Livery Place, 35 Livery Street

Birmingham B3 2PB, UK

Preface
About
This section briefly introduces the authors, the coverage of this book, the technical skills you'll need to get started, and the hardware and software requirements needed to complete all of the included activities and exercises.

About the Book


For data to be useful and meaningful, it must be curated and refined. Data Wrangling with Python teaches you all the core ideas behind these processes and equips you with knowledge about the most popular tools and techniques in the domain.

The book starts with the absolute basics of Python, focusing mainly on data structures, and then quickly jumps into the NumPy and pandas libraries as the fundamental tools for data wrangling. We emphasize why you should stay away from the traditional way of data cleaning, as done in other languages, and take advantage of the specialized pre-built routines in Python. Thereafter, you will learn how, using the same Python backend, you can extract and transform data from a diverse array of sources, such as the internet, large database vaults, or Excel financial tables. You will also learn how to handle missing or incorrect data, and reformat it based on the requirements of the downstream analytics tool. You will learn about these concepts through real-world examples and datasets.

By the end of this book, you will be confident enough to handle a myriad of sources to extract, clean, transform, and format your data efficiently.

ABOUT THE AUTHORS


Dr. Tirthajyoti Sarkar works as a senior principal engineer in the semiconductor technology domain, where he applies cutting-edge data science/machine learning techniques to design automation and predictive analytics. He writes regularly about Python programming and data science topics. He holds a Ph.D. from the University of Illinois, and certifications in artificial intelligence and machine learning from Stanford and MIT.

Shubhadeep Roychowdhury works as a senior software engineer at a Paris-based cybersecurity start-up, where he is applying state-of-the-art computer vision and data engineering algorithms and tools to develop cutting-edge products. He often writes about algorithm implementation in Python and similar topics. He holds a master's degree in computer science from West Bengal University of Technology and certifications in machine learning from Stanford.

LEARNING OBJECTIVES
Use and manipulate complex and simple data structures

Harness the full potential of DataFrames and numpy.array at run time

Perform web scraping with BeautifulSoup4 and html5lib

Execute advanced string search and manipulation with RegEx

Handle outliers and perform data imputation with Pandas

Use descriptive statistics and plotting techniques

Practice data wrangling and modeling using data generation techniques

APPROACH
Data Wrangling with Python takes a practical approach to equip beginners with the most essential data analysis tools in the shortest possible time. It contains multiple activities that use real-life business scenarios for you to practice and apply your new skills in a highly relevant context.

AUDIENCE
Data Wrangling with Python is designed for developers, data analysts, and business analysts who are keen to pursue a career as a full-fledged data scientist or analytics expert. Although this book is for beginners, prior working knowledge of Python is necessary to easily grasp the concepts covered here. It will also help to have rudimentary knowledge of relational databases and SQL.

MINIMUM HARDWARE REQUIREMENTS
For the optimal student experience, we recommend the following hardware configuration:

Processor: Intel Core i5 or equivalent

Memory: 8 GB RAM

Storage: 35 GB available space

SOFTWARE REQUIREMENTS
You'll also need the following software installed in advance:

OS: Windows 7 SP1 64-bit, Windows 8.1 64-bit or Windows 10 64-bit, Ubuntu Linux, or the latest version of macOS

Processor: Intel Core i5 or equivalent

Memory: 4 GB RAM (8 GB preferred)

Storage: 35 GB available space

CONVENTIONS
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "This will return the value associated with it - ["list_element1", 34]"

A block of code is set as follows:

list_1 = []
for x in range(0, 10):
    list_1.append(x)
list_1

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "Click on New and choose Python 3."

INSTALLATION AND SETUP


Each great journey begins with a humble step. Our upcoming adventure in the land of data wrangling is no exception. Before we can do awesome things with data, we need to be prepared with the most productive environment. In this short section, we shall see how to do that.

The only prerequisite regarding the environment for this book is to have Docker installed. If you have never heard of Docker or you have only a very faint idea what it is, then fear not. All you need to know about Docker for the purpose of this book is this: Docker is a lightweight containerization engine that runs on all three major platforms (Linux, Windows, and macOS). The main idea behind Docker is to give you safe, easy, and lightweight virtualization on top of your native OS.

Install Docker

1. To install Docker on a Mac or Windows machine, create an account on Docker and download the latest version. It's easy to install and set up.

2. Once you have set up Docker, open a shell (or Terminal if you are a Mac user) and type the following command to verify that the installation has been successful:

docker version

If the output shows you the server and client version of Docker, then you are all set up.

Pull the image

1. Pull the image and you will have all the necessary packages (including Python 3.6.6) installed and ready for you to start working. Type the following command in a shell:

docker pull rcshubhadeep/packt-data-wrangling-base

2. If you want to know the full list of all the packages and their versions included in this image, you can check out the requirements.txt file in the setup folder of the source code repository of this book. Once the image is there, you are ready to roll. Downloading it may take time, depending on your connection speed.

Run the environment

1. Run the image using the following command:

docker run -p 8888:8888 -v 'pwd':/notebooks -it rcshubhadeep/packt-data-wrangling-base

This will give you a ready-to-use environment.

2. Open a browser tab in Chrome or Firefox and go to http://localhost:8888. You will be prompted to enter a token. The token is dw_4_all.

3. Before you run the image, create a new folder and navigate there from the shell using the cd command.

Once you have created a notebook, save it as an .ipynb file. You can use Ctrl + C to stop running the image.

Introduction to Jupyter notebook

Project Jupyter is open source, free software that gives you the ability to run code, written in Python and some other languages, interactively from a special notebook, similar to a browser interface. It was born in 2014 from the IPython project and has since become the default choice for the entire data science workforce.

1. Once you are running the Jupyter server, click on New and choose Python 3. A new browser tab will open with a new and empty notebook. Rename the Jupyter file:

Figure 0.1: Jupyter server interface

The main building blocks of Jupyter notebooks are cells. There are two types of cells: In (short for input) and Out (short for output). You can write code, normal text, and Markdown in In cells, press Shift + Enter (or Shift + Return), and the code written in that particular In cell will be executed. The result will be shown in an Out cell, and you will land in a new In cell, ready for the next block of code. Once you get used to this interface, you will slowly discover the power and flexibility it offers.

2. One final thing you should know about Jupyter cells is that when you start a new cell, by default, it is assumed that you will write code in it. However, if you want to write text, then you have to change the type. You can do that using the following sequence of keys: Escape -> m -> Enter:

Figure 0.2: Jupyter notebook

3. And when you are done with writing the text, execute it using Shift + Enter. Unlike the code cells, the result of the compiled Markdown will be shown in the same place as the "In" cell.

Note

To have a "cheat sheet" of all the handy key shortcuts in Jupyter, you can bookmark this Gist: https://gist.github.com/kidpixo/f4318f8c8143adee5b40. With this basic introduction and the image ready to be used, we are ready to embark on the exciting and enlightening journey that awaits us!

INSTALLING THE CODE BUNDLE
Copy the code bundle for the class to the C:/Code folder.

ADDITIONAL RESOURCES
The code bundle for this book is also hosted on GitHub at https://github.com/TrainingByPackt/Data-Wrangling-with-Python.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Chapter 1
Introduction to Data Wrangling with Python
Learning Objectives
By the end of this chapter, you will be able to do the following:

Define the importance of data wrangling in data science

Manipulate the data structures that are available in Python

Compare the different implementations of the inbuilt Python data structures

This chapter describes the importance of data wrangling, identifies the important tasks to be performed in data wrangling, and introduces basic Python data structures.

Introduction
Data science and analytics are taking over the whole world and the job of a data scientist is routinely being called the coolest job of the 21st century. But for all the emphasis on data, it is the science that makes you – the practitioner – truly valuable.

To practice high-quality science with data, you need to make sure it is properly sourced, cleaned, formatted, and pre-processed. This book teaches you the most essential basics of this invaluable component of the data science pipeline: data wrangling. In short, data wrangling is the process that ensures that the data is in a format that is clean, accurate, formatted, and ready to be used for data analysis.

A prominent example of data wrangling with a large amount of data is the one conducted at the Supercomputer Center of the University of California San Diego (UCSD). The problem in California is that wildfires are very common, mainly because of the dry weather and extreme heat, especially during the summers. Data scientists at the UCSD Supercomputer Center gather data to predict the nature and spread direction of the fire. The data that comes from diverse sources such as weather stations, sensors in the forest, fire stations, satellite imagery, and Twitter feeds might still be incomplete or missing. This data needs to be cleaned and formatted so that it can be used to predict future occurrences of wildfires.

This is an example of how data wrangling and data science can prove to be helpful and relevant.

IMPORTANCE OF DATA WRANGLING
Oil does not come in its final form from the rig; it has to be refined. Similarly, data must be curated, massaged, and refined to be used in intelligent algorithms and consumer products. This is known as wrangling. Most data scientists spend the majority of their time data wrangling.

Data wrangling is generally done at the very first stage of a data science/analytics pipeline. After the data scientists identify useful data sources for solving the business problem (for instance, in-house database storage, the internet, or streaming sensor data), they then proceed to extract, clean, and format the necessary data from those sources.

Generally, the task of data wrangling involves the following steps:

Scraping raw data from multiple sources (including web and database tables)

Imputing, formatting, and transforming – basically making it ready to be used in the modeling process (such as advanced machine learning)

Handling read/write errors

Detecting outliers

Performing quick visualizations (plotting) and basic statistical analysis to judge the quality of your formatted data

This is an illustrative representation of the positioning and essential functional role of data wrangling in a typical data science pipeline:

Figure 1.1: Process of data wrangling

The process of data wrangling includes first finding the appropriate data that's necessary for the analysis. This data can be from one or multiple sources, such as tweets, bank transaction statements in a relational database, sensor data, and so on. This data needs to be cleaned. If there is missing data, we will either delete or substitute it, with the help of several techniques. If there are outliers, we need to first detect them and then handle them appropriately. If data is from multiple sources, we will have to perform join operations to combine it.

In an extremely rare situation, data wrangling may not be needed. For example, if the data that's necessary for a machine learning task is already stored in an acceptable format in an in-house database, then a simple SQL query may be enough to extract the data into a table, ready to be passed on to the modeling stage.

Python for Data Wrangling
There is always a debate on whether to perform the wrangling process using an enterprise tool or by using a programming language and associated frameworks. There are many commercial, enterprise-level tools for data formatting and pre-processing that do not involve much coding on the part of the user. Examples include the following:

General-purpose data analysis platforms such as Microsoft Excel (with add-ins)

Statistical discovery packages such as JMP (from SAS)

Modeling platforms such as RapidMiner

Analytics platforms from niche players focusing on data wrangling, such as Trifacta, Paxata, and Alteryx

However, programming languages such as Python provide more flexibility, control, and power compared to these off-the-shelf tools.

As the volume, velocity, and variety (the three Vs of big data) of data undergo rapid changes, it is always a good idea to develop and nurture a significant amount of in-house expertise in data wrangling using fundamental programming frameworks, so that an organization is not beholden to the whims and fancies of any enterprise platform for as basic a task as data wrangling:

Figure 1.2: Google trend worldwide over the last five years

A few of the obvious advantages of using an open source, free programming paradigm such as Python for data wrangling are the following:

A general-purpose open source paradigm that puts no restriction on any of the methods you can develop for the specific problem at hand

A great ecosystem of fast, optimized, open source libraries focused on data analytics

Growing support to connect Python to every conceivable data source type

An easy interface to basic statistical testing and quick visualization libraries to check data quality

A seamless interface of the data wrangling output with advanced machine learning models

Python is the most popular language of choice for machine learning and artificial intelligence these days.
Lists, Sets, Strings, Tuples, and Dictionaries
Now that we have learned about the importance of Python, we will start by exploring various basic data structures in Python. We will learn techniques to handle data. This is invaluable for a data practitioner.

We can start a new Jupyter server by typing the following command into the Command Prompt window:

docker run -p 8888:8888 -v 'pwd':/notebooks -it rcshubhadeep/packt-data-wrangling-base:latest ipython

This will start a Jupyter server; you can visit it at http://localhost:8888 and use the passcode dw_4_all to access the main interface.

LISTS
Lists are fundamental Python data structures that have continuous memory locations, can host different data types, and can be accessed by index.

We will start with a list and list comprehension. We will generate a list of numbers, and then examine which ones among them are even. We will sort, reverse, and check for duplicates. We will also see how many different ways we can access the list elements, iterate over them, and check the membership of an element.

The following is an example of a simple list:

list_example = [51, 27, 34, 46, 90, 45, -19]

The following is also an example of a list:

list_example2 = [15, "Yellow car", True, 9.456, [12, "Hello"]]

As you can see, a list can contain any number of the allowed data types, such as int, float, string, and Boolean, and a list can also be a mix of different data types (including nested lists).

If you are coming from a strongly typed language, such as C, C++, or Java, then this will probably seem strange, as you are not allowed to mix different kinds of data types in a single array in those languages. Lists are somewhat like arrays, in the sense that they are both based on continuous memory locations and can be accessed using indexes. But the power of Python lists comes from the fact that they can host different data types and you are allowed to manipulate the data.

Note
Be careful, though, as the very power of lists, and the fact that you can mix different data types in a single list, can actually create subtle bugs that can be very difficult to track.

EXERCISE 1: ACCESSING THE LIST MEMBERS
In the following exercise, we will be creating a list and then observing the different ways of accessing its elements:

1. Define a list called list_1 with four integer members, using the following command:

list_1 = [34, 12, 89, 1]

The indices will be automatically assigned, as follows:

Figure 1.3: List showing the forward and backward indices
2. Access the first element from list_1 using its forward index:

list_1[0] #34

3. Access the last element from list_1 using its forward index:

list_1[3] #1

4. Access the last element from list_1 using the len function:

list_1[len(list_1) - 1] #1

The len function in Python returns the length of the specified list.

5. Access the last element from list_1 using its backward index:

list_1[-1] #1

6. Access the second and third elements from list_1 by slicing with forward indices:

list_1[1:3] # [12, 89]

This is also called list slicing, as it returns a smaller list from the original list by extracting only a part of it. To slice a list, we need two integers. The first integer denotes the start of the slice and the second integer denotes the end; the last element included in the slice is the one at index end-1.

Note

Notice that slicing did not include the third index or the end element. This is how list slicing works.

7. Access the last two elements from list_1 by slicing:

list_1[-2:] # [89, 1]

8. Access the first two elements using backward indices:

list_1[:-2] # [34, 12]

When we leave one side of the colon (:) blank, we are basically telling Python either to go until the end or to start from the beginning of the list. It will automatically apply the rule of list slices that we just learned.

9. Reverse the elements in the list:

list_1[-1::-1] # [1, 89, 12, 34]

Note

The last bit of code is not very readable, meaning it is not obvious just by looking at it what it is doing. It is against Python's philosophy. So, although this kind of code may look clever, we should resist the temptation to write code like this. A couple of more readable alternatives are sketched below.
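For reference, here is a minimal sketch (an addition to the book's exercise, not part of it) of two more readable ways to produce the same reversed list:

list_1 = [34, 12, 89, 1]

# Slicing with a step of -1 is the usual idiom for a reversed copy
list_1[::-1] # [1, 89, 12, 34]

# The built-in reversed() returns an iterator; wrap it in list() to materialize it
list(reversed(list_1)) # [1, 89, 12, 34]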

EXERCISE 2: GENERATING A LIST
We are going to examine various ways of generating a list:

1. Create a list using the append method:

list_1 = []
for x in range(0, 10):
    list_1.append(x)
list_1

The output will be as follows:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Here, we started by declaring an empty list and then we used a for loop to append values to it. The append method is a method that's given to us by the Python list data type.

2. Generate a list using the following command:

list_2 = [x for x in range(0, 100)]
list_2

The partial output is as follows:

Figure 1.4: List comprehension

This is list comprehension, which is a very powerful tool that we need to master. The power of list comprehension comes from the fact that we can use conditionals inside the comprehension itself.

3. Use a while loop to iterate over a list, to understand the difference between a while loop and a for loop:

i = 0
while i < len(list_1):
    print(list_1[i])
    i += 1

The partial output will be as follows:

Figure 1.5: Output showing the contents of list_1 using a while loop

4. Create list_3 with numbers that are divisible by 5:

list_3 = [x for x in range(0, 100) if x % 5 == 0]
list_3

The output will be a list of numbers up to 100 in increments of 5:

[0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95]

5. Generate a list by adding two lists:

list_1 = [1, 4, 56, -1]
list_2 = [1, 39, 245, -23, 0, 45]
list_3 = list_1 + list_2
list_3

The output is as follows:

[1, 4, 56, -1, 1, 39, 245, -23, 0, 45]

6. Extend a list using the extend method:

list_1.extend(list_2)
list_1

The partial output is as follows:

Figure 1.6: Contents of list_1

The second operation changes the original list (list_1) and appends all the elements of list_2 to it. So, be careful when using it.
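To make this difference concrete, here is a minimal sketch (using small throwaway lists, not part of the original exercise) contrasting the two approaches:

list_a = [1, 2]
list_b = [3, 4]

list_a + list_b # creates a brand-new list: [1, 2, 3, 4]
print(list_a) # [1, 2] - the original list is untouched

list_a.extend(list_b) # modifies list_a in place and returns None
print(list_a) # [1, 2, 3, 4]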

EXERCISE 3: ITERATING OVER A LIST AND CHECKING MEMBERSHIP
We are going to iterate over a list and test whether a certain value exists in it:

1. Iterate over a list:

list_1 = [x for x in range(0, 100)]
for i in range(0, len(list_1)):
    print(list_1[i])

The output is as follows:

Figure 1.7: Section of list_1

2. However, this is not very Pythonic. Being Pythonic means following and conforming to a set of best practices and conventions that have been created over the years by thousands of very able developers, which in this case means using the in keyword, because Python does not have index initialization, bounds checking, or index incrementing, unlike traditional languages. The Pythonic way of iterating over a list is as follows:

for i in list_1:
    print(i)

The output is as follows:

Figure 1.8: A section of list_1

Notice that, in the second method, we no longer need a counter to access the list index; instead, the for ... in construct gives us each element directly. (If you also need the index while iterating, see the short enumerate sketch after this exercise.)

3. Check whether the integers 25 and -45 are in the list using the in operator:

25 in list_1

The output is True.

-45 in list_1

The output is False.
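As mentioned in step 2, if you ever need both the element and its position while looping, a Pythonic way is the built-in enumerate function. This short sketch is an addition to the book's exercise, not part of it:

list_1 = [x for x in range(0, 100)]

# enumerate yields (index, element) pairs, so no manual counter is needed
for index, value in enumerate(list_1):
    print(index, value)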

EXERCISE 4: SORTING A LIST
We generated a list called list_1 in the previous exercise. We are going to sort it now:

1. As the list was originally a list of numbers from 0 to 99, we will sort it in the reverse direction. To do that, we will use the sort method with reverse=True:

list_1.sort(reverse=True)
list_1

The partial output is as follows:

Figure 1.9: Section of output showing the reversed list

2. We can use the reverse method directly to achieve this result:

list_1.reverse()
list_1

The output is as follows:

Figure 1.10: Section of output after reversing the list

Note
The difference between the sort function and the reverse function is that we can use sort with custom sorting functions to do custom sorting, whereas we can only use reverse to reverse a list (a small sketch of custom sorting follows this note). Here also, both functions work in-place, so be aware of this while using them.
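As a small illustration of the custom sorting mentioned in the note (an added sketch, not from the book), the sort method accepts a key function that decides how elements are compared:

words = ["banana", "fig", "apple", "kiwi"]

# Sort in place by word length instead of alphabetically
words.sort(key=len)
print(words) # ['fig', 'kiwi', 'apple', 'banana']

# A lambda works too, for example sorting by the last character
words.sort(key=lambda w: w[-1])
print(words) # ['banana', 'apple', 'fig', 'kiwi']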

EXERCISE 5: GENERATING A RANDOM LIST
In this exercise, we will be generating a list with random numbers:

1. Import the random library:

import random

2. Use the randint function to generate random integers and add them to a list:

list_1 = [random.randint(0, 30) for x in range(0, 100)]

3. Print the list using print(list_1). Note that there will be duplicate values in list_1:

list_1

The sample output is as follows:

Figure 1.11: Section of the sample output for list_1

There are many ways to get a list of unique numbers, and while you may be able to write a few lines of code using a for loop and another list (you should actually try doing it; one possible version is sketched below), let's see how we can do this without a for loop and with a single line of code. This will bring us to the next data structure, sets.
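For reference, a minimal sketch of the for-loop approach mentioned above (an illustration, not the book's official solution) could look like this; the set-based one-liner in the next section achieves the same result far more concisely:

import random

list_1 = [random.randint(0, 30) for x in range(0, 100)]

# Build a second list, adding each value only if we have not seen it before
unique_values = []
for value in list_1:
    if value not in unique_values:
        unique_values.append(value)

print(unique_values)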

ACTIVITY 1: HANDLING LISTS
In this activity, we will generate a list of random numbers and then generate another list from the first one, which only contains numbers that are divisible by three. We will repeat the experiment three times. Then, we will calculate the average difference in length between the two lists.

Th ese ar e t h e st ep s f or c omp l et i ng t h i s ac t i v i t y :
1 . Cr eate a list of 1 00 r andom
nu m ber s.

2 . Cr eate a new list fr om th is


r andom list, w ith nu m ber s
th at ar e div isible by 3.

3 . Calcu late th e length of th ese


tw o lists and stor e th e
differ ence in a new v ar iable.

4 . Using a loop, per for m steps 2


and 3 and find th e differ ence
v ar iable th r ee tim es.

5. Find th e ar ith m etic m ean of


th ese th r ee differ ence v alu es.

Note

The solution for this activity


can be found on page 282.

SETS
A set, mathematically speaking, is just a collection of well-defined, distinct objects. Python gives us a straightforward way to deal with them using its set datatype.

INTRODUCTION TO SETS
With the last list that we generated, we are going to revisit the problem of getting rid of its duplicates. We can achieve that with the following line of code:

list_12 = list(set(list_1))

If we print this, we will see that it only contains unique numbers. We used the set data type to turn the first list into a set, thus getting rid of all the duplicate elements, and then we used the list function on it to turn it into a list from a set once more:

list_12

The output will be as follows:

Figure 1.12: Section of output for list_12

UNION AND INTERSECTION OF SETS
This is what a union between two sets looks like:

Figure 1.13: Venn diagram showing the union of two sets

This simply means take everything from both sets, but take the common elements only once.

We can create these two sets using the following code:

set1 = {"Apple", "Orange", "Banana"}

set2 = {"Pear", "Peach", "Mango", "Banana"}

To find the union of the two sets, the following instruction should be used:

set1 | set2

The output would be as follows:

{'Apple', 'Banana', 'Mango', 'Orange', 'Peach', 'Pear'}

Notice that the common element, Banana, appears only once in the resulting set. The common elements between two sets can be identified by obtaining the intersection of the two sets, as follows:

Figure 1.14: Venn diagram showing the intersection of two sets

We get the intersection of two sets in Python as follows:

set1 & set2

This will give us a set with only one element. The output is as follows:

{'Banana'}

Note
You can also calculate the difference between sets (also known as complements). To find out more, refer to this link: https://docs.python.org/3/tutorial/datastructures.html#sets.
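As a quick illustration of the difference operation mentioned in the preceding note (reusing set1 and set2 from above), the - operator returns the elements of one set that are not in the other:

set1 - set2

The output is as follows (the ordering of elements in a set display may vary):

{'Apple', 'Orange'}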

CREATING NULL SETS
You can create a null (empty) set by creating a set containing no elements. You can do this by using the following code:

null_set_1 = set({})

null_set_1

The output is as follows:

set()

Note, however, that {} on its own creates a dictionary, not a set:

null_set_2 = {}

null_set_2

The output is as follows:

{}

We are going to learn about dictionaries in detail in the next topic.

DICTIONARY
A dictionary is like a list in that it is a collection of several elements. However, a dictionary is a collection of key-value pairs, where the key can be anything that can be hashed. Generally, we use numbers or strings as keys.

To create a dictionary, use the following code:

dict_1 = {"key1": "value1", "key2": "value2"}

dict_1

The output is as follows:

{'key1': 'value1', 'key2': 'value2'}

This is also a valid dictionary:

dict_2 = {"key1": 1, "key2": ["list_element1", 34], "key3": "value3", "key4": {"subkey1": "v1"}, "key5": 4.5}

dict_2

The output is as follows:

{'key1': 1,
 'key2': ['list_element1', 34],
 'key3': 'value3',
 'key4': {'subkey1': 'v1'},
 'key5': 4.5}

The keys must be unique in a dictionary.

EXERCISE 6: ACCESSING AND SETTING VALUES IN A DICTIONARY
In this exercise, we are going to access and set values in a dictionary:

1. Access a particular key in a dictionary:

dict_2["key2"]

This will return the value associated with it, as follows:

['list_element1', 34]

2. Assign a new value to the key:

dict_2["key2"] = "My new value"

3. Define a blank dictionary and then use the key notation to assign values to it:

dict_3 = {}  # Not a null set. It is a dict

dict_3["key1"] = "Value1"

dict_3

The output is as follows:

{'key1': 'Value1'}

EXERCISE 7: ITERATING OVER A DICTIONARY
In this exercise, we are going to iterate over a dictionary:

1. Create dict_1:

dict_1 = {"key1": 1, "key2": ["list_element1", 34], "key3": "value3", "key4": {"subkey1": "v1"}, "key5": 4.5}

2. Use the looping variables k and v:

for k, v in dict_1.items():
    print("{} - {}".format(k, v))

The output is as follows:

key1 - 1

key2 - ['list_element1', 34]

key3 - value3

key4 - {'subkey1': 'v1'}

key5 - 4.5

Note

Notice the difference between how we did the iteration on the list and how we are doing it here.

EXERCISE 8: REVISITING THE UNIQUE VALUED LIST PROBLEM
We will use the fact that dictionary keys cannot be duplicated to generate the unique valued list:

1. First, generate a random list with duplicate values:

list_1 = [random.randint(0, 30) for x in range(0, 100)]

2. Create a unique valued list from list_1:

list(dict.fromkeys(list_1).keys())

The sample output is as follows:

Figure 1.15: Output showing the unique valued list

Here, we have used two useful functions on the dict data type in Python, fromkeys and keys. fromkeys creates a dict where the keys come from the iterable (in this case, a list) and the values default to None, and keys gives us the keys of a dict.

EXERCISE 9: DELETING VALUE FROM DICT
In this exercise, we are going to delete a value from a dict:

1. Create dict_1 with five key-value pairs:

dict_1 = {"key1": 1, "key2": ["list_element1", 34], "key3": "value3", "key4": {"subkey1": "v1"}, "key5": 4.5}

dict_1

The output is as follows:

{'key1': 1,
 'key2': ['list_element1', 34],
 'key3': 'value3',
 'key4': {'subkey1': 'v1'},
 'key5': 4.5}

2. We will use the del function and specify the element:

del dict_1["key2"]

dict_1 now contains the following:

{'key1': 1,
 'key3': 'value3',
 'key4': {'subkey1': 'v1'},
 'key5': 4.5}

Note

The del operator can be used to delete a specific index from a list as well.

EXERCISE 10: DICTIONARY COMPREHENSION
In this final exercise on dicts, we will go over a less commonly used comprehension than the list one: dictionary comprehension. We will also examine two other ways to create a dict, which will be useful in the future.

A dictionary comprehension works exactly the same way as the list one, but we need to specify both the keys and the values:

1. Generate a dict that has 0 to 9 as the keys and the square of each key as the values:

list_1 = [x for x in range(0, 10)]

dict_1 = {x: x**2 for x in list_1}

dict_1

The output is as follows:

{0: 0, 1: 1, 2: 4, 3: 9, 4: 16, 5: 25, 6: 36, 7: 49, 8: 64, 9: 81}

Can you generate a dict using dict comprehension where the keys are from 0 to 9 and the values are the square roots of the keys? This time, we won't use a list. (One possible answer is sketched at the end of this exercise.)

2. Generate a dictionary using the dict function:

dict_2 = dict([('Tom', 100), ('Dick', 200), ('Harry', 300)])

dict_2

The output is as follows:

{'Tom': 100, 'Dick': 200, 'Harry': 300}

You can also generate a dictionary using the dict function with keyword arguments, as follows:

dict_3 = dict(Tom=100, Dick=200, Harry=300)

dict_3

The output is as follows:

{'Tom': 100, 'Dick': 200, 'Harry': 300}

The dict function is pretty versatile, so both of the preceding commands will generate valid dictionaries.
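For the square-root question posed in step 1, one possible sketch (it uses the math module, which this exercise does not otherwise import) is:

import math

dict_sqrt = {x: math.sqrt(x) for x in range(0, 10)}  # keys 0-9, values are their square roots

dict_sqrt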

The strange-looking pair of values that we just noticed, ('Harry', 300), is called a tuple. This is another important fundamental data type in Python. We will learn about tuples in the next topic.

TUPLES
A tuple is another data type in Python. It is sequential in nature and similar to lists.

A tuple consists of values separated by commas, as follows:

tuple_1 = 24, 42, 2.3456, "Hello"

Notice that, unlike lists, we did not open and close square brackets here.

CREATING A TUPLE WITH DIFFERENT CARDINALITIES
This is how we create an empty tuple:

tuple_1 = ()

And this is how we create a tuple with only one value:

tuple_1 = "Hello",

Notice the trailing comma here.

We can nest tuples, similar to lists and dicts, as follows:

tuple_1 = "hello", "there"

tuple_12 = tuple_1, 45, "Sam"

One special thing about tuples is the fact that they are an immutable data type. So, once created, we cannot change their values. We can just access them, as follows:

tuple_1 = "Hello", "World!"

tuple_1[1] = "Universe!"

The last line of code will result in a TypeError, as a tuple does not allow modification.

This makes the use case for tuples a bit different than that of lists, although they look and behave very similarly in a few aspects.

UNPACKING A TUPLE
The term "unpacking a tuple" simply means getting the values contained in the tuple into different variables:

tuple_1 = "Hello", "World"

hello, world = tuple_1

print(hello)

print(world)

The output is as follows:

Hello

World

Of course, as soon as we do that, we can modify the values contained in those variables.

EXERCISE 11: HANDLING TUPLES
1. Create a tuple to demonstrate how tuples are immutable. Unpack it to read all the elements, as follows:

tupleE = "1", "3", "5"

tupleE

The output is as follows:

('1', '3', '5')

2. Try to override a value in the tupleE tuple:

tupleE[1] = "5"

This step will result in a TypeError, as the tuple does not allow modification.

3. Try to unpack the values of tupleE into separate variables. Note that you cannot assign to literals (1, 3, 5 = tupleE would be a SyntaxError); use variable names instead:

one, three, five = tupleE

4. Print the unpacked variables:

print(one)

print(three)

The output is as follows:

1

3

We have mainly seen two different types of data so far. One is represented by numbers; the other is represented by textual data. Whereas numbers have their own tricks, which we will see later, it is time to look into textual data in a bit more detail.
STRINGS
In the final section of this topic, we will learn about strings. Strings in Python are similar to those in any other programming language.

This is a string:

string1 = 'Hello World!'

A string can also be declared in this manner:

string2 = "Hello World 2!"

You can use either single quotes or double quotes to define a string.

EXERCISE 12: ACCESSING STRINGS
Strings in Python behave similarly to lists, apart from one big caveat: strings are immutable, whereas lists are mutable data structures:

1. Create a string called str_1:

str_1 = "Hello World!"

Access the elements of the string by specifying the location of the element, like we did with lists.

2. Access the first member of the string:

str_1[0]

The output is as follows:

'H'

3. Access the character at index 4 of the string:

str_1[4]

The output is as follows:

'o'

4. Access the last member of the string:

str_1[len(str_1) - 1]

The output is as follows:

'!'

5. Access the last member of the string using negative indexing:

str_1[-1]

The output is as follows:

'!'

Each of the preceding operations will give you the character at the specific index.

Note

The method for accessing the elements of a string is like accessing a list.

EXERCISE 13: STRING SLICES
Just like lists, we can slice strings:

1. Create a string, str_1:

str_1 = "Hello World! I am learning data wrangling"

2. Specify the slicing values and slice the string:

str_1[2:10]

The output is this:

'llo Worl'

3. Slice a string by skipping a slice value:

str_1[-31:]

The output is as follows:

'd! I am learning data wrangling'

4. Use negative numbers to slice the string:

str_1[-10:-5]

The output is as follows:

' wran'

STRING FUNCTIONS
To find out the length of a string, we simply use the len function:

str_1 = "Hello World! I am learning data wrangling"

len(str_1)

The length of the string is 41. To convert a string's case, we can use the lower and upper methods:

str_1 = "A COMPLETE UPPER CASE STRING"

str_1.lower()

str_1.upper()

The output (of the last expression) is as follows:

'A COMPLETE UPPER CASE STRING'

To search for a string within a string, we can use the find method:

str_1 = "A complicated string looks like this"

str_1.find("complicated")

str_1.find("hello")  # This will return -1

The output of the last line is -1. Can you figure out whether the find method is case-sensitive or not? Also, what do you think the find method returns when it actually finds the string?
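To answer those questions empirically, here is a small sketch (the search terms are made up): find is case-sensitive, and when it succeeds it returns the index of the first match:

str_1.find("complicated")  # returns 2, the index where the match starts

str_1.find("Complicated")  # returns -1, because find is case-sensitive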

To replace one string with another, we have the replace method. Since we know that a string is an immutable data structure, replace actually returns a new string instead of replacing and returning the actual one:

str_1 = "A complicated string looks like this"

str_1.replace("complicated", "simple")

The output is as follows:

'A simple string looks like this'

You should look up string methods in the standard documentation of Python 3 to discover more about these methods.

EXERCISE 14: SPLIT AND JOIN
These two string methods need separate introductions, as they enable you to convert a string into a list and vice versa:

1. Create a string and convert it to a list using the split method:

str_1 = "Name, Age, Sex, Address"

list_1 = str_1.split(",")

list_1

The preceding code will give you a list similar to the following:

['Name', ' Age', ' Sex', ' Address']

2. Combine this list into another string using the join method:

" | ".join(list_1)

This code will give you a string like this:

'Name | Age | Sex | Address'

With these, we are at the end of the second topic of this chapter. We now have the motivation to learn data wrangling and a solid introduction to the fundamentals of data structures using Python. There is more to this topic, which will be covered in future chapters.

We have designed an activity for you so that you can practice all the skills you have just learned. This small activity should take around 30 to 45 minutes to finish.

ACTIVITY 2: ANALYZE A MULTILINE STRING AND GENERATE THE UNIQUE WORD COUNT
This section will ensure that you have understood the various basic data structures and their manipulation. We will do that by going through an activity that has been designed specifically for this purpose.

In this activity, we will do the following:

Get multiline text and save it in a Python variable

Get rid of all new lines in it using string methods

Get all the unique words and their occurrences from the string

Repeat the step to find all unique words and occurrences, without considering case sensitivity

Note

For the sake of simplicity for this activity, the original text (which can be found at https://www.gutenberg.org/files/1342/1342-h/1342-h.htm) has been pre-processed a bit.

These are the steps to guide you through solving this activity:

1. Create a multiline_text variable by copying the text from the first chapter of Pride and Prejudice.

Note

The first chapter of Pride and Prejudice by Jane Austen has been made available on the GitHub repository at https://github.com/TrainingByPackt/Data-Wrangling-with-Python/blob/master/Chapter01/Activity02/.

2. Find the type and length of the multiline_text string using the type and len commands.

3. Remove all new lines and symbols using the replace function.

4. Find all of the words in multiline_text using the split function.

5. Create a list from this list that will contain only the unique words.

6. Count the number of times each unique word has appeared in the list using the key and value in a dict.

7. Find the top 25 words from the unique words that you have found using the slice function.

You just created, step by step, a unique word counter using all the neat tricks that you learned about in this chapter.

Note

The solution for this activity can be found on page 285.

Summary
In this chapter, we learned what the term data wrangling means. We also got examples from various real-life data science situations where data wrangling is very useful and is used in industry. We moved on to learn about the different built-in data structures that Python has to offer. We got our hands dirty by exploring lists, sets, dictionaries, tuples, and strings. They are the fundamental building blocks of Python data structures, and we need them all the time while working with and manipulating data in Python. We did several small hands-on exercises to learn more about them. We finished this chapter with a carefully designed activity, which let us combine a lot of different tricks from all the different data structures in a real-life situation and observe the interplay between all of them.

In the next chapter, we will learn about advanced data structures in Python and utilize them to solve real-world problems.
Chapter 2
Advanced Data Structures and File Handling
Learning Objectives
By the end of this chapter, you will be able to:

Compare Python's advanced data structures

Utilize data structures to solve real-world problems

Make use of OS file-handling operations

This chapter emphasizes the data structures in Python and the operating system functions that are the foundation of this book.

Introduction
We were introduced to the basic concepts of different fundamental data structures in the last chapter. We learned about the list, set, dict, tuple, and string. They are the building blocks of future chapters and are essential for data science.

However, what we have covered so far were only basic operations on them. They have much more to offer once you learn how to utilize them effectively. In this chapter, we will venture further into the land of data structures. We will learn about advanced operations and manipulations and use these fundamental data structures to represent more complex and higher-level data structures; this is often handy while wrangling data in real life.

In real life, we deal with data that comes from different sources and generally read data from a file or a database. We will be introduced to operations related to files. We will see how to open a file and how many ways there are to do it, how to read data from it, how to write data to it, and how to safely close it once we are done. The last part, which many people tend to ignore, is super important. We often run into very strange and hard-to-track-down bugs in a real-world system just because a process opened a file and did not close it properly. Without further ado, let's begin our journey.
Advanced Data Structures
We will start this chapter by discussing advanced data structures. We will do that by revisiting lists. We will construct a stack and a queue, explore multiple-element membership checking, and throw in a bit of functional programming for good measure. If all of this sounds intimidating, then do not worry. We will get to things step by step, like in the previous chapter, and you will feel confident once you have finished this chapter.

To start this chapter, you have to open an empty notebook. To do that, you can simply input the following command in a shell. It is advised that you first navigate to an empty directory using cd before you enter the command:

docker run -p 8888:8888 -v 'pwd':/notebooks -it rcshubhadeep/packt-data-wrangling-base:latest

Once the Docker container is running, point your browser to http://localhost:8888 and use dw_4_all as the passcode to access the notebook interface.

ITERATOR
We will start off this topic with lists. However, before we get into lists, we will introduce the concept of an iterator. An iterator is an object that implements the next method, meaning an iterator is an object that can iterate over a collection (lists, tuples, dicts, and so on). It is stateful, which means that each time we call the next method, it gives us the next element from the collection. And if there is no further element, then it raises a StopIteration exception.

Note
A StopIteration exception occurs with the iterator's next method when there are no further values to iterate.

If you are familiar with a programming language like C, C++, Java, JavaScript, or PHP, you may have noticed the difference between the for loop implementation in those languages, which consists of three distinct parts (the initiation, the increment, and the termination condition), and the for loop in Python. In Python, we do not use that kind of for loop. What we use in Python is more like a foreach loop: for i in list_1. This is because, under the hood, the for loop uses an iterator, and thus we do not need to do all the extra steps. The iterator does this for us.
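As a minimal sketch of what happens under the hood (this is not part of the original exercise), we can drive an iterator by hand using the built-in iter and next functions:

small_list = [10, 20, 30]

it = iter(small_list)  # obtain an iterator from the list

next(it)  # 10
next(it)  # 20
next(it)  # 30
next(it)  # raises StopIteration, as there are no further elements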
EXERCISE 15: INTRODUCTION TO THE ITERATOR
To generate lists of numbers, we can use different methods:

1. Generate a list that will contain 10,000,000 ones:

big_list_of_numbers = [1 for x in range(0, 10000000)]

2. Check the size of this variable:

from sys import getsizeof

getsizeof(big_list_of_numbers)

The value it will show you will be something around 81528056 (it is in bytes). This is a lot of memory! And the big_list_of_numbers variable is only available once the list comprehension is over. It can also overflow the available system memory if you try too big a number.

3. Use an iterator to reduce memory utilization:

from itertools import repeat

small_list_of_numbers = repeat(1, times=10000000)

getsizeof(small_list_of_numbers)

The last line shows that our small_list_of_numbers is only 56 bytes in size. Also, it is a lazy method, as it did not generate all the elements. It will generate them one by one when asked, thus saving us time. In fact, if you omit the times keyword argument, then you can practically generate an infinite number of 1s.

4. Loop over the newly generated iterator:

for i, x in enumerate(small_list_of_numbers):
    print(x)
    if i > 10:
        break

We use the enumerate function so that we get the loop counter along with the values. This will help us break once we reach a certain value of the counter (10, for example).

The output will be a short run of ones, printed one per line.

5. To look up the definition of any function, type the function name, followed by a ?, and press Shift + Enter in a Jupyter notebook. Run the following code to understand how we can use permutations and combinations with itertools:

from itertools import (permutations, combinations, dropwhile, repeat, zip_longest)

permutations?

combinations?

dropwhile?

repeat?

zip_longest?

STACKS
A stack is a very useful data structure. If you know a bit about CPU internals and how a program gets executed, then you have an idea that a stack is present in many such cases. It is simply a list with one restriction, Last In First Out (LIFO), meaning an element that comes in last goes out first when a value is read from a stack. The following illustration will make this a bit clearer:

Figure 2.1: A stack with two insert elements and one pop operation

As you can see, we have a LIFO strategy to read values from a stack. We will implement a stack using a Python list. Python's lists have a method called pop, which does the exact same pop operation that you can see in the preceding illustration. We will use that to implement a stack.

EXERCISE 16: IMPLEMENTING A STACK IN PYTHON
1. First, define an empty stack:

stack = []

2. Use the append method to add an element to the stack. Thanks to append, the element will always be appended at the end of the list:

stack.append(25)

stack

The output is as follows:

[25]

3. Append another value to the stack:

stack.append(-12)

stack

The output is as follows:

[25, -12]

4. Read a value from our stack using the pop method. This method reads the current last index of the list and returns it to us. It also deletes the index once the read is done:

tos = stack.pop()

tos

The output is as follows:

-12

After we execute the preceding code, we will have -12 in tos and the stack will have only one element in it, 25.

5. Append "Hello" to the stack:

stack.append("Hello")

stack

The output is as follows:

[25, 'Hello']

Imagine you are scraping a web page and you want to follow each URL that is present there. If you insert (append) them one by one into a stack while you read the web page, and then pop them one by one and follow the link, then you have a clean and extendable solution to the problem. We will examine part of this task in the next exercise.

EXERCISE 17: IMPLEMENTING A STACK USING USER-DEFINED METHODS
We will continue the topic of the stack from the last exercise. But this time, we will implement the append and pop functions ourselves. The aim of this exercise is twofold. On the one hand, we will implement the stack, this time with a real-life example that also involves knowledge of string methods, and thus serves as a reminder of the last chapter and activity. On the other hand, it will show us a subtle feature of Python and how it handles passing list variables to functions, and will bring us to the next exercise, on functional programming:

1. First, we will define two functions, stack_push and stack_pop. We renamed them so that we do not have a namespace conflict. Also, create a stack called url_stack for later use:

def stack_push(s, value):
    return s + [value]

def stack_pop(s):
    tos = s[-1]
    del s[-1]
    return tos

url_stack = []

2. The first function takes the already existing stack and adds the value at the end of it.

Note

Notice the square brackets around the value, which convert it into a one-element list for the sake of the + operation.

3. The second one reads the value that's currently at the -1 index of the stack, then uses the del operator to delete that index, and finally returns the value it read earlier.

4. Now, we are going to have a string with a few URLs in it. Our job is to analyze the string so that we push the URLs onto the stack one by one as we encounter them, and then finally use a for loop to pop them one by one. Let's take the first line from the Wikipedia article about data science:

wikipedia_datascience = "Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge [https://en.wikipedia.org/wiki/Knowledge] and insights from data [https://en.wikipedia.org/wiki/Data] in various forms, both structured and unstructured,similar to data mining [https://en.wikipedia.org/wiki/Data_mining]"

5. For the sake of the simplicity of this exercise, we have kept the links in square brackets beside the target words.

6. Find the length of the string:

len(wikipedia_datascience)

The output is as follows:

347

7. Convert this string into a list by using the split method of the string, and then calculate its length:

wd_list = wikipedia_datascience.split()

len(wd_list)

The output is as follows:

34

8. Use a for loop to go over each word and check whether it is a URL. To do that, we will use the startswith method of the string, and if it is a URL, then we push it onto the stack:

for word in wd_list:
    if word.startswith("[https://"):
        url_stack = stack_push(url_stack, word[1:-1])
        # Notice the clever use of string slicing

9. Print the values in url_stack:

url_stack

The output is as follows:

['https://en.wikipedia.org/wiki/Knowledge',
 'https://en.wikipedia.org/wiki/Data',
 'https://en.wikipedia.org/wiki/Data_mining']

10. Iterate over the list and print the URLs one by one by using the stack_pop function:

for i in range(0, len(url_stack)):
    print(stack_pop(url_stack))

The output is as follows:

Figure 2.2: Output of the URLs that are printed using a stack

11. Print it again to make sure that the stack is empty after the final for loop:

print(url_stack)

The output is as follows:

[]
We have noticed a strange phenomenon with the stack_pop method. We passed the list variable to it and used the del operator inside the function, but it changed the original variable by deleting the last index each time we call the function. If you are coming from a language like C, C++, or Java, then this is completely unexpected behavior, as in those languages this can only happen if we pass the variable by reference, and it can lead to subtle bugs in Python code. So be careful. In general, it is not a good idea to change a variable's value in place, meaning inside a function. Any variable that's passed to a function should be considered and treated as immutable. This is close to the principles of functional programming. A lambda expression in Python is a way to construct one-line, nameless functions that are, by convention, side effect-free.
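Here is a tiny sketch of the behaviour described above (the function and variable names are made up). The caller's list changes because the function receives a reference to the same list object:

def remove_last(s):
    del s[-1]  # mutates the list that was passed in

my_list = [1, 2, 3]

remove_last(my_list)

print(my_list)  # [1, 2] - the original list has changed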

EXERCISE 18: LAMBDA EXPRESSION
In this exercise, we will use a lambda expression to prove the famous trigonometric identity sin²(x) + cos²(x) = 1:

Figure 2.3: Trigonometric identity

1. Import the math package:

import math

2. Define two functions, my_sine and my_cosine. The reason we are declaring these functions is that the original sin and cos functions from the math package take radians as input, but we are more familiar with degrees. So, we will use a lambda expression to define a nameless one-line function and use it. This lambda function will automatically convert our degree input to radians and then apply sin or cos to it and return the value:

def my_sine():
    return lambda x: math.sin(math.radians(x))

def my_cosine():
    return lambda x: math.cos(math.radians(x))

3. Define sine and cosine for our purpose:

sine = my_sine()

cosine = my_cosine()

math.pow(sine(30), 2) + math.pow(cosine(30), 2)

The output is as follows:

1.0

Notice that we have assigned the return values from both my_sine and my_cosine to two variables, and then used them directly as functions. It is a much cleaner approach than using them explicitly. Notice that we did not explicitly write a return statement inside the lambda function; it is assumed.

EXERCISE 19: LAMBDA EXPRESSION FOR SORTING
Here, the lambda expression will take an input and sort a list according to the values in its tuples. A lambda can take one or more inputs. A lambda expression can also be used to reverse sort by passing the parameter reverse as True:

1. Imagine you're in a data wrangling job where you are confronted with the following list of tuples:

capitals = [("USA", "Washington"), ("India", "Delhi"), ("France", "Paris"), ("UK", "London")]

capitals

The output will be as follows:

[('USA', 'Washington'),
 ('India', 'Delhi'),
 ('France', 'Paris'),
 ('UK', 'London')]

2. Sort this list by the name of the capital of each country, using a simple lambda expression. Use the following code:

capitals.sort(key=lambda item: item[1])

capitals

The output will be as follows:

[('India', 'Delhi'),
 ('UK', 'London'),
 ('France', 'Paris'),
 ('USA', 'Washington')]

As we can see, lambda expressions are powerful if we master them and use them in our data wrangling jobs. They are also side effect-free, meaning that they do not change the values of the variables that are passed to them in place.

EXERCISE 20: MULTI-ELEMENT MEMBERSHIP CHECKING
Here is an interesting problem. Let's imagine a list of a few words scraped from a text corpus you are working with:

1. Create a list_of_words list with words scraped from a text corpus:

list_of_words = ["Hello", "there.", "How", "are", "you", "doing?"]

2. Find out whether this list contains all the elements from another list:

check_for = ["How", "are"]

There exists an elaborate solution, which involves a for loop and a few if-else conditions (and you should try to write it!), but there also exists an elegant Pythonic solution to this problem, which takes one line and uses the all function. The all function returns True if all elements of the iterable are true.

3. Use the in keyword to check the membership of each element of check_for in list_of_words:

all(w in list_of_words for w in check_for)

The output is as follows:

True

It is indeed elegant and simple to reason about, and this neat trick is very important when dealing with lists.

QUEUE
Apart from stacks, another high-level data structure that we are interested in is the queue. A queue is like a stack, meaning that you continue adding elements one by one. With a queue, however, the reading of elements obeys a FIFO (First In First Out) strategy. Check out the following diagram to understand this better:

Figure 2.4: Pictorial representation of a queue

We will accomplish this first using plain list methods, and we will show you that, for this purpose, they are inefficient. Then, we will learn about the deque data structure from the collections module of Python.

EXERCISE 21: IMPLEMENTING A QUEUE IN PYTHON
1. Create a Python queue with the plain list methods:

%%time

queue = []

for i in range(0, 100000):
    queue.append(i)

print("Queue created")

The output is as follows:

Queue created

Wall time: 11 ms

2. Use the pop function to empty the queue and check the items in it:

for i in range(0, 100000):
    queue.pop(0)

print("Queue emptied")

The output is as follows:

Queue emptied

If we use the %%time magic command while executing the preceding code, we will see that it takes a while to finish. On a modern MacBook, with a quad-core processor and 8 GB of RAM, it took around 1.20 seconds to finish. This time is taken because of the pop(0) operation: every time we pop a value from the left of the list (which is the current 0 index), Python has to rearrange all the other elements of the list by shifting them one space to the left. Indeed, it is not a very optimized implementation.

3. Implement the same queue using the deque data structure from Python's collections package:

%%time

from collections import deque

queue2 = deque()

for i in range(0, 100000):
    queue2.append(i)

print("Queue created")

for i in range(0, 100000):
    queue2.popleft()

print("Queue emptied")

The output is as follows:

Queue created

Queue emptied

Wall time: 23 ms

4. With the specialized and optimized queue implementation from Python's standard library, the time taken for this whole operation is only around 23 milliseconds! This is a huge improvement on the previous one.

A queue is a very important data structure. To give one example from real life, we can think about a producer-consumer system design. While doing data wrangling, you will often come across a problem where you must process very big files. One of the ways to deal with this problem is to chunk the contents of the file into smaller parts and then push them onto a queue while creating small, dedicated worker processes, which read off the queue and process one small chunk at a time. This is a very powerful design, and you can even use it efficiently to design huge multi-node data wrangling pipelines.
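As a rough, single-process sketch of that idea (the file name and chunk size are made up, and a real pipeline would hand the chunks to worker processes, for example via the multiprocessing module):

from collections import deque

chunk_queue = deque()
chunk_size = 1024  # lines per chunk, chosen arbitrarily

# Producer: read a large file and push fixed-size chunks of lines onto the queue
with open("very_big_file.txt") as fd:
    chunk = []
    for line in fd:
        chunk.append(line)
        if len(chunk) == chunk_size:
            chunk_queue.append(chunk)
            chunk = []
    if chunk:
        chunk_queue.append(chunk)

# Consumer: process one chunk at a time in FIFO order
while chunk_queue:
    chunk = chunk_queue.popleft()
    print("Processing {} lines".format(len(chunk)))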

We will end the discussion on data structures here. What we discussed here is just the tip of the iceberg. Data structures are a fascinating subject. There are many other data structures that we did not touch on and which, when used efficiently, can offer enormous added value. We strongly encourage you to explore data structures more. Try to learn about linked lists, trees, graphs, tries, and all the different variations of them as much as you can. Not only do they offer the joy of learning, but they are also the secret mega weapons in the arsenal of a data practitioner that you can bring out every time you are challenged with a difficult data wrangling job.

ACTIVITY 3: PERMUTATION, ITERATOR, LAMBDA, LIST
In this activity, we will be using permutations to generate all possible three-digit numbers that can be generated using 0, 1, and 2. Then, we will loop over this iterator, and also use isinstance and assert to make sure that the return types are tuples. Also, use a single line of code involving dropwhile and lambda expressions to convert all the tuples to lists while dropping any leading zeros (for example, (0, 1, 2) becomes [1, 2]). Finally, we will write a function that takes a list like this and returns the actual number contained in it.

These steps will guide you to solve this activity:

1. Look up the definitions of permutations and dropwhile from itertools.

2. Write an expression to generate all the possible three-digit numbers using 0, 1, and 2.

3. Loop over the iterator expression you generated before. Print each element that's returned by the iterator. Use assert and isinstance to make sure that the elements are of the tuple type.

4. Write the loop again, using dropwhile with a lambda expression to drop any leading zeros from the tuples. As an example, (0, 1, 2) will become [1, 2]. Also, cast the output of dropwhile to a list.

5. Check the actual type that dropwhile returns.

6. Combine the preceding code into one block, and this time write a separate function where you will pass the list generated from dropwhile, and the function will return the whole number contained in the list. As an example, if you pass [1, 2] to the function, it will return 12. Make sure that the return type is indeed a number and not a string. Although this task can be achieved using other tricks, we require that you treat the incoming list as a stack in the function and generate the number by reading the individual digits from the stack.

With this activity, we have finished this topic, and we will head over to the next topic, which involves basic file-level operations. But before we leave this topic, we encourage you to think about a solution to the preceding problem without using all the advanced operations and data structures we have used here. You will soon realize how complex the naive solution is, and how much value these data structures and operations bring.

Note
The solution for this activity can be found on page 289.

Basic File Operations in Python
In the previous topic, we investigated a few advanced data structures and also learned neat and useful functional programming methods to manipulate them without side effects. In this topic, we will learn about a few operating system (OS)-level functions in Python. We will concentrate mainly on file-related functions and learn how to open a file, read the data line by line or all at once, and finally how to cleanly close the file we opened. We will apply a few of the techniques we have learned on a file that we will read, to practice our data wrangling skills further.

EXERCISE 22: FILE OPERATIONS
In this exercise, we will learn about the OS module of Python, and we will also see two very useful ways to write and read environment variables. The power of writing and reading environment variables is often very important while designing and developing data wrangling pipelines.

Note
In fact, one of the factors of the famous 12-factor app design is the very idea of storing configuration in the environment. You can check it out at this URL: https://12factor.net/config.

The purpose of the OS module is to give you ways to interact with operating system-dependent functionalities. In general, it is pretty low-level and most of the functions from there are not useful on a day-to-day basis; however, some are worth learning. os.environ is the collection Python maintains with all the present environment variables in your OS. It gives you the power to create new ones. The os.getenv function gives you the ability to read an environment variable:

1. Import the os module:

import os

2. Set a few environment variables:

os.environ['MY_KEY'] = "MY_VAL"

os.getenv('MY_KEY')

The output is as follows:

'MY_VAL'

Print the environment variable when it is not set:

print(os.getenv('MY_KEY_NOT_SET'))

The output is as follows:

None

3. Print the os environment:

print(os.environ)

Note

The output has not been added for security reasons.

After executing the preceding code, you will see that you have successfully printed the value of MY_KEY, and when you tried to print MY_KEY_NOT_SET, it printed None.

FILE HANDLING
In this section, we will learn how to open a file in Python. We will learn about the different modes that we can use and what they stand for. Python has a built-in open function that we will use to open a file. The open function takes a few arguments as input. Among them, the first one, which stands for the name of the file you want to open, is the only one that's mandatory. Everything else has a default value. When you call open, Python uses underlying system-level calls to open a file handler and will return it to the caller.

Usually, a file can be opened either for reading or for writing. If we open a file in one mode, the other operation is not supported. Whereas reading usually means we start to read from the beginning of an existing file, writing can mean either starting a new file and writing from the beginning or opening an existing file and appending to it. Here is a table showing you all the different modes Python supports for opening a file:

Figure 2.5: Modes to read a file

There also exists a deprecated mode, U, which in a Python 3 environment does nothing. One thing we must remember here is that Python will always differentiate between the t and b modes, even if the underlying OS doesn't. This is because in b mode, Python does not try to decode what it is reading and gives us back bytes objects instead, whereas in t mode, it does try to decode the stream and gives us back the string representation.

You can open a file for reading like so:

fd = open("Alice’s Adventures in Wonderland, by Lewis Carroll")

This is opened in rt mode. You can open the same file in binary mode if you want. To open the file in binary mode, use the rb mode:

fd = open("Alice’s Adventures in Wonderland, by Lewis Carroll", "rb")

fd

The output is as follows:

<_io.BufferedReader name='Alice’s Adventures in Wonderland, by Lewis Carroll'>

This is how we open a file for writing:

fd = open("interesting_data.txt", "w")

fd

The output is as follows:

<_io.TextIOWrapper name='interesting_data.txt' mode='w' encoding='cp1252'>

EXERCISE 23: OPENING AND CLOSING A FILE
In this exercise, we will learn how to close an open file. It is very important that we close a file once we open it. A lot of system-level bugs can occur due to a dangling file handler. Once we close a file, no further operations can be performed on that file using that specific file handler:

1. Open a file in binary mode:

fd = open("Alice's Adventures in Wonderland, by Lewis Carroll", "rb")

2. Close the file using close():

fd.close()

3. Python also gives us a closed flag with the file handler. If we print it before closing, then we will see False, whereas if we print it after closing, then we will see True. If our logic checks whether a file is properly closed or not, then this is the flag we want to use. A short sketch follows below.
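A small sketch of the flag described in step 3, reusing the file opened above:

fd = open("Alice's Adventures in Wonderland, by Lewis Carroll", "rb")

print(fd.closed)  # False - the file is still open

fd.close()

print(fd.closed)  # True - no further operations are allowed on this handler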
THE WITH STATEMENT
In this section, we will learn about the with statement in Python and how we can effectively use it in the context of opening and closing files.

The with command is a compound statement in Python. Like any compound statement, with also affects the execution of the code enclosed by it. In the case of with, it is used to wrap a block of code in the scope of what we call a context manager in Python. A detailed discussion of context managers is out of the scope of this exercise and this topic in general, but it is sufficient to say that, thanks to a context manager implemented inside the open call for opening a file in Python, a close call is guaranteed to happen automatically if we wrap it inside a with statement.

Note
There is an entire PEP for with at https://www.python.org/dev/peps/pep-0343/. We encourage you to look into it.

OPENING A FILE USING THE WITH STATEMENT
Open a file using the with statement:

with open("Alice’s Adventures in Wonderland, by Lewis Carroll") as fd:
    print(fd.closed)

print(fd.closed)

The output is as follows:

False

True

If we execute the preceding code, we will see that the first print will end up printing False, whereas the second one will print True. This means that as soon as the control goes out of the with block, the file descriptor is automatically closed.

Note
This is by far the cleanest and most Pythonic way to open a file and obtain a file descriptor for it. We encourage you to use this pattern whenever you need to open a file by yourself.

EXERCISE 24: READING A FILE LINE BY LINE
1. Open a file and then read it line by line, printing each line as we read it:

with open("Alice’s Adventures in Wonderland, by Lewis Carroll", encoding="utf8") as fd:
    for line in fd:
        print(line)

The output is as follows:

Figure 2.6: Screenshot from the Jupyter notebook

2. Looking at the preceding code, we can really see why it is important. With this small snippet of code, you can even open and read files that are many GB in size, line by line, and without flooding or overrunning the system memory!

There is another explicit method in the file descriptor object, called readline, which reads one line at a time from a file.

3. Duplicate the same for loop, just after the first one:

with open("Alice’s Adventures in Wonderland, by Lewis Carroll", encoding="utf8") as fd:
    for line in fd:
        print(line)
    print("Ended first loop")
    for line in fd:
        print(line)

The output is as follows:

Figure 2.7: Section of the open file

EXERCISE 25: WRITE TO A FILE
We will end this topic on file operations by showing you how to write to a file. We will write a few lines to a file and read the file back:

1. Use the write function from the file descriptor object:

data_dict = {"India": "Delhi", "France": "Paris", "UK": "London", "USA": "Washington"}

with open("data_temporary_files.txt", "w") as fd:
    for country, capital in data_dict.items():
        fd.write("The capital of {} is {}\n".format(country, capital))

2. Read the file using the following command:

with open("data_temporary_files.txt", "r") as fd:
    for line in fd:
        print(line)

The output is as follows:

The capital of India is Delhi

The capital of France is Paris

The capital of UK is London

The capital of USA is Washington

3. Use the print function to write to a file using the following command:

data_dict_2 = {"China": "Beijing", "Japan": "Tokyo"}

with open("data_temporary_files.txt", "a") as fd:
    for country, capital in data_dict_2.items():
        print("The capital of {} is {}".format(country, capital), file=fd)

4. Read the file using the following command:

with open("data_temporary_files.txt", "r") as fd:
    for line in fd:
        print(line)

The output is as follows:

The capital of India is Delhi

The capital of France is Paris

The capital of UK is London

The capital of USA is Washington

The capital of China is Beijing

The capital of Japan is Tokyo

Note

In the second case, we did not add an extra newline character, \n, at the end of the string to be written. The print function does that automatically for us.

With this, we will end this topic. Just like the previous topics, we have designed an activity for you to practice your newly acquired skills.

ACTIVITY 4: DESIGN YOUR OWN CSV PARSER
A CSV file is something you will encounter a lot in your life as a data practitioner. A CSV is a comma-separated file where data from a tabular format is generally stored and separated using commas, although other characters can also be used.

In this activity, we will be tasked with building our own CSV reader and parser. Although it is a big task if we try to cover all use cases and edge cases, along with escape characters and all, for the sake of this small activity, we will keep our requirements small. We will assume that there is no escape character, meaning that if you use a comma at any place in your row, it means you are starting a new column. We will also assume that the only function we are interested in is being able to read a CSV file line by line, where each read will generate a new dict with the column names as keys and the row's values as values.

Here is an example:

Figure 2.8: Table with sample data

We can convert the data in the preceding table into a Python dictionary, which would look as follows: {"Name": "Bob", "Age": "24", "Location": "California"}:

1. Import zip_longest from itertools. Create a function that zips header, line, and fillvalue=None (a minimal sketch of this zipping idea is shown after the note below).

2. Open the accompanying sales_record.csv file from the GitHub link by using r mode inside a with block, and first check that it is opened.

3. Read the first line and use string methods to generate a list of all the column names.

4. Start reading the file. Read it line by line.

5. Read each line and pass that line to a function, along with the list of the headers. The work of the function is to construct a dict out of these two and fill up the key:values. Keep in mind that a missing value should result in None.

Note

The solution for this activity can be found on page 291.
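If you get stuck on the row-to-dict step, here is a minimal illustration of how zip_longest pairs headers with values. The column names and the sample row below are taken from the Name/Age/Location example above purely for illustration; this is only a sketch of the idea, not the full activity solution:

from itertools import zip_longest

def row_to_dict(header, line):
    # Split the raw line on commas and pair each value with its header.
    # zip_longest pads the shorter sequence with None, so missing
    # trailing values become None automatically.
    values = line.strip().split(",")
    return dict(zip_longest(header, values, fillvalue=None))

header = ["Name", "Age", "Location"]
print(row_to_dict(header, "Bob,24"))
# {'Name': 'Bob', 'Age': '24', 'Location': None}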

Summary

In this chapter, we learned about the workings of advanced data structures such as stacks and queues. We implemented and manipulated both stacks and queues. We then focused on different methods of functional programming, including iterators, and combined lists and functions together. After this, we looked at OS-level functions and the management of environment variables and files. We also examined a clean way to deal with files, and we created our own CSV parser in the last activity.

In the next chapter, we will be dealing with the three most important libraries, namely NumPy, pandas, and matplotlib.
Chapter 3
Introduction to NumPy, Pandas, and Matplotlib

Learning Objectives

By the end of the chapter, you will be able to:

Create and manipulate one-dimensional and multi-dimensional arrays

Create and manipulate pandas DataFrames and series objects

Plot and visualize numerical data using the Matplotlib library

Apply matplotlib, NumPy, and pandas to calculate descriptive statistics from a DataFrame/matrix

In this chapter, you will learn about the fundamentals of the NumPy, pandas, and matplotlib libraries.

Introduction

In the preceding chapters, we covered some advanced data structures, such as the stack, queue, iterator, and file operations in Python. In this section, we will cover three essential libraries, namely NumPy, pandas, and matplotlib.

NumPy Arrays

In the life of a data scientist, reading and manipulating arrays is of prime importance, and it is also the most frequently encountered task. These arrays could be a one-dimensional list, a multi-dimensional table, or a matrix full of numbers.

The array could be filled with integers, floating-point numbers, Booleans, strings, or even mixed types. However, in the majority of cases, numeric data types are predominant.

Some example scenarios where you will need to handle numeric arrays are as follows:

To read a list of phone numbers and postal codes and extract a certain pattern

To create a matrix with random numbers to run a Monte Carlo simulation on some statistical process

To scale and normalize a sales figure table, with lots of financial and transactional data

To create a smaller table of key descriptive statistics (for example, mean, median, min/max range, variance, inter-quartile ranges) from a large raw data table

To read in and analyze time series data in a one-dimensional array, such as the daily stock price of an organization over a year or daily temperature data from a weather station

In short, arrays and numeric data tables are everywhere. As a data wrangling professional, the importance of the ability to read and process numeric arrays cannot be overstated. In this regard, NumPy arrays will be the most important objects in Python that you need to know about.

NUMPY ARRAY AND FEATURES

NumPy and SciPy are open source add-on modules for Python that provide common mathematical and numerical routines in pre-compiled, fast functions. These have grown into highly mature libraries that provide functionality that meets, or perhaps exceeds, what is associated with common commercial software such as MATLAB or Mathematica.

One of the main advantages of the NumPy module is its ability to handle or create one-dimensional or multi-dimensional arrays. This advanced data structure/class is at the heart of the NumPy package and it serves as the fundamental building block of more advanced classes such as pandas and DataFrame, which we will cover shortly in this chapter.

NumPy arrays are different from common Python lists, since Python lists can be thought of as simple arrays. NumPy arrays are built for vectorized operations that process a lot of numerical data with just a single line of code. Many built-in mathematical functions in NumPy arrays are written in low-level languages such as C or Fortran and pre-compiled for really fast execution.
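To make the idea of vectorization concrete, here is a small, illustrative sketch (the numbers and variable names are made up for this example): the same element-wise squaring is expressed once as a Python comprehension over a list and once as a single NumPy expression over the whole array:

import numpy as np

numbers = list(range(1, 6))

# Plain Python: an explicit loop (comprehension) over each element
squares_list = [x ** 2 for x in numbers]

# NumPy: one vectorized expression applied to the whole array at once
squares_array = np.array(numbers) ** 2

print(squares_list)   # [1, 4, 9, 16, 25]
print(squares_array)  # [ 1  4  9 16 25]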

Note
NumPy arrays are optimized data structures for numerical analysis, and that's why they are so important to data scientists.

EXERCISE 26: CREATING A NUMPY ARRAY (FROM A LIST)

In this exercise, we will create a NumPy array from a list:

1. To work with NumPy, we must import it. By convention, we give it a short name, np, while importing:

import numpy as np

2. Create a list with three elements, 1, 2, and 3:

list_1 = [1,2,3]

3. Use the array function to convert it into an array:

array_1 = np.array(list_1)

We just created a NumPy array object called array_1 from the regular Python list object, list_1.

4. Create an array of floating-type elements 1.2, 3.4, and 5.6:

import array as arr

a = arr.array('d', [1.2, 3.4, 5.6])

print(a)

The output is as follows:

array('d', [1.2, 3.4, 5.6])

5. Let's check the type of the newly created object by using the type function:

type(array_1)

The output is as follows:

numpy.ndarray

6. Use type on list_1:

type(list_1)

The output is as follows:

list

So, this is indeed different from the regular list object.

EXERCISE 27: ADDING TWO NUMPY ARRAYS

This simple exercise will demonstrate the addition of two NumPy arrays, and thereby show the key difference between a regular Python list/array and a NumPy array:

1. Consider list_1 and array_1 from the preceding exercise. If you have changed the Jupyter notebook, you will have to declare them again.

2. Use the + notation to add two list_1 objects and save the result in list_2:

list_2 = list_1 + list_1

print(list_2)

The output is as follows:

[1, 2, 3, 1, 2, 3]

3. Use the same + notation to add two array_1 objects and save the result in array_2:

array_2 = array_1 + array_1

print(array_2)

The output is as follows:

[2 4 6]

Did you notice the difference? The first print shows a list with 6 elements, [1, 2, 3, 1, 2, 3]. But the second print shows another NumPy array (or vector) with the elements [2 4 6], which are just the element-wise sums of array_1.

NumPy arrays are like mathematical objects – vectors. They are built for element-wise operations, that is, when we add two NumPy arrays, we add the first element of the first array to the first element of the second array – there is an element-to-element correspondence in this operation. This is in contrast to Python lists, where the elements are simply appended and there is no element-to-element relation. This is the real power of a NumPy array: they can be treated just like mathematical vectors.

A vector is a collection of numbers that can represent, for example, the coordinates of points in a three-dimensional space or the color channel values (RGB) of a picture. Naturally, relative order is important for such a collection and, as we discussed previously, a NumPy array can maintain such order relationships. That's why they are perfectly suitable to use in numerical computations.

EXERCISE 28: MATHEMATICAL OPERATIONS ON NUMPY ARRAYS

Now that you know that these arrays are like vectors, we will try some mathematical operations on arrays.

NumPy arrays even support element-wise exponentiation. For example, suppose there are two arrays – the elements of the first array will be raised to the power of the elements in the second array:

1. Multiply two arrays using the following command:

print("array_1 multiplied by array_1: ",array_1*array_1)

The output is as follows:

array_1 multiplied by array_1: [1 4 9]

2. Divide two arrays using the following command:

print("array_1 divided by array_1: ",array_1/array_1)

The output is as follows:

array_1 divided by array_1: [1. 1. 1.]

3. Raise one array to the power of the second array using the following command:

print("array_1 raised to the power of array_1: ",array_1**array_1)

The output is as follows:

array_1 raised to the power of array_1: [ 1 4 27]

EXERCISE 29: ADVANCED MATHEMATICAL OPERATIONS ON NUMPY ARRAYS

NumPy has all the built-in mathematical functions that you can think of. Here, we will create a list, convert it into a NumPy array, and then perform some advanced mathematical operations on that array:

1. Create a list with five elements:

list_5 = [i for i in range(1,6)]

print(list_5)

The output is as follows:

[1, 2, 3, 4, 5]

2. Convert the list into a NumPy array by using the following command:

array_5 = np.array(list_5)

array_5

The output is as follows:

array([1, 2, 3, 4, 5])

3. Find the sine value of the array by using the following command:

# sine function
print("Sine: ",np.sin(array_5))

The output is as follows:

Sine: [ 0.84147098 0.90929743 0.14112001 -0.7568025 -0.95892427]

4. Find the logarithmic value of the array by using the following command:

# logarithm
print("Natural logarithm: ",np.log(array_5))
print("Base-10 logarithm: ",np.log10(array_5))
print("Base-2 logarithm: ",np.log2(array_5))

The output is as follows:

Natural logarithm: [0. 0.69314718 1.09861229 1.38629436 1.60943791]

Base-10 logarithm: [0. 0.30103 0.47712125 0.60205999 0.69897 ]

Base-2 logarithm: [0. 1. 1.5849625 2. 2.32192809]

5. Find the exponential value of the array by using the following command:

# Exponential
print("Exponential: ",np.exp(array_5))

The output is as follows:

Exponential: [ 2.71828183 7.3890561 20.08553692 54.59815003 148.4131591 ]

EXERCISE 30: GENERATING ARRAYS USING ARANGE AND LINSPACE

Generation of numerical arrays is a fairly common task. So far, we have been doing this by creating a Python list object and then converting that into a NumPy array. However, we can bypass that and work directly with native NumPy methods.

The arange function creates a series of numbers based on the minimum and maximum bounds you give and the step size you specify. Another function, linspace, creates a series with a fixed number of intermediate points between two extremes:

1. Create a series of numbers using the arange method, by using the following command:

print("A series of numbers:",np.arange(5,16))

The output is as follows:

A series of numbers: [ 5 6 7 8 9 10 11 12 13 14 15]

2. Print numbers using the arange function by using the following command:

print("Numbers spaced apart by 2: ",np.arange(0,11,2))
print("Numbers spaced apart by a floating point number: ",np.arange(0,11,2.5))
print("Every 5th number from 30 in reverse order\n",np.arange(30,-1,-5))

The output is as follows:

Numbers spaced apart by 2: [ 0 2 4 6 8 10]

Numbers spaced apart by a floating point number: [ 0. 2.5 5. 7.5 10. ]

Every 5th number from 30 in reverse order
[30 25 20 15 10 5 0]

3. For linearly spaced numbers, we can use the linspace method, as follows:

print("11 linearly spaced numbers between 1 and 5: ",np.linspace(1,5,11))

The output is as follows:

11 linearly spaced numbers between 1 and 5: [1. 1.4 1.8 2.2 2.6 3. 3.4 3.8 4.2 4.6 5. ]

EXERCISE 31: CREATING MULTI-DIMENSIONAL ARRAYS

So far, we have created only one-dimensional arrays. Now, let's create some multi-dimensional arrays (such as a matrix in linear algebra). Just like we created the one-dimensional array from a simple flat list, we can create a two-dimensional array from a list of lists:

1. Create a list of lists and convert it into a two-dimensional NumPy array by using the following command:

list_2D = [[1,2,3],[4,5,6],[7,8,9]]

mat1 = np.array(list_2D)

print("Type/Class of this object:",type(mat1))

print("Here is the matrix\n----------\n",mat1,"\n----------")

The output is as follows:

Type/Class of this object: <class 'numpy.ndarray'>

Here is the matrix
----------
[[1 2 3]
[4 5 6]
[7 8 9]]
----------

2. Tuples can be converted into multi-dimensional arrays by using the following code:

tuple_2D = np.array([(1.5,2,3), (4,5,6)])

mat_tuple = np.array(tuple_2D)

print(mat_tuple)

The output is as follows:

[[1.5 2.  3. ]
[4.  5.  6. ]]

Thus, we have created multi-dimensional arrays using Python lists and tuples.

EXERCISE 32: THE DIMENSION, SHAPE, SIZE, AND DATA TYPE OF THE TWO-DIMENSIONAL ARRAY

The following methods let you check the dimension, shape, and size of the array. Note that if it's a 3x2 matrix, that is, it has 3 rows and 2 columns, then the shape will be (3, 2), but the size will be 6, as 6 = 3x2:

1. Print the dimension of the matrix using ndim by using the following command:

print("Dimension of this matrix: ",mat1.ndim,sep='')

The output is as follows:

Dimension of this matrix: 2

2. Print the size using size:

print("Size of this matrix: ", mat1.size,sep='')

The output is as follows:

Size of this matrix: 9

3. Print the shape of the matrix using shape:

print("Shape of this matrix: ", mat1.shape,sep='')

The output is as follows:

Shape of this matrix: (3, 3)

4. Print the data type using dtype:

print("Data type of this matrix: ", mat1.dtype,sep='')

The output is as follows:

Data type of this matrix: int32

EXERCISE 33: ZEROS, ONES, RANDOM, IDENTITY MATRICES, AND VECTORS

Now that we are familiar with basic vector (one-dimensional) and matrix data structures in NumPy, we will take a look at how to create special matrices easily. Often, you may have to create matrices filled with zeros, ones, random numbers, or ones in the diagonal:

1. Print a vector of zeros by using the following command:

print("Vector of zeros: ",np.zeros(5))

The output is as follows:

Vector of zeros: [0. 0. 0. 0. 0.]

2. Print a matrix of zeros by using the following command:

print("Matrix of zeros: ",np.zeros((3,4)))

The output is as follows:

Matrix of zeros: [[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]]

3. Print a matrix of fives by using the following command:

print("Matrix of 5's: ",5*np.ones((3,3)))

The output is as follows:

Matrix of 5's: [[5. 5. 5.]
[5. 5. 5.]
[5. 5. 5.]]

4. Print an identity matrix by using the following command:

print("Identity matrix of dimension 2:",np.eye(2))

The output is as follows:

Identity matrix of dimension 2: [[1. 0.]
[0. 1.]]

5. Print an identity matrix with a dimension of 4x4 by using the following command:

print("Identity matrix of dimension 4:",np.eye(4))

The output is as follows:

Identity matrix of dimension 4: [[1. 0. 0. 0.]
[0. 1. 0. 0.]
[0. 0. 1. 0.]
[0. 0. 0. 1.]]

6. Print a matrix of random numbers with shape (4,3) using the randint function:

print("Random matrix of shape (4,3):\n",np.random.randint(low=1,high=10,size=(4,3)))

The sample output is as follows:

Random matrix of shape (4,3):
[[6 7 6]
[5 6 7]
[5 3 6]
[2 9 4]]

Note

When creating matrices, you need to pass tuples of integers as arguments.

Random number generation is a very useful utility that needs to be mastered for data science/data wrangling tasks. We will look at the topic of random variables and distributions again in the section on statistics and see how NumPy and pandas have built-in random number and series generation, as well as manipulation functions.
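When you need your random draws to be reproducible (for example, so that a notebook produces the same matrices every time it is run), you can fix the seed of NumPy's random number generator first. This is a small optional sketch, not part of the exercise, and the seed value is arbitrary:

np.random.seed(42)  # any fixed integer makes the following draws repeatable
print(np.random.randint(low=1, high=10, size=(2,3)))

np.random.seed(42)  # resetting the same seed reproduces the same matrix
print(np.random.randint(low=1, high=10, size=(2,3)))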

EXERCISE 34: RESHAPING, RAVEL, MIN, MAX, AND SORTING

Reshaping an array is a very useful operation for vectors, as machine learning algorithms may demand input vectors in various formats for mathematical manipulation. In this section, we will look at how reshaping can be done on an array. The opposite of reshape is the ravel function, which flattens any given array into a one-dimensional array. It is a very useful action in many machine learning and data analytics tasks.

We will first generate a random one-dimensional vector of two-digit numbers and then reshape the vector into multi-dimensional arrays:

1. Create an array of 30 random integers (sampled from 1 to 99) and reshape it into two different forms using the following code:

a = np.random.randint(1,100,30)

b = a.reshape(2,3,5)

c = a.reshape(6,5)

2. Print the shapes using the shape attribute by using the following code:

print("Shape of a:", a.shape)

print("Shape of b:", b.shape)

print("Shape of c:", c.shape)

The output is as follows:

Shape of a: (30,)
Shape of b: (2, 3, 5)
Shape of c: (6, 5)

3. Print the arrays a, b, and c using the following code:

print("\na looks like\n",a)

print("\nb looks like\n",b)

print("\nc looks like\n",c)

The sample output is as follows:

a looks like
[ 7 82 9 29 50 50 71 65 33 84 55 78 40 68 50 15 65 55 98 38 23 75 50 57
32 69 34 59 98 48]

b looks like
[[[ 7 82 9 29 50]
[50 71 65 33 84]
[55 78 40 68 50]]

[[15 65 55 98 38]
[23 75 50 57 32]
[69 34 59 98 48]]]

c looks like
[[ 7 82 9 29 50]
[50 71 65 33 84]
[55 78 40 68 50]
[15 65 55 98 38]
[23 75 50 57 32]
[69 34 59 98 48]]

Note

"b" is a three-dimensional array – a kind of list of a list of a list.

4. Ravel array b using the following code:

b_flat = b.ravel()

print(b_flat)

The sample output is as follows:

[ 7 82 9 29 50 50 71 65 33 84 55 78 40 68 50 15 65 55 98 38 23 75 50 57
32 69 34 59 98 48]

EXERCISE 35: INDEXING AND SLICING

Indexing and slicing of NumPy arrays is very similar to regular list indexing. We can even step through a vector of elements with a definite step size by providing it as an additional argument in the format (start, stop, step). Furthermore, we can pass a list as the argument to select specific elements.

In this exercise, we will learn about indexing and slicing on one-dimensional and multi-dimensional arrays:

Note
In multi-dimensional arrays, you can use two numbers to denote the position of an element. For example, if the element is in the third row and second column, its indices are 2 and 1 (because of Python's zero-based indexing).

1. Create an array of 10 elements and examine its various elements by slicing and indexing the array with slightly different syntaxes. Do this by using the following command:

array_1 = np.arange(0,11)

print("Array:",array_1)

The output is as follows:

Array: [ 0 1 2 3 4 5 6 7 8 9 10]

2. Print the element in the seventh position by using the following command:

print("Element at 7th index is:", array_1[7])

The output is as follows:

Element at 7th index is: 7

3. Print the elements between the third and sixth positions by using the following command:

print("Elements from 3rd to 5th index are:", array_1[3:6])

The output is as follows:

Elements from 3rd to 5th index are: [3 4 5]

4. Print the elements up to the fourth position by using the following command:

print("Elements up to 4th index are:", array_1[:4])

The output is as follows:

Elements up to 4th index are: [0 1 2 3]

5. Print the elements backwards by using the following command:

print("Elements from last backwards are:", array_1[-1::-1])

The output is as follows:

Elements from last backwards are: [10 9 8 7 6 5 4 3 2 1 0]

6. Print three elements counted backwards from the end, skipping every other value, by using the following command:

print("3 Elements from last backwards are:", array_1[-1:-6:-2])

The output is as follows:

3 Elements from last backwards are: [10 8 6]

7. Create a new array called array_2 by using the following command:

array_2 = np.arange(0,21,2)

print("New array:",array_2)

The output is as follows:

New array: [ 0 2 4 6 8 10 12 14 16 18 20]

8. Print the elements at the second, fourth, and ninth indices of the array:

print("Elements at 2nd, 4th, and 9th index are:", array_2[[2,4,9]])

The output is as follows:

Elements at 2nd, 4th, and 9th index are: [ 4 8 18]

9. Create a multi-dimensional array by using the following command:

matrix_1 = np.random.randint(10,100,15).reshape(3,5)

print("Matrix of random 2-digit numbers\n ",matrix_1)

The sample output is as follows:

Matrix of random 2-digit numbers
[[21 57 60 24 15]
[53 20 44 72 68]
[39 12 99 99 33]]

10. Access the values using double bracket indexing by using the following command:

print("\nDouble bracket indexing\n")

print("Element in row index 1 and column index 2:", matrix_1[1][2])

The sample output is as follows:

Double bracket indexing

Element in row index 1 and column index 2: 44

11. Access the values using single bracket indexing by using the following command:

print("\nSingle bracket with comma indexing\n")

print("Element in row index 1 and column index 2:", matrix_1[1,2])

The sample output is as follows:

Single bracket with comma indexing

Element in row index 1 and column index 2: 44

12. Access the values in a multi-dimensional array using a row or column by using the following command:

print("\nRow or column extract\n")

print("Entire row at index 2:", matrix_1[2])

print("Entire column at index 3:", matrix_1[:,3])

The sample output is as follows:

Row or column extract

Entire row at index 2: [39 12 99 99 33]

Entire column at index 3: [24 72 99]

13. Print the matrix with the specified row and column indices by using the following command:

print("\nSubsetting sub-matrices\n")

print("Matrix with row indices 1 and 2 and column indices 3 and 4\n", matrix_1[1:3,3:5])

The sample output is as follows:

Subsetting sub-matrices

Matrix with row indices 1 and 2 and column indices 3 and 4
[[72 68]
[99 33]]

14. Print the matrix with the specified row and column indices by using the following command:

print("Matrix with row indices 0 and 1 and column indices 1 and 3\n", matrix_1[0:2, [1,3]])

The sample output is as follows:

Matrix with row indices 0 and 1 and column indices 1 and 3
[[57 24]
[20 72]]

CONDITIONAL SUBSETTING

Conditional subsetting is a way to select specific elements based on some numeric condition. It is almost like a shortened version of a SQL query to subset elements. See the following example:

matrix_1 = np.array(np.random.randint(10,100,15)).reshape(3,5)

print("Matrix of random 2-digit numbers\n",matrix_1)

print ("\nElements greater than 50\n", matrix_1[matrix_1>50])

The sample output is as follows (note that the exact output will be different for you as it is random):

Matrix of random 2-digit numbers
[[71 89 66 99 54]
[28 17 66 35 85]
[82 35 38 15 47]]

Elements greater than 50
[71 89 66 99 54 66 85 82]
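Conditions can also be combined. The following short sketch (with arbitrary thresholds) uses the element-wise boolean operators & (and) and | (or) to subset on more than one condition at once; note that each condition must be wrapped in parentheses:

print("Elements between 30 and 60\n",
      matrix_1[(matrix_1 > 30) & (matrix_1 < 60)])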

EXERCISE 36: ARRAY OPERATIONS (ARRAY-ARRAY, ARRAY-SCALAR, AND UNIVERSAL FUNCTIONS)

NumPy arrays operate just like mathematical matrices, and the operations are performed element-wise.

Create two matrices (multi-dimensional arrays) with random integers and demonstrate element-wise mathematical operations such as addition, subtraction, multiplication, and division. Show the exponentiation (raising a number to a certain power) operation, as follows:

Note
Due to random number generation, your specific output could be different to what is shown here.

1. Create two matrices:

matrix_1 = np.random.randint(1,10,9).reshape(3,3)

matrix_2 = np.random.randint(1,10,9).reshape(3,3)

print("\n1st Matrix of random single-digit numbers\n",matrix_1)

print("\n2nd Matrix of random single-digit numbers\n",matrix_2)

The sample output is as follows (note that the exact output will be different for you as it is random):

1st Matrix of random single-digit numbers
[[6 5 9]
[4 7 1]
[3 2 7]]

2nd Matrix of random single-digit numbers
[[2 3 1]
[9 9 9]
[9 9 6]]

2. Perform addition, multiplication, division, and a linear combination on the matrices:

print("\nAddition\n", matrix_1+matrix_2)

print("\nMultiplication\n", matrix_1*matrix_2)

print("\nDivision\n", matrix_1/matrix_2)

print("\nLinear combination: 3*A - 2*B\n", 3*matrix_1-2*matrix_2)

The sample output is as follows (note that the exact output will be different for you as it is random):

Addition
[[ 8 8 10]
[13 16 10]
[12 11 13]]

Multiplication
[[12 15 9]
[36 63 9]
[27 18 42]]

Division
[[3. 1.66666667 9. ]
[0.44444444 0.77777778 0.11111111]
[0.33333333 0.22222222 1.16666667]]

Linear combination: 3*A - 2*B
[[ 14 9 25]
[ -6 3 -15]
[ -9 -12 9]]

3. Perform the addition of a scalar, the element-wise cube, and the element-wise square root:

print("\nAddition of a scalar (100)\n", 100+matrix_1)

print("\nExponentiation, matrix cubed here\n", matrix_1**3)

print("\nExponentiation, square root using 'pow' function\n",pow(matrix_1,0.5))

The sample output is as follows (note that the exact output will be different for you as it is random):

Addition of a scalar (100)
[[106 105 109]
[104 107 101]
[103 102 107]]

Exponentiation, matrix cubed here
[[216 125 729]
[ 64 343 1]
[ 27 8 343]]

Exponentiation, square root using 'pow' function
[[2.44948974 2.23606798 3. ]
[2. 2.64575131 1. ]
[1.73205081 1.41421356 2.64575131]]

STACKING ARRAYS

Stacking arrays on top of each other (or side by side) is a useful operation for data wrangling. Here is the code:

a = np.array([[1,2],[3,4]])

b = np.array([[5,6],[7,8]])

print("Matrix a\n",a)

print("Matrix b\n",b)

print("Vertical stacking\n",np.vstack((a,b)))

print("Horizontal stacking\n",np.hstack((a,b)))

The output is as follows:

Matrix a
[[1 2]
[3 4]]

Matrix b
[[5 6]
[7 8]]

Vertical stacking
[[1 2]
[3 4]
[5 6]
[7 8]]

Horizontal stacking
[[1 2 5 6]
[3 4 7 8]]
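A closely related and more general function is np.concatenate, which joins arrays along an explicit axis. The following brief sketch reproduces the two stacking results above using the same a and b:

print("Concatenate along rows (axis=0)\n", np.concatenate((a,b), axis=0))
print("Concatenate along columns (axis=1)\n", np.concatenate((a,b), axis=1))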

NumPy has many other advanced features, mainly related to statistics and linear algebra functions, which are used extensively in machine learning and data science tasks. However, not all of that is directly useful for beginner-level data wrangling, so we won't cover it here.

Pandas DataFrames

The pandas library is a Python package that provides fast, flexible, and expressive data structures that are designed to make working with relational or labeled data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis/manipulation tool that's available in any language.

The two primary data structures of pandas, Series (one-dimensional) and DataFrame (two-dimensional), handle the vast majority of typical use cases. Pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other third-party libraries.

EXERCISE 37: CREATING A PANDAS SERIES

In this exercise, we will learn how to create a pandas series object from the data structures that we created previously. If you have imported pandas as pd, then the function to create a series is simply pd.Series:

1. Initialize labels, lists, and a dictionary:

labels = ['a','b','c']

my_data = [10,20,30]

array_1 = np.array(my_data)

d = {'a':10,'b':20,'c':30}

print ("Labels:", labels)

print("My data:", my_data)

print("Dictionary:", d)

The output is as follows:

Labels: ['a', 'b', 'c']

My data: [10, 20, 30]

Dictionary: {'a': 10, 'b': 20, 'c': 30}

2. Import pandas as pd by using the following command:

import pandas as pd

3. Create a series from the my_data list by using the following command:

series_1 = pd.Series(data=my_data)

print(series_1)

The output is as follows:

0 10
1 20
2 30
dtype: int64

4. Create a series from the my_data list along with the labels, as follows:

series_2 = pd.Series(data=my_data, index=labels)

print(series_2)

The output is as follows:

a 10
b 20
c 30
dtype: int64

5. Then, create a series from the NumPy array, as follows:

series_3 = pd.Series(array_1, labels)

print(series_3)

The output is as follows:

a 10
b 20
c 30
dtype: int32

6. Create a series from the dictionary, as follows:

series_4 = pd.Series(d)

print(series_4)

The output is as follows:

a 10
b 20
c 30
dtype: int64

EXERCISE 38: PANDAS SERIES AND DATA HANDLING

The pandas series object can hold many types of data. This is the key to constructing a bigger table where multiple series objects are stacked together to create a database-like entity:

1. Create a pandas series with numerical data by using the following command:

print ("\nHolding numerical data\n",'-'*25, sep='')

print(pd.Series(array_1))

The output is as follows:

Holding numerical data
-------------------------
0 10
1 20
2 30
dtype: int32

2. Create a pandas series with text labels by using the following command:

print ("\nHolding text labels\n",'-'*20, sep='')

print(pd.Series(labels))

The output is as follows:

Holding text labels
--------------------
0 a
1 b
2 c
dtype: object

3. Create a pandas series with functions by using the following command:

print ("\nHolding functions\n",'-'*20, sep='')

print(pd.Series(data=[sum,print,len]))

The output is as follows:

Holding functions
--------------------
0 <built-in function sum>
1 <built-in function print>
2 <built-in function len>
dtype: object

4. Create a pandas series with objects from a dictionary by using the following command:

print ("\nHolding objects from a dictionary\n",'-'*40, sep='')

print(pd.Series(data=[d.keys, d.items, d.values]))

The output is as follows:

Holding objects from a dictionary
----------------------------------------
0 <built-in method keys of dict object at 0x0000...
1 <built-in method items of dict object at 0x000...
2 <built-in method values of dict object at 0x00...
dtype: object

EXERCISE 39: CREATING PANDAS DATAFRAMES

The pandas DataFrame is similar to an Excel table or relational database (SQL) table that consists of three main components: the data, the index (or rows), and the columns. Under the hood, it is a stack of pandas series objects, which are themselves built on top of NumPy arrays. So, all of our previous knowledge of NumPy arrays applies here:

1. Create a simple DataFrame from a two-dimensional matrix of numbers. First, the code draws 20 random integers from the uniform distribution. Then, we reshape it into a (5,4) NumPy array – 5 rows and 4 columns:

matrix_data = np.random.randint(1,10,size=20).reshape(5,4)

2. Define the row labels as ('A','B','C','D','E') and the column labels as ('W','X','Y','Z'):

row_labels = ['A','B','C','D','E']

column_headings = ['W','X','Y','Z']

df = pd.DataFrame(data=matrix_data, index=row_labels,
                  columns=column_headings)

3. The function to create a DataFrame is pd.DataFrame, and it was called in the preceding step. Now, print the result:

print("\nThe data frame looks like\n",'-'*45, sep='')

print(df)

The sample output is as follows:

The data frame looks like
---------------------------------------------
  W X Y Z
A 6 3 3 3
B 1 9 9 4
C 4 3 6 9
D 4 8 6 7
E 6 6 9 1

4. Create a DataFrame from a Python dictionary of some lists of integers by using the following command:

d = {'a':[10,20],'b':[30,40],'c':[50,60]}

5. Pass this dictionary as the data argument to the pd.DataFrame function. Pass on a list of rows or indices. Notice how the dictionary keys became the column names and the values were distributed among multiple rows:

df2 = pd.DataFrame(data=d, index=['X','Y'])

print(df2)

The output is as follows:

    a   b   c
X  10  30  50
Y  20  40  60

Note

The most common way that you will encounter to create a pandas DataFrame will be to read tabular data from a file on your local disk or over the internet – CSV, text, JSON, HTML, Excel, and so on. We will cover some of these in the next chapter.

EXERCISE 40: VIEWING A DATAFRAME PARTIALLY

In the previous section, we used print(df) to print the whole DataFrame. For a large dataset, we would like to print only sections of the data. In this exercise, we will view a part of the DataFrame:

1. Execute the following code to create a DataFrame with 25 rows and fill it with random numbers:

# 25 rows and 4 columns
matrix_data = np.random.randint(1,100,100).reshape(25,4)

column_headings = ['W','X','Y','Z']

df = pd.DataFrame(data=matrix_data, columns=column_headings)

2. Run the following code to view only the first five rows of the DataFrame:

df.head()

The sample output is as follows (note that your output could be different due to randomness):

Figure 3.1: First five rows of the DataFrame

By default, head shows only five rows. If you want to see any specific number of rows, just pass that as an argument.

3. Print the first eight rows by using the following command:

df.head(8)

The sample output is as follows:

Figure 3.2: First eight rows of the DataFrame

Just like head shows the first few rows, tail shows the last few rows.

4. Print the last ten rows of the DataFrame using the tail command, as follows:

df.tail(10)

The sample output is as follows:

Figure 3.3: Last ten rows of the DataFrame

INDEXING AND SLICING COLUMNS

There are two methods for indexing and slicing columns from a DataFrame. They are as follows:

The DOT method

The bracket method

The DOT method accesses a column as if it were an attribute of the DataFrame (for example, df.X); it is handy for quickly picking out a specific column, but it only works when the column name is a valid Python identifier. The bracket method is intuitive and easy to follow. In this method, you can access the data by the generic name/header of the column.

The following code illustrates these concepts. Execute it in your Jupyter notebook:

print("\nThe 'X' column\n",'-'*25, sep='')

print(df['X'])

print("\nType of the column: ", type(df['X']), sep='')

print("\nThe 'X' and 'Z' columns indexed by passing a list\n",'-'*55, sep='')

print(df[['X','Z']])

print("\nType of the pair of columns: ", type(df[['X','Z']]), sep='')

The output is as follows (a screenshot is shown here because the actual column is long):

Figure 3.4: Rows of the 'X' column

This is the output showing the type of the column:

Figure 3.5: Type of the 'X' column

This is the output showing the 'X' and 'Z' columns indexed by passing a list:

Figure 3.6: Rows of the 'X' and 'Z' columns

This is the output showing the type of the pair of columns:

Figure 3.7: Type of the pair of columns

Note
For more than one column, the object turns into a DataFrame. But for a single column, it is a pandas series object.
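To make the difference concrete, here is a brief, illustrative sketch using the df defined above (whose column names are single letters and therefore valid Python identifiers). Attribute-style (DOT) access and bracket access return the same series:

print(df.X.head())           # DOT (attribute-style) access to the 'X' column
print(df['X'].head())        # bracket access to the same column
print(df.X.equals(df['X']))  # True – both return the same pandas series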

INDEXING AND SLICING ROWS

Indexing and slicing rows in a DataFrame can also be done using the following methods:

The label-based loc method

The index-based iloc method

The loc method is intuitive and easy to follow. In this method, you can access the data by the generic name (label) of the row. On the other hand, the iloc method allows you to access the rows by their numerical index. It can be very useful for a large table with thousands of rows, especially when you want to iterate over the table in a loop with a numerical counter. The following code illustrates the concepts of loc and iloc:

matrix_data = np.random.randint(1,10,size=20).reshape(5,4)

row_labels = ['A','B','C','D','E']

column_headings = ['W','X','Y','Z']

df = pd.DataFrame(data=matrix_data, index=row_labels,
                  columns=column_headings)

print("\nLabel-based 'loc' method for selecting row(s)\n",'-'*60, sep='')

print("\nSingle row\n")

print(df.loc['C'])

print("\nMultiple rows\n")

print(df.loc[['B','C']])

print("\nIndex position based 'iloc' method for selecting row(s)\n",'-'*70, sep='')

print("\nSingle row\n")

print(df.iloc[2])

print("\nMultiple rows\n")

print(df.iloc[[1,2]])

The sample output is as follows:

Figure 3.8: Output of the loc and iloc methods

EXERCISE 41: CREATING AND DELETING A NEW COLUMN OR ROW

One of the most common tasks in data wrangling is creating or deleting columns or rows of data from your DataFrame. Sometimes, you want to create a new column based on some mathematical operation or transformation involving the existing columns. This is similar to manipulating database records and inserting a new column based on simple transformations. We show some of these concepts in the following code blocks:

1. Create a new column using the following snippet:

print("\nA column is created by assigning it in relation\n",'-'*75, sep='')

df['New'] = df['X']+df['Z']

df['New (Sum of X and Z)'] = df['X']+df['Z']

print(df)

The sample output is as follows:

Figure 3.9: Output after adding a new column

2. Drop a column using the df.drop method:

print("\nA column is dropped by using df.drop() method\n",'-'*55, sep='')

df = df.drop('New', axis=1)  # Notice the axis=1 option; axis=0 is the default, so it has to be changed to 1

print(df)

The sample output is as follows:

Figure 3.10: Output after dropping a column

3. Drop a specific row using the df.drop method:

df1 = df.drop('A')

print("\nA row is dropped by using df.drop method and axis=0\n",'-'*65, sep='')

print(df1)

The sample output is as follows:

Figure 3.11: Output after dropping a row

The drop method creates a copy of the DataFrame and does not change the original DataFrame.

4. Change the original DataFrame by setting the inplace argument to True:

print("\nAn in-place change can be done by making inplace=True in the drop method\n",'-'*75, sep='')

df.drop('New (Sum of X and Z)', axis=1, inplace=True)

print(df)

A sample output is as follows:

Figure 3.12: Output after using the inplace argument

Note
All the normal operations are not in-place, that is, they do not impact the original DataFrame object but return a copy of the original with the addition (or deletion). The last bit of code shows how to make a change in the existing DataFrame with the inplace=True argument. Please note that this change is irreversible and should be used with caution.

Statistics and Visualization with NumPy and Pandas

One of the great advantages of using libraries such as NumPy and pandas is that a plethora of built-in statistical and visualization methods are available, for which we don't have to search for and write new code. Furthermore, most of these subroutines are written using C or Fortran code (and pre-compiled), making them extremely fast to execute.

REFRESHER OF BASIC DESCRIPTIVE STATISTICS (AND THE MATPLOTLIB LIBRARY FOR VISUALIZATION)

For any data wrangling task, it is quite useful to extract basic descriptive statistics from the data and create some simple visualizations/plots. These plots are often the first step in identifying fundamental patterns as well as oddities (if present) in the data. In any statistical analysis, descriptive statistics is the first step, followed by inferential statistics, which tries to infer the underlying distribution or process from which the data might have been generated.

As inferential statistics is intimately coupled with the machine learning/predictive modeling stage of a data science pipeline, descriptive statistics naturally becomes associated with the data wrangling aspect.

There are two broad approaches for descriptive statistical analysis:

Graphical techniques: Bar plots, scatter plots, line charts, box plots, histograms, and so on

Calculation of central tendency and spread: Mean, median, mode, variance, standard deviation, range, and so on

In this topic, we will demonstrate how you can accomplish both of these tasks using Python. Apart from NumPy and pandas, we will need to learn the basics of another great package – matplotlib – which is the most powerful and versatile visualization library in Python.

EXERCISE 42: INTRODUCTION TO MATPLOTLIB THROUGH A SCATTER PLOT

In this exercise, we will demonstrate the power and simplicity of matplotlib by creating a simple scatter plot from some data about the age, weight, and height of a few people:

1. First, we define simple lists of names, ages, weights (in kgs), and heights (in centimeters):

people = ['Ann','Brandon','Chen','David','Emily','Farook',
          'Gagan','Hamish','Imran','Joseph','Katherine','Lily']

age = [21,12,32,45,37,18,28,52,5,40,48,15]

weight = [55,35,77,68,70,60,72,69,18,65,82,48]

height = [160,135,170,165,173,168,175,159,105,171,155,158]

2. Import the most important module from matplotlib, called pyplot:

import matplotlib.pyplot as plt

3. Create a simple scatter plot of age versus weight:

plt.scatter(age,weight)

plt.show()

The output is as follows:

Figure 3.13: A screenshot of a scatter plot containing age and weight

The plot can be improved by enlarging the figure size, customizing the aspect ratio, adding a title with a proper font size, adding X-axis and Y-axis labels with a customized font size, adding grid lines, changing the Y-axis limit to be between 0 and 100, adding X and Y tick marks, customizing the scatter plot's color, and changing the size of the scatter dots.

4. The code for the improved plot is as follows:

plt.figure(figsize=(8,6))

plt.title("Plot of Age vs. Weight (in kgs)",fontsize=20)

plt.xlabel("Age (years)",fontsize=16)

plt.ylabel("Weight (kgs)",fontsize=16)

plt.grid(True)

plt.ylim(0,100)

plt.xticks([i*5 for i in range(12)],fontsize=15)

plt.yticks(fontsize=15)

plt.scatter(x=age,y=weight,c='orange',s=150,edgecolors='k')

plt.text(x=20,y=85,s="Weights after 18-20 years of age",fontsize=15)

plt.vlines(x=20,ymin=0,ymax=80,linestyles='dashed',color='blue',lw=3)

plt.legend(['Weight in kgs'],loc=2,fontsize=12)

plt.show()

The output is as follows:

Figure 3.14: A screenshot of a scatter plot showing age versus weight

Observe the following:

A tuple, (8,6), is passed as an argument for the figure size.

A list comprehension is used inside xticks to create a customized list of ticks at 0-5-10-…-55.

A newline (\n) character can be used inside the plt.text() function to break up and distribute the text over two lines.

The plt.show() function is used at the very end. The idea is to keep on adding various graphics properties (font, color, axis limits, text, legend, grid, and so on) until you are satisfied and then show the plot with one function. The plot will not be displayed without this last function call.

DEFINITION OF STATISTICAL MEASURES – CENTRAL TENDENCY AND SPREAD

A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data. They are also categorized as summary statistics:

Mean: The mean is the sum of all values divided by the total number of values.

Median: The median is the middle value. It is the value that splits the dataset in half. To find the median, order your data from smallest to largest, and then find the data point that has an equal number of values above it and below it.

Mode: The mode is the value that occurs the most frequently in your dataset. On a bar chart, the mode is the highest bar.

Generally, the mean is a better measure to use for symmetric data and the median is a better measure for data with a skewed (left or right heavy) distribution. For categorical data, you have to use the mode:

Figure 3.15: A screenshot of a curve showing the mean, median, and mode

The spread of the data is a measure of by how much the values in the dataset are likely to differ from the mean of the values. If all the values are close together, then the spread is low; on the other hand, if some or all of the values differ by a large amount from the mean (and each other), then there is a large spread in the data:

Variance: This is the most common measure of spread. Variance is the average of the squares of the deviations from the mean. Squaring the deviations ensures that negative and positive deviations do not cancel each other out.

Standard Deviation: Because variance is produced by squaring the distance from the mean, its unit does not match that of the original data. Standard deviation is a mathematical trick to bring back the parity. It is the positive square root of the variance.
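All of these measures are available as one-line calls in NumPy and pandas. The following is an illustrative sketch on a small made-up sample (the numbers are arbitrary):

data = pd.Series([21, 12, 32, 45, 37, 18, 28, 52, 5, 40, 48, 15])

print("Mean:", data.mean())
print("Median:", data.median())
print("Mode:", data.mode().tolist())   # mode() returns a series, since ties are possible
print("Variance:", data.var())         # sample variance by default (ddof=1)
print("Standard deviation:", data.std())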

RANDOM VARIABLES AND PROBABILITY DISTRIBUTION

A random variable is defined as the value of a given variable that represents the outcome of a statistical experiment or process.

Although it sounds very formal, pretty much everything around us that we can measure can be thought of as a random variable.

The reason behind this is that almost all natural, social, biological, and physical processes are the final outcome of a large number of complex processes, and we cannot know the details of those fundamental processes. All we can do is observe and measure the final outcome.

Typical examples of random variables that are around us are as follows:

The economic output of a nation

The blood pressure of a patient

The temperature of a chemical process in a factory

The number of friends of a person on Facebook

The stock market price of a company

These values can take any discrete or continuous value and they follow a particular pattern (although the pattern may vary over time). Therefore, they can all be classified as random variables.

WHAT IS A PROBABILITY DISTRIBUTION?

A probability distribution is a function that describes the likelihood of obtaining the possible values that a random variable can assume. In other words, the values of a variable vary based on the underlying probability distribution.

Suppose you go to a school and measure the heights of students who have been selected randomly. Height is an example of a random variable here. As you measure height, you can create a distribution of height. This type of distribution is useful when you need to know which outcomes are most likely, the spread of potential values, and the likelihood of different results.

The concepts of central tendency and spread are applicable to a distribution and are used to describe the properties and behavior of a distribution.

Statisticians generally divide all distributions into two broad categories:

Discrete distributions

Continuous distributions

DISCRETE DISTRIBUTIONS

Discrete probability functions are also known as probability mass functions and can assume a discrete number of values. For example, coin tosses and counts of events are discrete functions. You can have only heads or tails in a coin toss. Similarly, if you're counting the number of trains that arrive at a station per hour, you can count 11 or 12 trains, but nothing in-between.

Some prominent discrete distributions are as follows:

The binomial distribution, to model binary data, such as coin tosses

The Poisson distribution, to model count data, such as the count of library book checkouts per hour

The uniform distribution, to model multiple events with the same probability, such as rolling a die

CONTINUOUS DISTRIBUTIONS

Continuous probability functions are also known as probability density functions. You have a continuous distribution if the variable can assume an infinite number of values between any two values. Continuous variables are often measurements on a real number scale, such as height, weight, and temperature.

The most well-known continuous distribution is the normal distribution, which is also known as the Gaussian distribution or the bell curve. This symmetric distribution fits a wide variety of phenomena, such as human height and IQ scores.

The normal distribution is linked to the famous 68-95-99.7 rule, which describes the percentage of data that falls within 1, 2, or 3 standard deviations away from the mean if the data follows a normal distribution. This means that you can quickly look at some sample data, calculate the mean and standard deviation, and have a confidence (a statistical measure of uncertainty) that any future incoming data will fall within those 68%-95%-99.7% boundaries. This rule is widely used in industries, medicine, economics, and social science:

Figure 3.16: Curve showing the normal distribution and the famous 68-95-99.7 rule
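You can verify this rule empirically with a quick simulation. The following is a small, illustrative sketch (the sample size is arbitrary): it draws normally distributed numbers with NumPy and measures the fraction that falls within 1, 2, and 3 standard deviations of the mean:

samples = np.random.normal(loc=0, scale=1, size=100000)
mean, std = samples.mean(), samples.std()

for k in (1, 2, 3):
    within = np.mean(np.abs(samples - mean) < k*std)
    print("Within {} standard deviation(s): {:.1%}".format(k, within))
# The printed fractions should come out close to 68.3%, 95.4%, and 99.7%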
DATA WRANGLING IN STATISTICS AND VISUALIZATION

A good data wrangling professional is expected to encounter a dizzying array of diverse data sources each day. As we explained previously, due to the multitude of complex sub-processes and mutual interactions that give rise to such data, they all fall into the category of discrete or continuous random variables.

It will be extremely difficult and confusing for the data wrangler or data science team if all of this data continues to be treated as completely random and without any shape or pattern. A formal statistical basis must be given to such random data streams, and one of the simplest ways to start that process is to measure their descriptive statistics.

Assigning a stream of data to a particular distribution function (or a combination of many distributions) is actually part of inferential statistics. However, inferential statistics starts only when descriptive statistics is done alongside measuring all the important parameters of the pattern of the data.

Therefore, as the frontline of a data science pipeline, data wrangling must deal with measuring and quantifying such descriptive statistics of the incoming data. Along with the formatted and cleaned-up data, the primary job of a data wrangler is to hand over these measures (and sometimes accompanying plots) to the next team member in analytics.

Plotting and visualization also help a data wrangling team identify potential outliers and misfits in the incoming data stream and help them take appropriate action. We will see some examples of such tasks in the next chapter, where we will identify odd data points by creating scatter plots or histograms and either impute or omit the data points.

USING NUMPY AND PANDAS TO CALCULATE BASIC DESCRIPTIVE STATISTICS ON THE DATAFRAME
Now that we have some basic knowledge of NumPy, pandas, and matplotlib under our belt, we can explore a few additional topics related to these libraries, such as how we can bring them together for advanced data generation, analysis, and visualization.

RANDOM NUMBER GENERATION USING NUMPY
NumPy offers a dizzying array of random number generation utility functions, all of which correspond to various statistical distributions, such as uniform, binomial, Gaussian normal, Beta/Gamma, and chi-square. Most of these functions are extremely useful and appear countless times in advanced statistical data mining and machine learning tasks. Having a solid knowledge of them is strongly encouraged for all readers of this book.

Here, we will discuss three of the most important distributions that may come in handy for data wrangling tasks – uniform, binomial, and Gaussian normal. The goal here is to show examples of simple function calls that can generate one or more random numbers/arrays whenever the user needs them.

Note
The results will be different for each reader when they use these functions, as they are supposed to be random.

EXERCISE 43: GENERATING RANDOM NUMBERS FROM A UNIFORM DISTRIBUTION
In this exercise, we will generate random numbers from a uniform distribution:

1. Generate a random integer between 1 and 10:

x = np.random.randint(1,10)

print(x)

The sample output is as follows (your output could be different):

2. Generate a random integer between 1 and 10, but with size=1 as an argument. This generates a NumPy array of size 1:

x = np.random.randint(1,10,size=1)

print(x)

The sample output is as follows (your output could be different due to the random draw):

[8]

Therefore, we can easily write the code to generate the outcome of a die being thrown (a normal 6-sided die) for 10 trials, as shown in the sketch that follows.
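
A minimal sketch of that dice simulation, reusing the same np.random.randint call (note that the upper bound is exclusive, so we pass 7), could look like this:

dice_rolls = np.random.randint(1, 7, size=10)

print(dice_rolls)

This prints an array of 10 integers between 1 and 6; your values will differ because they are random.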

How about moving away from integers and generating some real numbers? Let's say that we want to generate artificial data for the weights (in kg) of 15 adults, and we can measure accurate weights up to two decimal places.

3. Generate decimal data using the following command:

x = 50+50*np.random.random(size=15)

x = x.round(decimals=2)

print(x)

The sample output is as follows:

[56.24 94.67 50.66 94.36 77.37 53.81 61.47 71.13 59.3  65.3  63.02 65.
 58.21 81.21 91.62]

We are not restricted to one-dimensional arrays only.

4. Generate and show a 3x3 matrix with random numbers between 0 and 1:

x = np.random.rand(3,3)

print(x)

The sample output is as follows (note that your specific output could be different due to randomness):

[[0.99240105 0.9149215  0.04853315]
 [0.8425871  0.11617792 0.77983995]
 [0.82769081 0.57579771 0.11358125]]

EXERCISE 44: GENERATING RANDOM NUMBERS FROM A BINOMIAL DISTRIBUTION AND BAR PLOT
A binomial distribution is the probability distribution of getting a specific number of successes in a specific number of trials of an event with a pre-determined chance or probability.

The most obvious example of this is a coin toss. A fair coin may have an equal chance of heads or tails, but an unfair coin may have a higher chance of heads coming up, or vice versa. We can simulate a coin toss in NumPy in the following manner.

Suppose we have a biased coin where the probability of heads is 0.6. We toss this coin ten times and note down the number of heads turning up each time. That is one trial or experiment. Now, we can repeat this experiment (10 coin tosses) any number of times, say 8 times. Each time, we record the number of heads:

1. The experiment can be simulated using the following code:

x = np.random.binomial(10,0.6,size=8)

print(x)

The sample output is as follows (note that your specific output could be different due to randomness):

[6 6 5 6 5 8 4 5]

2. Plot the result using a bar chart:

plt.figure(figsize=(7,4))

plt.title("Number of successes in coin toss",fontsize=16)

plt.bar(left=np.arange(1,9),height=x)

plt.xlabel("Experiment number",fontsize=15)

plt.ylabel("Number of successes",fontsize=15)

plt.show()

The sample output is as follows:

Figure 3.17: A screenshot of a graph showing the binomial distribution and the bar plot
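
As a side note (not part of the original exercise), a quick sanity check on such an experiment is to compare the average number of successes against the theoretical mean of a binomial distribution, n x p, which here is 10 x 0.6 = 6. Also, in matplotlib 3.0 and later the bar function no longer accepts the left keyword; the x positions are passed as the first argument instead. A minimal sketch of both points follows:

# Empirical mean of successes vs. the theoretical mean n*p = 6
print("Average successes:", x.mean(), "| Theoretical mean:", 10 * 0.6)

# Equivalent bar plot for newer matplotlib versions
# (x positions passed as the first positional argument instead of left=)
plt.bar(np.arange(1, 9), height=x)
plt.show()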

EXERCISE 45: GENERATING RANDOM NUMBERS FROM A NORMAL DISTRIBUTION AND HISTOGRAMS
We discussed the normal distribution in the last topic and mentioned that it is the most important probability distribution because many pieces of natural, social, and biological data follow this pattern closely when the number of samples is large. NumPy provides an easy way to generate random numbers corresponding to this distribution:

1. Draw a single sample from a normal distribution by using the following command:

x = np.random.normal()

print(x)

The sample output is as follows (note that your specific output could be different due to randomness):

-1.2423774071573694

We know that the normal distribution is characterized by two parameters – the mean (µ) and the standard deviation (σ). In fact, the default values for this particular function are µ = 0.0 and σ = 1.0.

Suppose we know that the heights of the teenage (12-16 years) students in a particular school are distributed normally, with a mean height of 155 cm and a standard deviation of 10 cm.

2. Generate a histogram of 100 students by using the following command:

# Code to generate the 100 samples (heights)

heights = np.random.normal(loc=155,scale=10,size=100)

# Plotting code
#-----------------------

plt.figure(figsize=(7,5))

plt.hist(heights,color='orange',edgecolor='k')

plt.title("Histogram of teenaged students' height",fontsize=18)

plt.xlabel("Height in cm",fontsize=15)

plt.xticks(fontsize=15)

plt.yticks(fontsize=15)

plt.show()

The sample output is as follows:

Figure 3.18: Histogram of teenage students' heights

Note the use of the loc parameter for the mean (=155) and the scale parameter for the standard deviation (=10). The size parameter is set to 100 to generate that many samples.
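
As a quick check (an addition, not part of the original exercise), you can verify that the empirical mean and standard deviation of the generated sample land close to the loc and scale values you requested; with only 100 samples, some deviation is expected:

print("Sample mean: {:.2f} cm".format(heights.mean()))
print("Sample std : {:.2f} cm".format(heights.std()))
# Both should be roughly 155 and 10, respectively, but not exactly,
# because the sample size is only 100
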
EXERCISE 46: CALCULATION OF DESCRIPTIVE STATISTICS FROM A DATAFRAME
Recollect the age, weight, and height parameters that we defined for the plotting exercise. Let's put that data in a DataFrame to calculate various descriptive statistics about them.

The best part of working with a pandas DataFrame is that it has a built-in utility function to show all of these descriptive statistics with a single line of code. It does this by using the describe method:

1. Construct a dictionary with the available series data by using the following command:

people_dict = {'People':people,'Age':age,'Weight':weight,'Height':height}

people_df = pd.DataFrame(data=people_dict)

people_df

The output is as follows:
Figure 3.19: Output of the created dictionary

2. Find the number of rows and columns of the DataFrame by executing the following command:

print(people_df.shape)

The output is as follows:

(12, 4)

3. Obtain a simple count (any column can be used for this purpose) by executing the following command:

print(people_df['Age'].count())

The output is as follows:

12

4. Calculate the sum total of age by using the following command:

print(people_df['Age'].sum())

The output is as follows:

353

5. Calculate the mean age by using the following command:

print(people_df['Age'].mean())

The output is as follows:

29.416666666666668

6. Calculate the median weight by using the following command:

print(people_df['Weight'].median())

The output is as follows:

66.5

7. Calculate the maximum height by using the following command:

print(people_df['Height'].max())

The output is as follows:

175

8. Calculate the standard deviation of the weights by using the following command:

print(people_df['Weight'].std())

The output is as follows:

18.45120510148239

Note how we are calling the statistical functions directly from a DataFrame object.

9. To calculate a percentile, we can call a function from NumPy and pass in the particular column (a pandas Series). For example, to calculate the 75th and 25th percentiles of the age distribution and their difference (called the inter-quartile range), use the following code:

pcnt_75 = np.percentile(people_df['Age'],75)

pcnt_25 = np.percentile(people_df['Age'],25)

print("Inter-quartile range: ",pcnt_75-pcnt_25)

The output is as follows:

Inter-quartile range:  24.0

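As an aside (an alternative, not the approach used in the exercise), pandas Series also expose a quantile method, so the same inter-quartile range can be computed without calling NumPy:

iqr = people_df['Age'].quantile(0.75) - people_df['Age'].quantile(0.25)

print("Inter-quartile range: ", iqr)
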
10. Use the describe command to find a detailed description of the DataFrame:

print(people_df.describe())

The output is as follows:

Figure 3.20: Output of the DataFrame using the describe method

Note
This function works only on the columns where numeric data is present. It has no impact on the non-numeric columns, for example, People in this DataFrame.
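
If you do want non-numeric columns such as People to appear in the summary, pandas' describe accepts an include argument; this is an optional extension of the step above, not a replacement for it:

# include='all' adds count, unique, top, and freq for non-numeric columns
# (numeric-only statistics show up as NaN for those columns)
print(people_df.describe(include='all'))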

EXERCISE 47: BUILT-IN PLOTTING UTILITIES
A DataFrame also has built-in plotting utilities that wrap around matplotlib functions and create basic plots of numeric data:

1. Find the histogram of the weights by using the hist function:

people_df['Weight'].hist()

plt.show()

The output is as follows:
Figure 3.21: Histogram of the weights

2. Create a simple scatter plot directly from the DataFrame to plot the relationship between weight and height by using the following command:

people_df.plot.scatter('Weight','Height',s=150,
                       c='orange',edgecolor='k')

plt.grid(True)

plt.title("Weight vs. Height scatter plot",fontsize=18)

plt.xlabel("Weight (in kg)",fontsize=15)

plt.ylabel("Height (in cm)",fontsize=15)

plt.show()

The output is as follows:
Figure 3.22: Weight versus Height scatter plot

Note
You can try regular matplotlib methods around this function call to make your plot pretty.

ACTIVITY 5: GENERATING STATISTICS FROM A CSV FILE
Suppose you are working with the famous Boston housing price dataset (from 1960). This dataset is famous in the machine learning community. Many regression problems can be formulated, and machine learning algorithms can be run, on this dataset. You will perform a basic data wrangling activity (including plotting some trends) on this dataset by reading it in as a pandas DataFrame.

Note
The pandas function for reading a CSV file is read_csv.

These steps will help you complete this activity:

1. Load the necessary libraries.

2. Read in the Boston housing dataset (given as a .csv file) from the local directory.

3. Check the first 10 records. Find the total number of records.

4. Create a smaller DataFrame with columns that do not include CHAS, NOX, B, and LSTAT.

5. Check the last seven records of the new DataFrame you just created.

6. Plot the histograms of all the variables (columns) in the new DataFrame.

7. Plot them all at once using a for loop. Try to add a unique title to each plot.

8. Create a scatter plot of crime rate versus price.

9. Plot using log10(crime) versus price.

10. Calculate some useful statistics, such as mean rooms per dwelling, median age, mean distances to five Boston employment centers, and the percentage of houses with a low price (< $20,000).

A minimal starter sketch for the first few steps is shown after the following note.

Note

The solution for this activity can be found on page 292.
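
The sketch below only covers reading the file and checking the records; the filename Boston_housing.csv is an assumption and should be replaced with the actual CSV provided in the book's code bundle:

import pandas as pd

# Filename is an assumption – use the CSV that ships with the code bundle
boston_df = pd.read_csv("Boston_housing.csv")

# Check the first 10 records and the total number of records
print(boston_df.head(10))
print("Total records:", boston_df.shape[0])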

Summary
In this chapter, we started with the basics of NumPy arrays, including how to create them and their essential properties. We discussed and showed how a NumPy array is optimized for vectorized element-wise operations and differs from a regular Python list. Then, we moved on to practicing various operations on NumPy arrays, such as indexing, slicing, filtering, and reshaping. We also covered special one-dimensional and two-dimensional arrays, such as zeros, ones, identity matrices, and random arrays.

In the second major topic of this chapter, we started with pandas Series objects and quickly moved on to a critically important object – the pandas DataFrame. It is analogous to an Excel or MATLAB sheet or a database table, but with many useful properties for data wrangling. We demonstrated some basic operations on DataFrames, such as indexing, subsetting, row and column addition, and deletion.

Next, we covered the basics of plotting with matplotlib, the most widely used and popular Python library for visualization. Along with plotting exercises, we touched upon refresher concepts of descriptive statistics (such as central tendency and measures of spread) and probability distributions (such as uniform, binomial, and normal).

In the next chapter, we will cover more advanced operations with pandas DataFrames that will come in very handy for day-to-day work in a data wrangling job.
Chapter 4
A Deep Dive into Data Wrangling with Python
Learning Objectives
By the end of this chapter, you will be able to:

Perform subsetting, filtering, and grouping on pandas DataFrames

Apply Boolean filtering and indexing on a DataFrame to choose specific elements

Perform JOIN operations in pandas that are analogous to the SQL command

Identify missing or corrupted data and choose to drop or apply imputation techniques to missing or corrupted data

In this chapter, we will learn about pandas DataFrames in detail.

Introduction
In this chapter, we will learn about several advanced operations involving pandas DataFrames and NumPy arrays. On completing the detailed activity for this chapter, you will have handled real-life datasets and understood the process of data wrangling.

Subsetting, Filtering, and Grouping
One of the most important aspects of data wrangling is to curate the data carefully from the deluge of streaming data that pours into an organization or business entity from various sources. Lots of data is not always a good thing; rather, data needs to be useful and of high quality to be effectively used in downstream activities of a data science pipeline, such as machine learning and predictive model building. Moreover, one data source can be used for multiple purposes, and this often requires different subsets of data to be processed by a data wrangling module. This is then passed on to separate analytics modules.

For example, let's say you are doing data wrangling on US state-level economic output. It is a fairly common scenario that one machine learning model may require data for large and populous states (such as California, Texas, and so on), while another model demands processed data for small and sparsely populated states (such as Montana or North Dakota). As the frontline of the data science process, it is the responsibility of the data wrangling module to satisfy the requirements of both these machine learning models. Therefore, as a data wrangling engineer, you have to filter and group data accordingly (based on the population of the state) before processing it and producing separate datasets as the final output for the separate machine learning models.

Also, in some cases, data sources may be biased, or the measurement may corrupt the incoming data occasionally. It is a good idea to try to filter only the error-free, good data for downstream modeling. From these examples and discussions, it is clear that filtering and grouping/bucketing data is an essential skill for any engineer who is engaged in the task of data wrangling. Let's proceed to learn about a few of these skills with pandas.

EXERCISE 48: LOADING AND EXAMINING A SUPERSTORE'S SALES DATA FROM AN EXCEL FILE
In this exercise, we will load and examine an Excel file.

1. To read an Excel file into pandas, you will need a small package called xlrd to be installed on your system. If you are working from inside this book's Docker container, then this package may not be available the next time you start your container, and you will have to follow the same step again. Use the following code to install the xlrd package:

!pip install xlrd

2. Load the Excel file from GitHub by using the simple pandas method read_excel:

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

df = pd.read_excel("Sample - Superstore.xls")

df.head()

Examine all the columns and check if they are useful for analysis:

Figure 4.1: Output of the Excel file in a DataFrame

On examining the file, we can see that the first column, called Row ID, is not very useful.

3. Drop this column altogether from the DataFrame by using the drop method:

df.drop('Row ID',axis=1,inplace=True)

4. Check the number of rows and columns in the newly created dataset. We will use the shape function here:

df.shape

The output is as follows:

(9994, 20)

We can see that the dataset has 9,994 rows and 20 columns.

SUBSETTING THE DATAFRAME
Subsetting involves the extraction of partial data based on specific columns and rows, as per business needs. Suppose we are interested only in the following information from this dataset: Customer ID, Customer Name, City, Postal Code, and Sales. For demonstration purposes, let's assume that we are only interested in 5 records – rows 5-9. We can subset the DataFrame to extract only this much information using a single line of Python code.

Use the loc method to index the dataset by the names of the columns and the index of the rows:

df_subset = df.loc[
    [i for i in range(5,10)],
    ['Customer ID','Customer Name','City','Postal Code',
     'Sales']]

df_subset

The output is as follows:
Figure 4.2: DataFrame indexed by name of the columns

We need to pass two arguments to the loc method – one for indicating the rows, and another for indicating the columns. These should be Python lists.

For the rows, we have to pass the list [5,6,7,8,9], but instead of writing that explicitly, we use a list comprehension, that is, [i for i in range(5,10)].

Because the columns we are interested in are not contiguous, we cannot just put in a continuous range and need to pass a list containing the specific names. So, the second argument is just a simple list with specific column names.

This example demonstrates the fundamental concept of subsetting a DataFrame based on business requirements.

AN EXAMPLE USE CASE: DETERMINING STATISTICS ON SALES AND PROFIT
This quick section shows a typical use case of subsetting. Suppose we want to calculate descriptive statistics (mean, median, standard deviation, and so on) of records 100-199 for sales and profit. This is how subsetting helps us achieve that:

df_subset = df.loc[[i for i in range(100,200)],['Sales','Profit']]

df_subset.describe()

The output is as follows:
Figure 4.3: Output of descriptive statistics of data

Furthermore, we can create boxplots of the sales and profit figures from this final data, as shown in the sketch after the following figure.

We simply extract records 100-199 and run the describe function on them because we don't want to process all the data! For this particular business question, we are only interested in sales and profit numbers, and therefore we should not take the easy route and run a describe function on all the data. For a real-life dataset, the number of rows and columns could often be in the millions, and we don't want to compute anything that is not asked for in the data wrangling task. We always aim to subset the exact data that needs to be processed and run statistical or plotting functions on that partial data:
Figure 4.4: Boxplot of sales and profit
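
A minimal sketch of the plotting call that could produce such a boxplot from df_subset is shown here; the figsize and title values are illustrative choices, not prescribed by the text:

# Boxplot of the Sales and Profit columns in the 100-record subset
df_subset.plot.box(figsize=(7,5))
plt.title("Boxplot of sales and profit", fontsize=15)
plt.grid(True)
plt.show()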

EXERCISE 49: THE UNIQUE FUNCTION
Before continuing further with filtering methods, let's take a quick detour and explore a super useful function called unique. As the name suggests, this function is used to scan through the data quickly and extract only the unique values in a column or row.

After loading the superstore sales data, you will notice that there are columns like "Country", "State", and "City". A natural question to ask is how many countries/states/cities are present in the dataset:

1. Extract the countries/states/cities for which information is in the database, with one simple line of code, as follows:

df['State'].unique()

The output is as follows:
Figure 4.5: Different states present in the dataset

You will see a list of all the states whose data is present in the dataset.

2. Use the nunique method to count the number of unique values, like so:

df['State'].nunique()

The output is as follows:

49

This returns 49 for this dataset. So, one out of the 50 states in the US does not appear in this dataset.

Similarly, if we run this function on the Country column, we get an array with only one element, United States. Immediately, we can see that we don't need to keep the Country column at all, because there is no useful information in that column except that all the entries are the same. This is how a simple function helped us decide to drop a column altogether – that is, to remove 9,994 pieces of unnecessary data!
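
A minimal sketch of that check and the subsequent drop (assuming the column is named Country, as described above) is as follows:

print(df['Country'].nunique())   # returns 1, since every entry is United States

# The column carries no information, so it could be dropped
# (shown without inplace=True so the original DataFrame is left untouched)
df_no_country = df.drop('Country', axis=1)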

CONDITIONAL SELECTION AND BOOLEAN FILTERING
Often, we don't want to process the whole dataset and would like to select only a partial dataset whose contents satisfy a particular condition. This is probably the most common use case of any data wrangling task.

In the context of our superstore sales dataset, think of these common questions that may arise from the daily activity of the business analytics team:

What are the average sales and profit figures in California?

Which states have the highest and lowest total sales?

Which consumer segment has the most variance in sales/profit?

Among the top 5 states in sales, which shipping mode and product category are the most popular choices?

Countless examples can be given where the business analytics team or the executive management want to glean insight from a particular subset of data that meets certain criteria.

If you have any prior experience with SQL, you will know that these kinds of questions require fairly complex SQL query writing. Remember the WHERE clause?

We will show you how to use conditional subsetting and Boolean filtering to answer such questions.

First, we need to understand the critical concept of boolean indexing. This process essentially accepts a conditional expression as an argument and returns a dataset of booleans in which the TRUE value appears in places where the condition was satisfied. A simple example is shown in the following code. For demonstration purposes, we subset a small dataset of 10 records and 3 columns:

df_subset = df.loc[[i for i in range(10)],['Ship Mode','State','Sales']]

df_subset

The output is as follows:
Figure 4.6: Sample dataset

Now, if we just want to know the records with sales higher than $100, then we can write the following:

df_subset>100

This produces the following boolean DataFrame:

Figure 4.7: Records with sales higher than $100

Note the True and False entries in the Sales column. Values in the Ship Mode and State columns were not impacted by this code because the comparison was with a numerical quantity, and the only numeric column in the original DataFrame was Sales.

Now, let's see what happens if we pass this boolean DataFrame as an index to the original DataFrame:

df_subset[df_subset>100]

The output is as follows:
Figure 4.8: Results after passing the boolean DataFrame as an index to the original DataFrame

The NaN values came from the fact that the preceding code tried to create a DataFrame with only the TRUE indices (in the boolean DataFrame). The values that were TRUE in the boolean DataFrame were retained in the final output DataFrame.

The program inserted NaN values for the rows where data was not available (because they were discarded due to the Sales value being < $100).

Now, we probably don't want to work with this resulting DataFrame with NaN values. We wanted a smaller DataFrame with only the rows where Sales > $100. We can achieve that by simply passing only the Sales column:

df_subset[df_subset['Sales']>100]

This produces the expected result:

Figure 4.9: Results after removing the NaN values

We are not limited to conditional expressions involving numeric quantities only. Let's try to extract high sales values (> $100) for entries that do not involve Colorado.

We can write the following code to accomplish that:

df_subset[(df_subset['State']!='Colorado') & (df_subset['Sales']>100)]

Note the use of a conditional involving a string. In this expression, we are joining two conditionals with an & operator. Both conditions must be wrapped inside parentheses.

The first conditional expression simply matches the entries in the State column to the string Colorado and assigns TRUE/FALSE accordingly. The second conditional is the same as before. Together, joined by the & operator, they extract only those rows for which State is not Colorado and Sales is > $100. We get the following result:

Figure 4.10: Results where State is not Colorado and Sales is higher than $100

Note
Although, in theory, there is no limit on how complex a conditional you can build using individual expressions and the & (LOGICAL AND) and | (LOGICAL OR) operators, it is advisable to create intermediate boolean DataFrames with limited conditional expressions and build your final DataFrame step by step. This keeps the code legible and scalable.
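
For example, a minimal sketch of that step-by-step style applied to the present question might look as follows:

# Build each condition as its own boolean Series first
not_colorado = df_subset['State'] != 'Colorado'
high_sales = df_subset['Sales'] > 100

# Combine the intermediate masks only at the end
df_subset[not_colorado & high_sales]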

EXERCISE 50: SETTING AND RESETTING THE INDEX
Sometimes, we may need to reset or eliminate the default index of a DataFrame and assign a new column as an index:

1. Create the matrix_data, row_labels, and column_headings variables using the following command:

matrix_data = np.matrix(
    '22,66,140;42,70,148;30,62,125;35,68,160;25,62,152')

row_labels = ['A','B','C','D','E']

column_headings = ['Age', 'Height', 'Weight']

2. Create a DataFrame using the matrix_data, row_labels, and column_headings variables:

df1 = pd.DataFrame(data=matrix_data,
                   index=row_labels,
                   columns=column_headings)

print("\nThe DataFrame\n",'-'*25, sep='')

print(df1)

The output is as follows:

Figure 4.11: The original DataFrame

3. Reset the index, as follows:

print("\nAfter resetting index\n",'-'*35, sep='')

print(df1.reset_index())

The output is as follows:
Figure 4.12: DataFrame after resetting the index

4. Reset the index with drop set to True, as follows:

print("\nAfter resetting index with 'drop' option TRUE\n",'-'*45, sep='')

print(df1.reset_index(drop=True))

The output is as follows:

Figure 4.13: DataFrame after resetting the index with the drop option set to True

5. Add a new column using the following command:

print("\nAdding a new column 'Profession'\n",'-'*45, sep='')

df1['Profession'] = "Student Teacher Engineer Doctor Nurse".split()

print(df1)

The output is as follows:

Figure 4.14: DataFrame after adding a new column called Profession

6. Now, set the Profession column as an index using the following code:

print("\nSetting 'Profession' column as index\n",'-'*45, sep='')

print(df1.set_index('Profession'))

The output is as follows:

Figure 4.15: DataFrame after setting Profession as the index

EXERCISE 51: THE GROUPBY METHOD
GroupBy refers to a process involving one or more of the following steps:

Splitting the data into groups based on some criteria

Applying a function to each group independently

Combining the results into a data structure

In many situations, we can split the dataset into groups and do something with those groups. In the apply step, we might wish to do one of the following:

Aggregation: Compute a summary statistic (or statistics) for each group – sum, mean, and so on

Transformation: Perform a group-specific computation and return a like-indexed object – a z-transformation or filling missing data with a value

Filtration: Discard some groups, according to a group-wise computation that evaluates to TRUE or FALSE

There is, of course, a describe method on this GroupBy object, which produces the summary statistics in the form of a DataFrame.
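
As a hedged illustration of the aggregation step described above (an addition to the exercise, not part of it), the agg method lets you compute several summary statistics per group in one call:

# Mean, total, and count of Sales for each State in the superstore data
df.groupby('State')['Sales'].agg(['mean', 'sum', 'count'])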

GroupBy is not limited to a single variable. If you pass multiple variables (as a list), then you will get back a structure that is essentially similar to a pivot table (from Excel). The following is an example where we group together all the states and cities from the whole dataset (the snapshot is a partial view only).

Note
The name GroupBy should be quite familiar to those who have used an SQL-based tool before.

1. Create a 10-record subset using the following command:

df_subset = df.loc[[i for i in range(10)],['Ship Mode','State','Sales']]

2. Create a pandas GroupBy object by grouping on the State column, as follows:

byState = df_subset.groupby('State')

3. Calculate the mean sales figure by state by using the following command:

print("\nGrouping by 'State' column and listing mean sales\n",'-'*50, sep='')

print(byState.mean())

The output is as follows:

Figure 4.16: Output after grouping by state and listing mean sales

4. Calculate the total sales figure by state by using the following command:

print("\nGrouping by 'State' column and listing total sum of sales\n",'-'*50, sep='')

print(byState.sum())

The output is as follows:
Figure 4.17: The output after grouping by state and listing the sum of sales

5. Subset that DataFrame for a particular state and show the statistics:

pd.DataFrame(byState.describe().loc['California'])

The output is as follows:

Figure 4.18: Checking the statistics of a particular state

6. Perform a similar summarization by using the Ship Mode attribute:

df_subset.groupby('Ship Mode').describe().loc[['Second Class','Standard Class']]

The output will be as follows:

Figure 4.19: Checking the sales by summarizing the Ship Mode attribute

Note how pandas groups the data by State first and then by the cities under each state.

7. Display the complete summary statistics of sales by every city in each state – all with two lines of code – by using the following command:

byStateCity = df.groupby(['State','City'])

byStateCity.describe()['Sales']

The output is as follows:
Figure 4.20: Checking the summary statistics of sales

Detecting Outliers and Handling Missing Values
Outlier detection and handling missing values fall under the subtle art of data quality checking. A modeling or data mining process is fundamentally a complex series of computations whose output quality largely depends on the quality and consistency of the input data being fed in. The responsibility of maintaining and gatekeeping that quality often falls on the shoulders of a data wrangling team.

Apart from the obvious issue of poor-quality data, missing data can sometimes wreak havoc with the machine learning (ML) model downstream. A few ML models, like Bayesian learning, are inherently robust to outliers and missing data, but common techniques like decision trees and random forests have an issue with missing data because the fundamental splitting strategy employed by these techniques depends on an individual piece of data and not a cluster. Therefore, it is almost always imperative to impute missing data before handing it over to such an ML model.

Outlier detection is a subtle art. Often, there is no universally agreed definition of an outlier. In a statistical sense, a data point that falls outside a certain range may often be classified as an outlier, but to apply that definition, you need to have a fairly high degree of certainty about the assumptions regarding the nature and parameters of the inherent statistical distribution of the data. It takes a lot of data to build that statistical certainty, and even after that, an outlier may not be just unimportant noise but a clue to something deeper. Let's take an example with some fictitious sales data from an American fast food chain restaurant. If we model the daily sales data as a time series, we observe an unusual spike in the data somewhere around mid-April:
Figure 4.21: Fictitious sales data of an American fast food chain restaurant

A good data scientist or data wrangler should develop curiosity about this data point rather than just rejecting it because it falls outside the statistical range. In the actual anecdote, the sales figure really did spike that day because of an unusual reason. So, the data was real. But just because it was real does not mean it is useful. For the final goal of building a smoothly varying time series model, this one point should not matter and should be rejected. The lesson here is that we cannot reject outliers without paying some attention to them.

Therefore, the key to handling outliers is their systematic and timely detection in an incoming stream of millions of data points, or while reading data from cloud-based storage. In this topic, we will quickly go over some basic statistical tests for detecting outliers and some basic imputation techniques for filling in missing data.

MISSING VALUES IN PANDAS
One of the most useful functions for detecting missing values is isnull. Here, we have a snapshot of a DataFrame called df_missing (sampled partially from the superstore DataFrame we are working with) with some missing values:
Figure 4.22: DataFrame with missing values

Now, if we simply run the following code, we will get a DataFrame that's the same size as the original, with boolean values of TRUE in the places where a NaN was encountered. Therefore, it is simple to test for the presence of any NaN/missing value in any row or column of the DataFrame. You just have to add up the particular row or column of this boolean DataFrame. If the result is greater than zero, then you know there are some TRUE values (because FALSE here is denoted by 0 and TRUE here is denoted by 1) and, correspondingly, some missing values. Try the following snippet:

df_missing = pd.read_excel("Sample - Superstore.xls",sheet_name="Missing")

df_missing

The output is as follows:

Figure 4.23: DataFrame with the Excel values

Use the isnull function on the DataFrame and observe the results:

df_missing.isnull()

Figure 4.24: Output highlighting the missing values

Here is an example of some very simple code to detect, count, and print out the missing values in every column of a DataFrame:

for c in df_missing.columns:
    miss = df_missing[c].isnull().sum()
    if miss>0:
        print("{} has {} missing value(s)".format(c,miss))
    else:
        print("{} has NO missing value!".format(c))

This code scans every column of the DataFrame, calls the isnull function, and sums up the returned object (a pandas Series object, in this case) to count the number of missing values. If the number of missing values is greater than zero, it prints out a message accordingly. The output looks as follows:

Figure 4.25: Output of counting the missing values

EXERCISE 52: FILLING IN THE MISSING VALUES WITH FILLNA
To handle missing values, you should first look for ways not to drop them altogether but to fill them in somehow. The fillna method is a useful function for performing this task on pandas DataFrames. The fillna method may work for string data, but not for numerical columns like sales or profits. So, we should restrict this fixed string replacement to non-numeric, text-based columns only. The pad or ffill option is used to fill forward the data, that is, to copy it from the preceding value in the series.

The mean function can be used to fill using the average of the values in a column:

1. Fill all missing values with the string FILL by using the following command:

df_missing.fillna('FILL')

The output is as follows:
Figure 4.26: Missing values replaced with FILL

2. Fill in the specified columns with the string FILL by using the following command:

df_missing[['Customer','Product']].fillna('FILL')

The output is as follows:

Figure 4.27: Specified columns replaced with FILL

Note

In all of these cases, the function works on a copy of the original DataFrame. So, if you want to make the changes permanent, you have to assign the DataFrames that are returned by these functions to the original DataFrame object.

3. Fill in the values forward using pad or ffill by using the following command:

df_missing['Sales'].fillna(method='ffill')

4. Use backfill or bfill to fill backward, that is, copy from the next value in the series:

df_missing['Sales'].fillna(method='bfill')

The output is as follows:

Figure 4.28: Using forward fill and backward fill to fill in missing data

5. You can also fill using a function of the DataFrame, such as the mean. For example, we may want to fill the missing values in Sales with the average sales amount. Here is how we can do that:

df_missing['Sales'].fillna(df_missing.mean()['Sales'])

The output is as follows:
Figure 4.29: Using the average to fill in missing data

EXERCISE 53: DROPPING MISSING VALUES WITH DROPNA
This function is used to simply drop the rows or columns that contain NaN/missing values. However, there is some choice involved.

If the axis parameter is set to zero, then rows containing missing values are dropped; if the axis parameter is set to one, then columns containing missing values are dropped. These options are useful if we don't want to drop a particular row/column when the NaN values do not exceed a certain percentage.

Two arguments that are useful for the dropna() method are as follows:

The how argument determines if a row or column is removed from a DataFrame when we have at least one NaN or all NaNs

The thresh argument requires that many non-NaN values to keep the row/column

1. To set the axis parameter to zero and drop all rows with missing values, use the following command:

df_missing.dropna(axis=0)

2. To set the axis parameter to one and drop all columns with missing values, use the following command:

df_missing.dropna(axis=1)

Figure 4.30: Dropping rows or columns to handle missing data

3. Drop the values with the axis set to one and thresh set to 10:

df_missing.dropna(axis=1,thresh=10)

The output is as follows:

Figure 4.31: DataFrame with values dropped with axis=1 and thresh=10

All of these methods work on a temporary copy. To make a permanent change, you have to set inplace=True or assign the result to the original DataFrame, that is, overwrite it.

OUTLIER DETECTION USING A SIMPLE STATISTICAL TEST
As we've already discussed, outliers in a dataset can occur due to many factors and in many ways:

Data entry errors

Experimental errors (data extraction related)

Measurement errors due to noise or instrument failure

Data processing errors (data manipulation or mutations due to coding errors)

Sampling errors (extracting or mixing data from wrong or various sources)

It is impossible to pinpoint one universal method for outlier detection. Here, we will show you some simple tricks for numeric data using standard statistical tests.

Box plots may show unusual values. Corrupt two sales values by assigning negative values, as follows:

df_sample = df[['Customer Name','State','Sales','Profit']].sample(n=50).copy()

df_sample['Sales'].iloc[5] = -1000.0

df_sample['Sales'].iloc[15] = -500.0

To plot the box plot, use the following code:

df_sample.plot.box()

plt.title("Boxplot of sales and profit", fontsize=15)

plt.xticks(fontsize=15)

plt.yticks(fontsize=15)

plt.grid(True)

The output is as follows:

Figure 4.32: Boxplot of sales and profit

We can create simple box plots to check for any unusual/nonsensical values. For example, in the preceding example, we intentionally corrupted two sales values to be negative, and they were readily caught in the box plot.

Note that profit may be negative, so those negative points are generally not suspicious. But sales cannot be negative in general, so they are detected as outliers.

We can also create a distribution of a numerical quantity and check for values that lie at the extreme end to see if they are truly part of the data or outliers. For example, if a distribution is almost normal, then any value more than 4 or 5 standard deviations away from the mean may be a suspect, as illustrated in the sketch after the following figure:

Figure 4.33: Value away from the main outliers
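
A minimal sketch of such a test on the corrupted Sales column is shown below; the cutoff of 3 standard deviations is an illustrative choice (the text suggests 4 or 5 for an almost-normal distribution):

sales = df_sample['Sales']

# Standardize the column and flag points far from the mean
z_scores = (sales - sales.mean()) / sales.std()
suspects = df_sample[np.abs(z_scores) > 3]

print(suspects)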

Concatenating, Merging, and Joining
Merging and joining tables or datasets are highly common operations in the day-to-day job of a data wrangling professional. These operations are akin to the JOIN query in SQL for relational database tables. Often, the key data is present in multiple tables, and those records need to be brought into one combined table that matches on that common key. This is an extremely common operation in any type of sales or transactional data, and therefore must be mastered by a data wrangler. The pandas library offers nice and intuitive built-in methods to perform various types of JOIN queries involving multiple DataFrame objects.
EXERCISE 54: CONCATENATION
We will start by learning about the concatenation of DataFrames along various axes (rows or columns). This is a very useful operation as it allows you to grow a DataFrame as new data comes in or as new feature columns need to be inserted into the table:

1. Sample 4 records each to create three DataFrames at random from the original sales dataset we are working with:

df_1 = df[['Customer Name','State','Sales','Profit']].sample(n=4)

df_2 = df[['Customer Name','State','Sales','Profit']].sample(n=4)

df_3 = df[['Customer Name','State','Sales','Profit']].sample(n=4)

2. Create a combined DataFrame with all the rows concatenated by using the following code:

df_cat1 = pd.concat([df_1,df_2,df_3], axis=0)

df_cat1

Figure 4.34: Concatenating DataFrames together

3. You can also try concatenating along the columns, although that does not make any practical sense for this particular example. However, pandas fills in the unavailable values with NaN for that operation:

df_cat2 = pd.concat([df_1,df_2,df_3], axis=1)

df_cat2

Figure 4.35: Output after concatenating the DataFrames

EXERCISE 55: MERGING BY A COMMON KEY
Merging by a common key is an extremely common operation for data tables, as it allows you to rationalize multiple sources of data in one master database – that is, if they have some common features/keys.

This is often the first step in building a large database for machine learning tasks, where daily incoming data may be put into separate tables. However, at the end of the day, the most recent table needs to be merged with the master data table to be fed into the backend machine learning server, which will then update the model and its prediction capacity.

Here, we will show a simple example of an inner join with Customer Name as the key:

1. One DataFrame, df_1, has shipping information associated with the customer name, and another table, df_2, has the product information tabulated. Our goal is to merge these tables into one DataFrame on the common customer name:

df_1 = df[['Ship Date','Ship Mode','Customer Name']][0:4]

df_1

The output is as follows:

Figure 4.36: Entries in table df_1

The second DataFrame is as follows:

df_2 = df[['Customer Name','Product Name','Quantity']][0:4]

df_2

The output is as follows:

Figure 4.37: Entries in table df_2

2. Join these two tables with an inner join by using the following command:

pd.merge(df_1,df_2,on='Customer Name',how='inner')

The output is as follows:

Figure 4.38: Inner join on table df_1 and table df_2

3. Drop the duplicates by using the following command:

pd.merge(df_1,df_2,on='Customer Name',how='inner').drop_duplicates()

The output is as follows:

Figure 4.39: Inner join on table df_1 and table df_2 after dropping the duplicates

4. Extract another small table called df_3 to show the concept of an outer join:

df_3 = df[['Customer Name','Product Name','Quantity']][2:6]

df_3

The output is as follows:
Figure 4.40: Creating table df_3

5. Perform an inner join on df_1 and df_3 by using the following command:

pd.merge(df_1,df_3,on='Customer Name',how='inner').drop_duplicates()

The output is as follows:

Figure 4.41: Merging table df_1 and table df_3 and dropping duplicates

6. Perform an outer join on df_1 and df_3 by using the following command:

pd.merge(df_1,df_3,on='Customer Name',how='outer').drop_duplicates()

The output is as follows:
Figure 4.42: Outer join on table df_1 and table df_3 after dropping the duplicates

Notice how some NaN and NaT values are inserted automatically because no corresponding entries could be found for those records, as those are the entries with unique customer names from their respective tables. NaT represents a Not a Time object, as the objects in the Ship Date column are Timestamp objects.

EXERCISE 56: THE JOIN METHOD
Joining is performed based on index keys and is done by combining the columns of two potentially differently indexed DataFrames into a single one. It offers a faster way to accomplish merging by row indices. This is useful if the records in different tables are indexed differently but represent the same inherent data and you want to merge them into a single table:

1. Create the following tables with the customer name as the index by using the following command:

df_1 = df[['Customer Name','Ship Date','Ship Mode']][0:4]

df_1.set_index(['Customer Name'],inplace=True)

df_1

df_2 = df[['Customer Name','Product Name','Quantity']][2:6]

df_2.set_index(['Customer Name'],inplace=True)

df_2

The outputs are as follows:

Figure 4.43: DataFrames df_1 and df_2

2. Perform a left join on df_1 and df_2 by using the following command:

df_1.join(df_2,how='left').drop_duplicates()

The output is as follows:

Figure 4.44: Left join on table df_1 and table df_2 after dropping the duplicates

3. Perform a right join on df_1 and df_2 by using the following command:

df_1.join(df_2,how='right').drop_duplicates()

The output is as follows:
Figure 4.45: Right join on table df_1 and table df_2 after dropping the duplicates

4. Perform an inner join on df_1 and df_2 by using the following command:

df_1.join(df_2,how='inner').drop_duplicates()

The output is as follows:

Figure 4.46: Inner join on table df_1 and table df_2 after dropping the duplicates

5. Perform an outer join on df_1 and df_2 by using the following command:

df_1.join(df_2,how='outer').drop_duplicates()

The output is as follows:
Figure 4.47: Outer join on table df_1 and table df_2 after dropping the duplicates

Useful Methods of Pandas
In this topic, we will discuss some small utility functions that are offered by pandas so that we can work efficiently with DataFrames. They don't fall under any particular group of functions, so they are mentioned here under the miscellaneous category.

EXERCISE 57: RANDOMIZED SAMPLING
Sampling a random fraction of a big DataFrame is often very useful so that we can practice other methods on it and test our ideas. If you have a database table of 1 million records, then it is not computationally effective to run your test scripts on the full table.

However, you may also not want to extract only the first 100 elements, as the data may have been sorted by a particular key and you may get an uninteresting table back, which may not represent the full statistical diversity of the parent database.

In these situations, the sample method comes in super handy so that we can randomly choose a controlled fraction of the DataFrame:

1 . Specify th e nu m ber of
sam ples th at y ou r equ ir e
fr om th e DataFr am e by
u sing th e follow ing
com m and:

df.sample(n=5)
Th e ou tpu t is as follow s:

Figure 4.48: DataFrame with 5 samples

2 . Specify a definite fr action


(per centage) of data to be
sam pled by u sing th e
follow ing com m and:

df.sample(frac=0.1)

Th e ou tpu t is as follow s:
Figure 4.49: DataFrame with 0.1% data
sampled

You can also ch oose if


sam pling is done w ith
r eplacem ent, th at is,
w h eth er th e sam e r ecor d can
be ch osen m or e th an once.
Th e defau lt r eplace ch oice is
FA LSE, th at is, no r epetition,
and sam pling w ill tr y to
ch oose new elem ents only .

3 . Ch oose th e sam pling by


u sing th e follow ing
com m and:

df.sample(frac=0.1,
replace=True)

Th e ou tpu t is as follow s:
Figure 4.50: DataFrame with 0.1% data sampled and
repetition enabled

THE VALUE_COUNTS METHOD

We discussed the unique method before, which finds and counts the unique records from a DataFrame. Another useful function in a similar vein is value_counts. This function returns an object containing counts of unique values. In the object that is returned, the first element is the most frequently occurring value, and the elements are arranged in descending order of frequency.

Let's consider a practical application of this method to illustrate its utility. Suppose your manager asks you to list the top 10 customers from the big sales database that you have. So, the business question is: which 10 customers' names occur most frequently in the sales table? You can achieve the same with an SQL query if the data is in an RDBMS, but in pandas, this can be done by using one simple function:

df['Customer Name'].value_counts()[:10]

The output is as follows:

Figure 4.51: List of top 10 customers

The value_counts method returns a series of the counts of all unique customer names, sorted by the frequency of the count. By asking for only the first 10 elements of that list, this code returns a series of the 10 most frequently occurring customer names.
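As a related option (not used in the book's example, but a standard value_counts argument), you can ask for relative frequencies instead of raw counts by passing normalize=True. A minimal sketch, assuming the same df DataFrame:

# Proportion of all sales records attributable to each of the top 10 customers
df['Customer Name'].value_counts(normalize=True)[:10]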

PIVOT TABLE FUNCTIONALITY

Similar to groupby, pandas also offers pivot table functionality, which works the same as a pivot table in spreadsheet programs like MS Excel. For example, in this sales database, suppose you want to know the average sales, profit, and quantity sold, by Region and State (two levels of index).

We can extract this information by using one simple piece of code (we sample 100 records first to keep the computation fast and then apply the code):

df_sample = df.sample(n=100)

df_sample.pivot_table(values=['Sales','Quantity','Profit'],index=['Region','State'],aggfunc='mean')

The output is as follows (note that your specific output may be different due to random sampling):

Figure 4.52: Sample of 100 records
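As a brief extension (not part of the book's example), pivot_table can also compute several aggregations at once by passing a list to aggfunc. A minimal sketch, assuming the same df_sample:

# Mean and sum of Sales per Region, shown side by side
df_sample.pivot_table(values=['Sales'], index=['Region'], aggfunc=['mean','sum'])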

EXERCISE 58: SORTING BY COLUMN VALUES – THE SORT_VALUES METHOD

Sorting a table by a particular column is one of the most frequently used operations in the daily work of an analyst. Not surprisingly, pandas provides a simple and intuitive method for sorting called the sort_values method:

1. Take a random sample of 15 records and then show how we can sort by the Sales column and then by both the State and Sales columns together:

df_sample=df[['Customer Name','State','Sales','Quantity']].sample(n=15)

df_sample

The output is as follows:

Figure 4.53: Sample of 15 records

2. Sort the values with respect to Sales by using the following command:

df_sample.sort_values(by='Sales')

The output is as follows:

Figure 4.54: DataFrame with the Sales values sorted

3. Sort the values with respect to State and Sales:

df_sample.sort_values(by=['State','Sales'])

The output is as follows:

Figure 4.55: DataFrame sorted with respect to State and Sales
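Note that sort_values sorts in ascending order by default. A minimal sketch, assuming the same df_sample, of sorting in descending order instead:

# Largest sales first
df_sample.sort_values(by='Sales', ascending=False)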

EXERCISE 59: FLEXIBILITY FOR USER-DEFINED FUNCTIONS WITH THE APPLY METHOD

The pandas library provides great flexibility to work with user-defined functions of arbitrary complexity through the apply method. Much like the native Python apply function, this method accepts a user-defined function and additional arguments, and returns a new column after applying the function to a particular column element-wise.

As an example, suppose we want to create a column of categorical features like high/medium/low based on the sales price column. Note that this is a conversion from a numeric value to a categorical factor (string) based on certain conditions (threshold values of sales):

1. Create a user-defined function, as follows:

def categorize_sales(price):
    if price < 50:
        return "Low"
    elif price < 200:
        return "Medium"
    else:
        return "High"

2. Sample 100 records randomly from the database:

df_sample=df[['Customer Name','State','Sales']].sample(n=100)

df_sample.head(10)

The output is as follows:

Figure 4.56: 100 sample records from the database

3. Use the apply method to apply the categorization function to the Sales column:

Note
We need to create a new column to store the category string values that are returned by the function.

df_sample['Sales Price Category']=df_sample['Sales'].apply(categorize_sales)

df_sample.head(10)

The output is as follows:

Figure 4.57: DataFrame with 10 rows after using the apply function on the Sales column

4. The apply method also works with the built-in native Python functions. For practice, let's create another column for storing the length of the name of the customer. We can do that using the familiar len function:

df_sample['Customer Name Length']=df_sample['Customer Name'].apply(len)

df_sample.head(10)

The output is as follows:

Figure 4.58: DataFrame with a new column

5. Instead of writing out a separate function, we can even insert lambda expressions directly into the apply method for short functions. For example, let's say we are promoting our product and want to show the discounted sales price if the original price is > $200. We can do this using a lambda function and the apply method:

df_sample['Discounted Price']=df_sample['Sales'].apply(lambda x: 0.85*x if x>200 else x)

df_sample.head(10)

The output is as follows:

Figure 4.59: Lambda function

Note
The lambda function contains a conditional, and a discount is applied to those records where the original sales price is > $200.
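The apply method can also operate row-wise on an entire DataFrame by passing axis=1, in which case the function receives one row at a time. A minimal, hypothetical sketch (the Discount Flag column is not part of the book's example), assuming the df_sample built in the previous steps:

# Flag rows where the discounted price differs from the original sales price
df_sample['Discount Flag']=df_sample.apply(lambda row: row['Sales'] != row['Discounted Price'], axis=1)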

ACTIVITY 6: WORKING WITH THE ADULT INCOME DATASET (UCI)

In this activity, you will work with the Adult Income Dataset from the UCI machine learning portal. The Adult Income dataset has been used in many machine learning papers that address classification problems. You will read the data from a CSV file into a pandas DataFrame and practice the advanced data wrangling you learned about in this chapter.

The aim of this activity is to practice various advanced pandas DataFrame operations, for example, subsetting, applying user-defined functions, summary statistics, visualizations, boolean indexing, groupby, and outlier detection, on a real-life dataset. We have the data downloaded as a CSV file on the disk for your ease. However, it is recommended to practice data downloading on your own so that you are familiar with the process.

Here is the URL for the dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/adult/.

Here is the URL for the description of the dataset and the variables: https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names.

These are the steps that will help you solve this activity:

1. Load the necessary libraries.

2. Read the adult income dataset from the following URL: https://github.com/TrainingByPackt/Data-Wrangling-with-Python/blob/master/Chapter04/Activity06/.

3. Create a script that will read a text file line by line.

4. Add a name of Income for the response variable to the dataset.

5. Find the missing values.

6. Create a DataFrame with only age, education, and occupation by using subsetting.

7. Plot a histogram of age with a bin size of 20.

8. Create a function to strip the whitespace characters.

9. Use the apply method to apply this function to all the columns with string values, create a new column, copy the values from this new column to the old column, and drop the new column.

10. Find the number of people who are aged between 30 and 50.

11. Group the records based on age and education to find how the mean age is distributed.

12. Group by occupation and show the summary statistics of age. Find which profession has the oldest workers on average and which profession has the largest share of its workforce above the 75th percentile.

13. Use subset and groupby to find outliers.

14. Plot the values on a bar chart.

15. Merge the data using common keys.

Note

The solution for this activity can be found on page 297.

Summary

In this chapter, we dived deep into the pandas library to learn advanced data wrangling techniques. We started with some advanced subsetting and filtering of DataFrames and rounded this up by learning about boolean indexing and conditional selection of a subset of data. We also covered how to set and reset the index of a DataFrame, especially while initializing it.

Next, we learned about a particular topic that has a deep connection with traditional relational database systems – the groupby method. Then, we dived deep into an important skill for data wrangling – checking for and handling missing data. We showed you how pandas helps in handling missing data using various imputation techniques. We also discussed methods for dropping missing values. Furthermore, methods and usage examples of concatenation and merging of DataFrame objects were shown. We saw the join method and how it compares to a similar operation in SQL.

Lastly, miscellaneous useful methods on DataFrames, such as randomized sampling, unique, value_counts, sort_values, and pivot table functionality, were covered. We also showed an example of running an arbitrary user-defined function on a DataFrame using the apply method.

After learning about the basic and advanced data wrangling techniques with the NumPy and pandas libraries, the natural question of data acquisition arises. In the next chapter, we will show you how to work with a wide variety of data sources, that is, you will learn how to read data in tabular format into pandas from different sources.
Chapter 5

Getting Comfortable with Different Kinds of Data Sources

Learning Objectives

By the end of this chapter, you will be able to:

Read CSV, Excel, and JSON files into pandas DataFrames

Read PDF documents and HTML tables into pandas DataFrames

Perform basic web scraping using powerful yet easy-to-use libraries such as Beautiful Soup

Extract structured and textual information from portals

In this chapter, you will be exposed to real-life data wrangling techniques, as applied to web scraping.

Introduction

So far in this book, we have focused on learning about pandas DataFrame objects as the main data structure for the application of wrangling techniques. Now, we will learn about various techniques by which we can read data into a DataFrame from external sources. Some of those sources could be text-based (CSV, HTML, JSON, and so on), whereas some others could be binary (Excel, PDF, and so on), that is, not in ASCII format. In this chapter, we will also learn how to deal with data that is present in web pages or HTML documents. This holds very high importance in the work of a data practitioner.

Note
Since we have gone through detailed examples of basic operations with NumPy and pandas, in this chapter, we will often skip trivial code snippets such as viewing a table, selecting a column, and plotting. Instead, we will focus on showing code examples for the new topics we aim to learn about here.

Reading Data from Different Text-Based (and Non-Text-Based) Sources

One of the most valued and widely used skills of a data wrangling professional is the ability to extract and read data from a diverse array of sources into a structured format. Modern analytics pipelines depend on their ability to scan and absorb a variety of data sources to build and analyze a pattern-rich model. Such a feature-rich, multi-dimensional model will have high predictive and generalization accuracy, and will be valued by stakeholders and end users alike for any data-driven product.

In the first topic of this chapter, we will go through various data sources and how they can be imported into pandas DataFrames, thus imbuing wrangling professionals with extremely valuable data ingestion knowledge.

DATA FILES PROVIDED WITH THIS CHAPTER

Because this topic is about reading from various data sources, we will use small files of various types in the following exercises. All of the data files are provided along with the Jupyter notebook in the code repository.

LIBRARIES TO INSTALL FOR THIS CHAPTER

Because this chapter deals with reading various file formats, we need the support of additional libraries and software platforms to accomplish our goals.

Execute the following commands in your Jupyter notebook cells (don't forget the ! before each line of code) to install the necessary libraries:

!apt-get update

!apt-get install -y default-jdk

!pip install tabula-py xlrd lxml

EXERCISE 60: READING DATA FROM A CSV FILE WHERE HEADERS ARE MISSING

The pandas library provides a simple, direct method called read_csv to read data in a tabular format from a comma-separated text file, or CSV. This is particularly useful because CSV is a lightweight yet extremely handy data exchange format for many applications, including such domains as machine-generated data. It is not a proprietary format and is therefore universally used by a variety of data-generating sources.

At times, headers may be missing from a CSV file and you may have to add proper headers/column names of your own. Let's have a look at how this can be done:

1. Read the example CSV file (with a proper header) using the following code and examine the resulting DataFrame, as follows:

import numpy as np

import pandas as pd

df1 = pd.read_csv("CSV_EX_1.csv")

df1

The output is as follows:

Figure 5.1: Output of the example CSV file

2. Read a .csv file with no header using a pandas DataFrame:

df2 = pd.read_csv("CSV_EX_2.csv")

df2

The output is as follows:

Figure 5.2: Output of the .csv being read using a DataFrame

Certainly, the top data row has been mistakenly read as the column header. You can specify header=None to avoid this.

3. Read the .csv file by setting header to None, as follows:

df2 = pd.read_csv("CSV_EX_2.csv",header=None)

df2

However, without any header information, you will get back the following output. The default headers will just be some default numeric indices starting from 0:

Figure 5.3: CSV file with a numeric column header

This may be fine for data analysis purposes, but if you want the DataFrame to truly reflect the proper headers, then you will have to add them using the names argument.

4. Add the names argument to get the correct headers:

df2 = pd.read_csv("CSV_EX_2.csv",header=None, names=['Bedroom','Sq.ft','Locality','Price($)'])

df2

Finally, you will get a DataFrame that looks as follows:

Figure 5.4: CSV file with the correct column headers

EXERCISE 61: READING FROM A CSV FILE WHERE DELIMITERS ARE NOT COMMAS

Although CSV stands for comma-separated values, it is fairly common to encounter raw data files where the separator/delimiter is a character other than a comma:

1. Read a .csv file using pandas DataFrames:

df3 = pd.read_csv("CSV_EX_3.csv")

df3

2. The output will be as follows:

Figure 5.5: A DataFrame that has a semicolon as a separator

3. Clearly, the ; separator was not expected, and the reading is flawed. A simple workaround is to specify the separator/delimiter explicitly in the read function:

df3 = pd.read_csv("CSV_EX_3.csv",sep=';')

df3

The output is as follows:

Figure 5.6: Semicolons removed from the DataFrame

EXERCISE 62: BYPASSING THE HEADERS OF A CSV FILE

If your CSV file already comes with headers but you want to bypass them and put in your own, you have to specifically set header=0 to make it happen. If you only set the names variable to your header list, unexpected things can happen:

1. Add names to a .csv file that has headers, as follows:

df4 = pd.read_csv("CSV_EX_1.csv",names=['A','B','C','D'])

df4

The output is as follows:

Figure 5.7: CSV file with headers overlapped

2. To avoid this, set header to zero and provide a names list:

df4 = pd.read_csv("CSV_EX_1.csv",header=0,names=['A','B','C','D'])

df4

The output is as follows:

Figure 5.8: CSV file with defined headers


EXERCISE 63: SKIPPING INITIAL ROWS AND FOOTERS WHEN READING A CSV FILE

Skipping initial rows is a widely useful technique because, most of the time, the first few rows of a CSV data file are metadata about the data source or similar information, which should not be read into the table:

Figure 5.9: Contents of the CSV file

Note
The first two lines in the CSV file are irrelevant data.

1. Read the CSV file and examine the results:

df5 = pd.read_csv("CSV_EX_skiprows.csv")

df5

The output is as follows:

Figure 5.10: DataFrame with an unexpected error

2. Skip the first two rows and read the file:

df5 = pd.read_csv("CSV_EX_skiprows.csv",skiprows=2)

df5

The output is as follows:

Figure 5.11: Expected DataFrame after skipping two rows

3. Similar to skipping the initial rows, it may be necessary to skip the footer of a file. For example, we do not want to read the data at the end of the following file:

Figure 5.12: Contents of the CSV file

We have to use the skipfooter and engine='python' options to enable this. There are two engines for these CSV reader functions – based on C or Python, of which only the Python engine supports the skipfooter option.

4. Use the skipfooter option:

df6 = pd.read_csv("CSV_EX_skipfooter.csv",skiprows=2, skipfooter=1,engine='python')

df6

The output is as follows:

Figure 5.13: DataFrame without a footer


READING ONLY THE FIRST N ROWS (ESPECIALLY USEFUL FOR LARGE FILES)

In many situations, we may not want to read a whole data file but only the first few rows. This is particularly useful for extremely large data files, where we may just want to read the first couple of hundred rows to check an initial pattern and then decide to read the whole data later on. Reading the entire file can take a long time and slow down the entire data wrangling pipeline.

A simple option, called nrows, in the read_csv function enables us to do just that:

df7 = pd.read_csv("CSV_EX_1.csv",nrows=2)

df7

The output is as follows:

Figure 5.14: DataFrame with the first few rows of the CSV file

EXERCISE 64: COMBINING SKIPROWS AND NROWS TO READ DATA IN SMALL CHUNKS

Continuing our discussion about reading a very large data file, we can cleverly combine skiprows and nrows to read such a large file in smaller chunks of pre-determined sizes. The following code demonstrates just that:

1. Create a list where the DataFrames will be stored:

list_of_dataframe = []

2. Store the number of rows to be read into a variable:

rows_in_a_chunk = 10

3. Create a variable to store the number of chunks to be read:

num_chunks = 5

4. Create a dummy DataFrame to get the column names:

df_dummy = pd.read_csv("Boston_housing.csv",nrows=2)

colnames = df_dummy.columns

5. Loop over the CSV file to read only a fixed number of rows at a time:

for i in range(0,num_chunks*rows_in_a_chunk,rows_in_a_chunk):
    df = pd.read_csv("Boston_housing.csv",header=0,skiprows=i,nrows=rows_in_a_chunk,names=colnames)
    list_of_dataframe.append(df)

Note how the iterator variable is set up inside the range function to break the reading into chunks. Say the number of chunks is 5 and the number of rows per chunk is 10. Then, the iterator will have a range of (0, 5*10, 10), where the final 10 is the step size, that is, it will iterate over the starting indices 0, 10, 20, 30, and 40.
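A minimal sketch (assuming the variables defined in the steps above) that verifies the chunk indices and shows one way the chunks could later be recombined, if that is ever needed:

# The starting row index of each chunk
print(list(range(0, num_chunks*rows_in_a_chunk, rows_in_a_chunk)))  # [0, 10, 20, 30, 40]

# Recombine the chunks into a single DataFrame
full_df = pd.concat(list_of_dataframe, ignore_index=True)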

SETTING THE SKIP_BLANK_LINES OPTION

By default, read_csv ignores blank lines. But sometimes, you may want to read them in as NaN so that you can count how many such blank entries were present in the raw data file. In some situations, this is an indicator of the default data streaming quality and consistency. For this, you have to disable the skip_blank_lines option:

df9 = pd.read_csv("CSV_EX_blankline.csv",skip_blank_lines=False)

df9

The output is as follows:

Figure 5.15: DataFrame that has blank rows from a .csv file

READ CSV FROM A ZIP FILE

This is an awesome feature of pandas, in that it allows you to read directly from a compressed file such as .zip, .gz, .bz2, or .xz. The only requirement is that the intended data file (CSV) should be the only file inside the compressed file.

In this example, we compressed the example CSV file with the 7-Zip program and read from it directly using the read_csv method:

df10 = pd.read_csv('CSV_EX_1.zip')

df10

The output is as follows:

Figure 5.16: DataFrame of a compressed CSV

READING FROM AN EXCEL FILE USING SHEET_NAME AND HANDLING A DISTINCT SHEET_NAME

Next, we will turn our attention to Microsoft Excel files. It turns out that most of the options and methods we learned about in the previous exercises with CSV files apply directly to the reading of Excel files too. Therefore, we will not repeat them here. Instead, we will focus on their differences. An Excel file can consist of multiple worksheets, and we can read a specific sheet by passing in a particular argument, that is, sheet_name.

For example, in the associated data file, Housing_data.xlsx, we have three tabs, and the following code reads them one by one into three separate DataFrames:

df11_1 = pd.read_excel("Housing_data.xlsx",sheet_name='Data_Tab_1')

df11_2 = pd.read_excel("Housing_data.xlsx",sheet_name='Data_Tab_2')

df11_3 = pd.read_excel("Housing_data.xlsx",sheet_name='Data_Tab_3')

If the Excel file has multiple distinct sheets but the sheet_name argument is set to None, then an ordered dictionary will be returned by the read_excel function. Thereafter, we can simply iterate over that dictionary or its keys to retrieve the individual DataFrames.

Let's consider the following example:

dict_df = pd.read_excel("Housing_data.xlsx",sheet_name=None)

dict_df.keys()

The output is as follows:

odict_keys(['Data_Tab_1', 'Data_Tab_2', 'Data_Tab_3'])
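As a brief illustration (a minimal sketch, assuming the dict_df dictionary returned above), iterating over the dictionary gives access to each sheet's DataFrame in turn:

# Print the name and shape of each sheet's DataFrame
for sheet_name, sheet_df in dict_df.items():
    print(sheet_name, sheet_df.shape)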

EXERCISE 65: READING A GENERAL DELIMITED TEXT FILE

General text files can be read as easily as CSV files. However, you have to pass on the proper separator if it is anything other than a whitespace or a tab:

1. A comma-separated file, saved with the .txt extension, will result in the following DataFrame if read without explicitly setting the separator:

df13 = pd.read_table("Table_EX_1.txt")

df13

The output is as follows:

Figure 5.17: DataFrame from a comma-separated .txt file

2. In this case, we have to set the separator explicitly, as follows:

df13 = pd.read_table("Table_EX_1.txt",sep=',')

df13

The output is as follows:

Figure 5.18: DataFrame read using a comma separator

READING HTML TABLES DIRECTLY FROM A URL

The pandas library allows us to read HTML tables directly from a URL. This means that the library already has some kind of built-in HTML parser that processes the HTML content of a given page and tries to extract the various tables in the page.

Note
The read_html method returns a list of DataFrames (even if the page has a single table) and you have to extract the relevant tables from the list.

Consider the following example:

url = 'http://www.fdic.gov/bank/individual/failed/banklist.html'

list_of_df = pd.read_html(url)

df14 = list_of_df[0]

df14.head()

The results are shown in the following DataFrame:

Figure 5.19: Results of reading HTML tables

EXERCISE 66: FURTHER WRANGLING TO GET THE DESIRED DATA

As discussed in the preceding section, this HTML-reading function almost always returns more than one table for a given HTML page, and we have to further parse through the list to extract the particular table we are interested in:

1. For example, if we want to get the table of the 2016 Summer Olympics medal tally (by nation), we can easily search for a page on Wikipedia that we can pass on to pandas. We can do this by using the following command:

list_of_df = pd.read_html("https://en.wikipedia.org/wiki/2016_Summer_Olympics_medal_table",header=0)

2. If we check the length of the list returned, we will see that it is 6:

len(list_of_df)

The output is as follows:

6

3. To look for the table, we can run a simple loop:

for t in list_of_df:
    print(t.shape)

The output is as follows:

Figure 5.20: Shape of the tables

4. It looks like the second element in this list is the table we are looking for:

df15=list_of_df[1]

df15.head()

5. The output is as follows:

Figure 5.21: Output of the data in the second table


EXERCISE 67: READING FROM A JSON FILE

Over the last 15 years, JSON has become a ubiquitous choice for data exchange on the web. Today, it is the format of choice for almost every publicly available web API, and it is frequently used for private web APIs as well. It is a schema-less, text-based representation of structured data that is based on key-value pairs and ordered lists.

The pandas library provides excellent support for reading data from a JSON file directly into a DataFrame. To practice with this chapter, we have included a file called movies.json. This file contains the cast, genre, title, and year (of release) information for almost all major movies since 1900:

1. Extract the cast list for the 2012 Avengers movie (from Marvel Comics). First, read the JSON file and examine the resulting DataFrame:

df16 = pd.read_json("movies.json")

df16.head()

The output is as follows:

Figure 5.22: DataFrame displaying the Avengers movie cast

2. To look for the cast where the title is "The Avengers" and the year is 2012, we can use filtering:

cast_of_avengers=df16[(df16['title']=="The Avengers") & (df16['year']==2012)]['cast']

print(list(cast_of_avengers))

The output will be as follows:

[['Robert Downey, Jr.', 'Chris Evans', 'Mark Ruffalo', 'Chris Hemsworth', 'Scarlett Johansson', 'Jeremy Renner', 'Tom Hiddleston', 'Clark Gregg', 'Cobie Smulders', 'Stellan Skarsgård', 'Samuel L. Jackson']]

READING A STATA FILE

The pandas library provides a direct reading function for Stata files, too. Stata is a popular statistical modeling platform that's used in many governmental and research organizations, especially by economists and social scientists.

The simple code to read in a Stata file (.dta format) is as follows:

df17 = pd.read_stata("wu-data.dta")

EXERCISE 68: READING TABULAR DATA FROM A PDF FILE

Among the various types of data sources, the PDF format is probably the most difficult to parse in general. While there are some popular packages in Python for working with PDF files for general page formatting, the best library to use for table extraction from PDF files is tabula-py.

From the GitHub page of this package, tabula-py is a simple Python wrapper of tabula-java, which can read tables from a PDF. You can read tables from PDFs and convert them into pandas DataFrames. The tabula-py library also enables you to convert a PDF file into a CSV/TSV/JSON file.
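As a side note, the conversion mentioned above can be done with tabula's convert_into function. A hedged sketch (the output filename here is made up for illustration):

from tabula import convert_into

# Write the table(s) found on page 1 of the PDF straight to a CSV file
convert_into("Housing_data.pdf", "Housing_data_page1.csv", output_format="csv", pages=[1])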

You will need the following packages installed on your system before you can run this, but they are free and easy to install:

urllib3

pandas

pytest

flake8

distro

pathlib

1. Find the PDF file at the following link: https://github.com/TrainingByPackt/Data-Wrangling-with-Python/blob/master/Chapter05/Exercise60-68/Housing_data.xlsx. The following code retrieves the tables from two pages and joins them to make one table:

from tabula import read_pdf

df18_1 = read_pdf('Housing_data.pdf',pages=[1],pandas_options={'header':None})

df18_1

The output is as follows:

Figure 5.23: DataFrame with a table derived by merging a table flowing over two pages in a PDF

2. Retrieve the table from another page of the same PDF by using the following command:

df18_2 = read_pdf('Housing_data.pdf',pages=[2],pandas_options={'header':None})

df18_2

The output is as follows:

Figure 5.24: DataFrame displaying a table from another page

3. To concatenate the tables that were derived from the first two steps, execute the following code:

df18=pd.concat([df18_1,df18_2],axis=1)

df18

The output is as follows:

Figure 5.25: DataFrame derived by concatenating two tables

4. With PDF extraction, most of the time, headers will be difficult to extract automatically. You have to pass on the list of headers with the names argument in the read_pdf function as a pandas_options entry, as follows:

names=['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT','PRICE']

df18_1 = read_pdf('Housing_data.pdf',pages=[1],pandas_options={'header':None,'names':names[:10]})

df18_2 = read_pdf('Housing_data.pdf',pages=[2],pandas_options={'header':None,'names':names[10:]})

df18=pd.concat([df18_1,df18_2],axis=1)

df18

The output is as follows:

Figure 5.26: DataFrame with correct column headers for PDF data

We will have a full activity on reading tables from a PDF report and processing them at the end of this chapter.

Introduction to Beautiful Soup 4 and Web Page Parsing

The ability to read and understand web pages is of paramount interest for a person collecting and formatting data. For example, consider the task of gathering data about movies and then formatting it for a downstream system. Data about movies is best obtained from websites such as IMDB, and that data does not come pre-packaged in nice forms (CSV, JSON, and so on), so you need to know how to download and read web pages.

Furthermore, you also need to be equipped with knowledge of the structure of a web page so that you can design a system that can search for (query) a particular piece of information from a whole web page and get its value. This involves understanding the grammar of markup languages and being able to write something that can parse them. Doing this, and keeping all the edge cases in mind, for something like HTML is already incredibly complex, and if you extend the scope of the bespoke markup language to include XML as well, then it becomes full-time work for a team of people.

Thankfully, we are using Python, and Python has a very mature and stable library to do all of the complicated jobs for us. This library is called BeautifulSoup (it is, at present, in version 4, and thus we will call it bs4 in short from now on). bs4 is a library for getting data out of HTML and XML documents, and it gives you a nice, normalized, idiomatic way of navigating and querying a document. It does not include a parser of its own, but it supports different ones.

STRUCTURE OF HTML

Before we jump into bs4 and start working with it, we need to examine the structure of an HTML document. HyperText Markup Language is a structured way of telling web browsers about the organization of a web page, meaning which kinds of elements (text, image, video, and so on) come from where, in which place inside the page they should appear, what they look like, what they contain, and how they will behave with user input. HTML5 is the latest version of HTML. An HTML document can be viewed as a tree, as we can see from the following diagram:

Figure 5.27: HTML structure

Each node of the tree represents one element in the document. An element is anything that starts with < and ends with >. For example, <html>, <head>, <p>, <br>, <img>, and so on are various HTML elements. Some elements have a start and an end tag, where the end tag begins with "</" and has the same name as the start tag, such as <p> and </p>, and they can contain an arbitrary number of elements of other types in them. Some elements do not have an ending part, such as the <br /> element, and they cannot contain anything within them.

The only other thing that we need to know about an element at this point is the fact that elements can have attributes, which are there to modify the default behavior of an element. An <a> element requires an href attribute to tell the browser which website it should navigate to when that particular <a> is clicked, like this: <a href="http://cnn.com">The CNN news channel</a>, which will take you to cnn.com when clicked:

Figure 5.28: CNN news channel hyperlink
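As a quick, hedged illustration (the snippet below is a made-up one-line document, not the book's test.html file), here is how such an element and its attribute can be accessed once a document is parsed with bs4:

from bs4 import BeautifulSoup

html_doc = '<a href="http://cnn.com">The CNN news channel</a>'
soup = BeautifulSoup(html_doc, "html.parser")

# Access the first <a> element, its href attribute, and its text
print(soup.a['href'])   # http://cnn.com
print(soup.a.text)      # The CNN news channel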

So, when you are at a particular element of the tree, you can visit all the children of that element to get their contents and attributes.

Equipped with this knowledge, let's see how we can read and query data from an HTML document.

In this topic, we will cover the reading and parsing of web pages, but we will not request them from a live website. Instead, we will read them from disk. A section on reading them from the internet will follow in a future chapter.

EXERCISE 69: READING AN HTML FILE AND EXTRACTING ITS CONTENTS USING BEAUTIFULSOUP

In this exercise, we will do the simplest thing possible. We will import the BeautifulSoup library and then use it to read an HTML document. Then, we will examine the different kinds of objects it returns. While doing the exercises for this topic, you should have the example HTML file open in a text editor all the time so that you can check for the different tags and their attributes and contents:

1. Import the bs4 library:

from bs4 import BeautifulSoup

2. Download the test HTML file, save it on your disk, and then use bs4 to read it from the disk:

with open("test.html", "r") as fd:
    soup = BeautifulSoup(fd)
    print(type(soup))

The output is as follows:

<class 'bs4.BeautifulSoup'>

You can pass a file handler directly to the constructor of the BeautifulSoup object and it will read the contents from the file that the handler is attached to. We can see that the return type is an instance of bs4.BeautifulSoup. This class holds all the methods we need to navigate through the DOM tree that the document represents.

3. Print the contents of the file in a nice way by using the prettify method from the class, like this:

print(soup.prettify())

The output is as follows:

Figure 5.29: Contents of the HTML file

The same information can also be obtained by using the soup.contents member variable. The differences are: first, it won't print anything pretty and, second, it is essentially a list.

If we look carefully at the contents of the HTML file in a separate text editor, we will see that there are many paragraph tags, or <p> tags. Let's read the content from one such <p> tag. We can do that using the simple "." access modifier, as we would for a normal member variable of a class.

4. The magic of bs4 is the fact that it gives us this excellent way to dereference tags as member variables of the BeautifulSoup class instance:

with open("test.html", "r") as fd:
    soup = BeautifulSoup(fd)
    print(soup.p)

The output is as follows:

Figure 5.30: Text from the <p> tag

As we can see, this is the content of a <p> tag.

We saw how to read a tag in the last step, but we can easily see the problem with this approach. When we look into our HTML document, we can see that we have more than one <p> tag there. How can we access all the <p> tags? It turns out that this is easy.

5. Use the find_all method to extract the content from the tags:

with open("test.html", "r") as fd:
    soup = BeautifulSoup(fd)
    all_ps = soup.find_all('p')
    print("Total number of <p> --- {}".format(len(all_ps)))

The output is as follows:

Total number of <p> --- 6

This prints 6, which is exactly the number of <p> tags in the document.

We have seen how to access all the tags of the same type. We have also seen how to get the content of the entire HTML document.

6. Now, we will see how to get the contents under a particular HTML tag, as follows:

with open("test.html", "r") as fd:
    soup = BeautifulSoup(fd)
    table = soup.table
    print(table.contents)

The output is as follows:

Figure 5.31: Content under the <table> tag

Here, we are getting the (first) table from the document and then using the same "." notation to get the contents under that tag.

We saw in the previous step that we can access the entire content under a particular tag. However, HTML is represented as a tree and we are able to traverse the children of a particular node. There are a few ways to do this.

7. The first way is by using the children generator from any bs4 instance, as follows:

with open("test.html", "r") as fd:
    soup = BeautifulSoup(fd)
    table = soup.table
    for child in table.children:
        print(child)
        print("*****")

When we execute the code, we will see something like the following:

Figure 5.32: Traversing the children of a table node

It seems that the loop has only been executed twice! Well, the problem with the children generator is that it only takes into account the immediate children of the tag. We have <tbody> under the <table>, and our whole table structure is wrapped in it. That's why it was considered a single child of the <table> tag.

We looked into how to browse the immediate children of a tag. Next, we will see how we can browse all the possible children of a tag, and not only the immediate ones.

8. To do that, we use the descendants generator from the bs4 instance, as follows:

with open("test.html", "r") as fd:
    soup = BeautifulSoup(fd)
    table = soup.table
    children = table.children
    des = table.descendants
    print(len(list(children)), len(list(des)))

The output is as follows:

9 61

The comparison print at the end of the code block shows us the difference between children and descendants. The length of the list we got from children is only 9, whereas the length of the list we got from descendants is 61.
EXERCISE 70: DATAFRAMES AND BEAUTIFULSOUP

So far, we have seen some basic ways to navigate the tags inside an HTML document using bs4. Now, we are going to go one step further and use the power of bs4 combined with the power of pandas to generate a DataFrame out of a plain HTML table. This particular knowledge is very useful for us. With the knowledge we will acquire now, it will be fairly easy for us to prepare a pandas DataFrame to perform EDA (exploratory data analysis) or modeling. We are going to show this process on a simple, small table from the test HTML file, but the exact same concept applies to any arbitrarily large table as well:

1. Import pandas and read the document, as follows:

import pandas as pd

fd = open("test.html", "r")
soup = BeautifulSoup(fd)
data = soup.findAll('tr')

print("Data is a {} and {} items long".format(type(data), len(data)))

The output is as follows:

Data is a <class 'bs4.element.ResultSet'> and 4 items long

2. Check the original table structure in the HTML source. You will see that the first row is the column headings and all of the following rows are the data. We assign two different variables for the two sections, as follows:

data_without_header = data[1:]

headers = data[0]

headers

The output is as follows:

<tr>
<th>Entry Header 1</th>
<th>Entry Header 2</th>
<th>Entry Header 3</th>
<th>Entry Header 4</th>
</tr>

Note

Keep in mind that the art of scraping an HTML page goes hand in hand with an understanding of the source HTML structure. So, whenever you want to scrape a page, the first thing you need to do is right-click on it and then use "View Source" from the browser to see the source HTML.

3. Once we have separated the two sections, we need two list comprehensions to make them ready to go into a DataFrame. For the header, this is easy:

col_headers = [th.getText() for th in headers.findAll('th')]

col_headers

The output is as follows:

['Entry Header 1', 'Entry Header 2', 'Entry Header 3', 'Entry Header 4']

4. Data preparation is a bit tricky for a pandas DataFrame. You need to have a two-dimensional list, which is a list of lists. We accomplish that in the following way:

df_data = [[td.getText() for td in tr.findAll('td')] for tr in data_without_header]

df_data

The output is as follows:

Figure 5.33: Output as a two-dimensional list

5. Invoke the pd.DataFrame method and supply the right arguments by using the following code:

df = pd.DataFrame(df_data, columns=col_headers)

df.head()

Figure 5.34: Output in tabular format with column headers

EXERCISE 71: EXPORTING A DATAFRAME AS AN EXCEL FILE

In this exercise, we will see how we can save a DataFrame as an Excel file. pandas can do this natively, but it needs the help of the openpyxl library to achieve this goal:

1. Install the openpyxl library by using the following command:

!pip install openpyxl

2. To save the DataFrame as an Excel file, use the following commands from inside the Jupyter notebook:

writer = pd.ExcelWriter('test_output.xlsx')
df.to_excel(writer, "Sheet1")
writer.save()

writer

The output is as follows:

<pandas.io.excel._XlsxWriter at 0x24feb2939b0>
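A minimal alternative sketch (equivalent to the steps above, not shown in the book) uses ExcelWriter as a context manager, which saves and closes the file automatically:

# The file is written when the with block exits
with pd.ExcelWriter('test_output.xlsx') as writer:
    df.to_excel(writer, sheet_name="Sheet1")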

EXERCISE 72: STACKING URLS FROM A DOCUMENT USING BS4

Previously (while discussing stacks), we explained how important it is to have a stack that we can push the URLs from a web page onto, so that we can pop them at a later time to follow each of them. Here, in this exercise, we will see how that works.

In the given test HTML file, links, or <a> tags, are under a <ul> tag, and each of them is contained inside a <li> tag:

1. Find all the <a> tags by using the following command:

fd = open("test.html", "r")
soup = BeautifulSoup(fd)

lis = soup.find('ul').findAll('li')

stack = []

for li in lis:
    a = li.find('a', href=True)

2. Define the stack before you start the loop. Then, inside the loop, use the append method to push the links onto the stack:

    stack.append(a['href'])

3. Print the stack:

Figure 5.35: Output of the stack
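As a brief, hedged follow-up (not part of the book's exercise), the URLs collected this way can later be popped off the stack one at a time, in last-in, first-out order:

# Process the stacked links until the stack is empty
while stack:
    url = stack.pop()
    print("Next URL to follow:", url)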

ACTIVITY 7: READING TABULAR DATA FROM A WEB PAGE AND CREATING DATAFRAMES

In this activity, you have been given a Wikipedia page where you have the GDP of all countries listed. You have been asked to create three DataFrames from the three sources mentioned in the page (https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)):

You will have to do the following:

1. Open the page in a separate Chrome/Firefox tab and use something like an Inspect Element tool to view the source HTML and understand its structure

2. Read the page using bs4

3. Find the table structure you will need to deal with (how many tables are there?)

4. Find the right table using bs4

5. Separate the source names and their corresponding data

6. Get the source names from the list of sources you have created

7. Separate the header and data from the data that you separated before for the first source only, and then create a DataFrame using that

8. Repeat the last task for the other two data sources

Note

The solution for this activity can be found on page 308.

Summary

In this topic, we looked at the structure of an HTML document. HTML documents are the cornerstone of the World Wide Web and, given the amount of data that's contained on it, we can easily infer the importance of HTML as a data source.

We learned about bs4 (Beautiful Soup 4), a Python library that gives us Pythonic ways to read and query HTML documents. We used bs4 to load an HTML document and also explored several different ways to navigate the loaded document. We also got the necessary information about the differences between all of these methods.

We looked at how we can create a pandas DataFrame from an HTML document (which contains a table). Although there are some built-in ways to do this job in pandas, they fail as soon as the target table is encoded inside a complex hierarchy of elements. So, the knowledge we gathered in this topic by transforming an HTML table into a pandas DataFrame in a step-by-step manner is invaluable.

Finally, we looked at how we can create a stack in our code, where we push all the URLs that we encounter while reading the HTML file and then use them at a later time. In the next chapter, we will discuss list comprehensions, zip, format, and outlier detection and cleaning.
Chapter 6

Learning the Hidden Secrets of Data Wrangling

Learning Objectives

By the end of this chapter, you will be able to:

Clean and handle real-life messy data

Prepare data for data analysis by formatting it in the format required by downstream systems

Identify and remove outliers from data

In this chapter, you will learn about data issues that happen in real life. You will also learn how to solve these issues.

Introduction

In this chapter, we will learn about the secret sauce behind creating a successful data wrangling pipeline. In the previous chapters, we were introduced to the basic data structures and building blocks of data wrangling, such as pandas and NumPy. In this chapter, we will look at the data handling side of data wrangling.

Imagine that you have a database of patients who have heart diseases, and, like any survey, the data is either missing, incorrect, or has outliers. Outliers are values that are abnormal and tend to be far away from the central tendency, and thus including them in your fancy machine learning model may introduce a terrible bias that we need to avoid. Often, these problems can cause a huge difference in terms of money, man-hours, and other organizational resources. It is undeniable that someone with the skills to solve these problems will prove to be an asset to an organization.

ADDITIONAL SOFTWARE REQUIRED FOR THIS SECTION

The code for this exercise depends on two additional libraries. We need to install SciPy and python-Levenshtein, and we are going to install them in the running Docker container. Be wary of this if you are not working inside the container.

To install the libraries, type the following command in the running Jupyter notebook:

!pip install scipy python-Levenshtein

Advanced List Comprehension and the zip Function

In this topic, we will take a deep dive into the heart of list comprehension. We have already seen a basic form of it, ranging from something as simple as a = [i for i in range(0, 30)] to something a bit more complex involving one conditional statement. However, as we already mentioned, list comprehension is a very powerful tool and, in this topic, we will explore the power of this amazing tool further. We will investigate another close relative of list comprehension called generators, and also work with zip and its related functions and methods. By the end of this topic, you will be confident in handling complicated logical problems.
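Before diving in, here is a minimal, hedged illustration of what zip does (the names and numbers below are made up for demonstration): it pairs up elements from two or more iterables, and, like the generator expressions discussed next, it evaluates lazily by returning an iterator:

names = ["Alice", "Bob", "Carol"]
ages = [31, 27, 45]

# Pair each name with the corresponding age
for name, age in zip(names, ages):
    print(name, age)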

INTRODUCTION TO GENERATOR EXPRESSIONS

Previously, while discussing advanced data structures, we witnessed functions such as repeat. We said that they represent a special type of function known as iterators. We also showed you how the lazy evaluation of an iterator can lead to an enormous amount of space saving and time efficiency.

Iterators are one brick in the functional programming construct that Python has to offer. Functional programming is indeed a very efficient and safe way to approach a problem. It offers various advantages over other methods, such as modularity, ease of debugging and testing, composability, formal provability (a theoretical computer science concept), and so on.

EXERCISE 73: GENERATOR EXPRESSIONS
In this exercise, we will be introduced to generator expressions, which are considered another brick of functional programming (as a matter of fact, they are inspired by the pure functional language known as Haskell). Since we have already seen some list comprehension, generator expressions will look familiar to us. However, they also offer some advantages over list comprehension:

1. Write the following code using list comprehension to generate a list of all the odd numbers between 0 and 100,000:

odd_numbers2 = [x for x in range(100000) if x % 2 != 0]

2. Use getsizeof from sys by using the following code:

from sys import getsizeof

getsizeof(odd_numbers2)

The output is as follows:

406496

We will see that it takes a good amount of memory to do this. It is also not very time efficient. How can we change that? Using something like repeat is not applicable here because we need the logic of the list comprehension. Fortunately, we can turn any list comprehension into a generator expression.

3. Write the equivalent generator expression for the aforementioned list comprehension:

odd_numbers = (x for x in range(100000) if x % 2 != 0)

Notice that the only change we made is to surround the list comprehension statement with round brackets instead of square ones. That makes it shrink to only around 100 bytes (a quick getsizeof check, shown right after this exercise, confirms this)! This turns it into a lazy evaluation and is thus more efficient.

4. Print the first few odd numbers, as follows:

for i, number in enumerate(odd_numbers):
    print(number)
    if i > 10:
        break

The output is as follows:

1

3

5

7

9

11

13

15

17

19

21

23
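
For comparison, here is a quick check on the size of the generator expression itself, using the objects created in steps 1 and 3 (a minimal sketch; the exact byte count will vary slightly between Python versions and platforms):

from sys import getsizeof

# The list stores all 50,000 elements; the generator stores only its state
print(getsizeof(odd_numbers2))  # several hundred kilobytes
print(getsizeof(odd_numbers))   # roughly 100-200 bytes, depending on the Python version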

EXERCISE 74: ONE-LINER GENERATOR EXPRESSION
In this exercise, we will use our knowledge of generator expressions to generate an expression that reads one word at a time from a list of words, removes any newline character at the end, and makes the word lowercase. This could certainly be done using an explicit for loop:

1. Create a list of words, as follows:

words = ["Hello\n", "My name", "is\n", "Bob", "How are you", "doing\n"]

2. Write the following generator expression to achieve the task:

modified_words = (word.strip().lower() for word in words)

3. Create a list comprehension to get words one by one from the generator expression and finally print the list, as follows:

final_list_of_word = [word for word in modified_words]

final_list_of_word

The output is as follows:

Figure 6.1: List comprehension of words

EXERCISE 75: EXTRACTING A LIST WITH SINGLE WORDS
If we look at the output of the previous exercise, we will notice that, due to the messy nature of the source data (which is normal in the real world), we ended up with a list where, in some cases, more than one word appears together, separated by a space. To improve this, and to get a list of single words, we will have to modify the generator expression:

1. Write the generator expression, and then write the equivalent nested for loops so that we can compare the results:

words = ["Hello\n", "My name", "is\n", "Bob", "How are you", "doing\n"]

modified_words2 = (w.strip().lower() for word in words for w in word.split(" "))

final_list_of_word = [word for word in modified_words2]

final_list_of_word

The output is as follows:

Figure 6.2: List of words from the string

2. Write an equivalent to this by using a nested for loop, as follows:

modified_words3 = []

for word in words:
    for w in word.split(" "):
        modified_words3.append(w.strip().lower())

modified_words3

The output is as follows:

Figure 6.3: List of words from the string using a nested loop

We must admit that the generator expression was not only space and time saving but also a more elegant way to write the same logic.

To remember how the nested loop in a generator expression works, keep in mind that the loops are evaluated from left to right, and the final loop variable (in our example, denoted by the single letter "w") is the one given back (thus we could call strip and lower on it).

The following diagram will help you remember the trick about nested for loops in list comprehensions or generator expressions:

Figure 6.4: Nested loops illustration

We have learned about nested for loops in generator expressions previously, but now we are going to learn about independent for loops in a generator expression. We will have two output variables from two for loops, and they must be treated as a tuple so that they don't create ambiguous grammar in Python.

Create the following two lists:

marbles = ["RED", "BLUE", "GREEN"]

counts = [1, 5, 13]

You are asked to generate all possible combinations of marbles and counts from the preceding two lists. How will you do that? Surely, using a nested for loop and the list's append method, you can accomplish the task. How about a generator expression? A more elegant and easy solution is as follows:

marble_with_count = ((m, c) for m in marbles for c in counts)

This generator expression creates a tuple in each iteration of the nested for loops. This code is equivalent to the following explicit code:

marble_with_count_as_list_2 = []

for m in marbles:
    for c in counts:
        marble_with_count_as_list_2.append((m, c))

marble_with_count_as_list_2

The output is as follows:

Figure 6.5: Appending the marbles and counts

Once again, the generator expression is easy, elegant, and efficient.
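
As a side note, the standard library's itertools.product produces the same cross product lazily, which can be handy when there are more than two lists (a small sketch, not part of the original exercise):

from itertools import product

marbles = ["RED", "BLUE", "GREEN"]
counts = [1, 5, 13]

# product returns an iterator over all (marble, count) pairs,
# equivalent to the nested generator expression above
marble_with_count_alt = list(product(marbles, counts))

marble_with_count_alt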

EXERCISE 76: THE ZIP FUNCTION
In this exercise, we will examine the zip function and compare it with the generator expression we wrote in the previous exercise. The problem with the previous generator expression is that it produces all possible combinations. For instance, if we need to relate countries with their capitals, doing so using a generator expression will be difficult. Fortunately, Python gives us a built-in function called zip for just this purpose:

1. Create the following two lists:

countries = ["India", "USA", "France", "UK"]
capitals = ["Delhi", "Washington", "Paris", "London"]

2. Generate a list of tuples, where the first element is the name of the country and the second element is the name of the capital, by using the following command (see also the short aside after this exercise):

countries_and_capitals = [t for t in zip(countries, capitals)]

3. This is not very well represented. We can use a dict, where the keys are the names of the countries and the values are the names of the capitals, by using the following command:

countries_and_capitals_as_dict = dict(zip(countries, capitals))

The output is as follows:

Figure 6.6: Dictionary with countries and capitals
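
Incidentally, the list comprehension in step 2 is only there to materialize the zip iterator; calling list on it directly gives the same result (a small aside, not one of the original steps):

countries_and_capitals = list(zip(countries, capitals))
# [('India', 'Delhi'), ('USA', 'Washington'), ('France', 'Paris'), ('UK', 'London')]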

EXERCISE 77: HANDLING MESSY DATA
As always, in real life, data is messy. So, the nice, equal-length lists of countries and capitals that we just saw are not always available. The zip function cannot be used with unequal-length lists, because zip stops working as soon as one of the lists comes to an end. To save us in such a situation, we have zip_longest in the itertools module:

1. Create two lists of unequal length, as follows:

countries = ["India", "USA", "France", "UK", "Brasil", "Japan"]

capitals = ["Delhi", "Washington", "Paris", "London"]

2. Create the final dict, putting None as the value for the countries that do not have a capital in the capitals list:

from itertools import zip_longest

countries_and_capitals_as_dict_2 = dict(zip_longest(countries, capitals))

countries_and_capitals_as_dict_2

The output is as follows:
Figure 6.7: Output using zip_longest

We should pause here for a second and think about how many lines of explicit code and difficult-to-understand if-else conditional logic we just saved by calling a single function and simply giving it the two source data lists. It is indeed amazing!

With these exercises, we are ending the first topic of this chapter. Advanced list comprehension, generator expressions, and functions such as zip and zip_longest are some very important tricks that we need to master if we want to write clean, efficient, and maintainable code. Code that does not have these three qualities is considered sub-par in the industry, and we certainly don't want to write such code.

However, we did not cover one important object here, that is, generators. Generators are a special type of function that shares behavioral traits with generator expressions. However, being functions, they have a broader scope and are much more flexible. We strongly encourage you to learn about them.
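
To give you a taste of what a generator function looks like, here is a minimal sketch (the example itself is ours, not one of the book's exercises; the yield keyword is what turns an ordinary function into a generator):

def odd_numbers_up_to(limit):
    # Values are produced lazily, one at a time, instead of building a list in memory
    n = 1
    while n < limit:
        yield n
        n += 2

gen = odd_numbers_up_to(100000)
print(next(gen))  # 1
print(next(gen))  # 3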

Data Formatting
In this topic, we will format a given dataset. The main motivations behind formatting data properly are as follows:

It helps all the downstream systems to have a single, pre-agreed form of data for each data point, thus avoiding surprises that could, in effect, break them.

It produces a human-readable report from lower-level data that is, most of the time, created for machine consumption.

It helps us find errors in data.

There are a few ways to do data formatting in Python. We will begin with the % operator.

THE % OPERATOR
Python gives us the % operator to apply basic formatting to data. To demonstrate this, we will load the data first by reading the CSV file, and then we will apply some basic formatting to it.

Load the data from the CSV file by using the following command:

from csv import DictReader

raw_data = []

with open("combinded_data.csv", "rt") as fd:
    data_rows = DictReader(fd)
    for data in data_rows:
        raw_data.append(dict(data))

Now, we have a list called raw_data that contains all the rows of the CSV file. Feel free to print it to check out what it looks like.

The output is as follows:
Figure 6.8: Raw data

We will be producing a report on this data. This report will contain one section for each data point and will report the name, age, weight, height, history of family disease, and finally the present heart condition of the person. These points must be clear and easily understandable English sentences.

We do this in the following way:

for data in raw_data:
    report_str = """%s is %s years old and is %s meter tall weighing about %s kg.\n
Has a history of family illness: %s.\n
Presently suffering from a heart disease: %s
""" % (data["Name"], data["Age"], data["Height"],
       data["Weight"], data["Disease_history"],
       data["Heart_problem"])
    print(report_str)

The output is as follows:
Figure 6.9: Raw data in a presentable format

The % operator is used in two different ways:

When used inside the quotes, it signifies what kind of data to expect there. %s stands for string, whereas %d stands for integer. If we indicate the wrong data type, it will throw an error. Thus, we can effectively use this kind of formatting as an error filter on the incoming data.

When we use the % operator outside the quotes, it basically tells Python to start replacing all the placeholders inside with the values provided outside. The short sketch below illustrates both uses.
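
A minimal sketch of both uses (our own illustrative values, not taken from the patient data):

name, age = "Alice", 30

print("%s is %d years old." % (name, age))  # Alice is 30 years old.

# Supplying a non-numeric value where %d is expected raises a TypeError,
# which is why this style of formatting can double as a crude error filter
try:
    print("%s is %d years old." % (name, "thirty"))
except TypeError as error:
    print("Bad data:", error)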

USING THE FORMAT FUNCTION
In this section, we will be looking at the exact same formatting problem, but this time we will use a more advanced approach. We will use Python's format function.

To use the format function, we do the following:

for data in raw_data:
    report_str = """{} is {} years old and is {} meter tall weighing about {} kg.\n
Has a history of family illness: {}.\n
Presently suffering from a heart disease: {}
""".format(data["Name"], data["Age"], data["Height"],
           data["Weight"], data["Disease_history"],
           data["Heart_problem"])
    print(report_str)

The output is as follows:

Figure 6.10: Data formatted using the format function of the string

Notice that we have replaced the %s with {} and, instead of the % outside the quotes, we have called the format function.
We will now see how the powerful format function can make the previous code a lot more readable and understandable. Instead of simple, blank {}, we can mention the key names inside them and then use the special Python ** operation on a dict to unpack it and give it to the format function. The format function is smart enough to figure out how to replace the key names inside the quotes with the values from the actual dict, by using the following command:

for data in raw_data:
    report_str = """{Name} is {Age} years old and is {Height} meter tall weighing about {Weight} kg.\n
Has a history of family illness: {Disease_history}.\n
Presently suffering from a heart disease: {Heart_problem}
""".format(**data)
    print(report_str)

The output is as follows:

Figure 6.11: Readable file using the ** operation

This approach is indeed much more concise and maintainable.

EXERCISE 78: DATA REPRESENTATION USING {}
The {} notation inside the quotes is powerful, and we can change our data representation significantly by using it:

1. Change a decimal number to its binary form by using the following command:

original_number = 42

print("The binary representation of 42 is - {0:b}".format(original_number))

The output is as follows:

Figure 6.12: A number in its binary representation

2. Print a string that's center oriented:

print("{:^42}".format("I am at the center"))

The output is as follows:

Figure 6.13: A string that's been center formatted

3. Print a string that's center oriented, but this time with padding on both sides:

print("{:=^42}".format("I am at the center"))

The output is as follows:

Figure 6.14: A string that's been center formatted with padding

As we've already mentioned, the format statement is a powerful one.
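
A few more format specifications that are commonly useful (a short sketch of standard Python format-spec features, not part of the original exercise):

value = 3.14159

print("{:.2f}".format(value))    # '3.14' - two decimal places
print("{:>10}".format("right"))  # right-aligned in a field of width 10
print("{:08d}".format(42))       # '00000042' - zero-padded integer
print("{:,}".format(1234567))    # '1,234,567' - thousands separator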

It is important to format dates, as dates come in various formats depending on the source of the data, and they may need several transformations inside the data wrangling pipeline.

We can use the familiar date formatting notations with format as follows:

from datetime import datetime

print("The present datetime is {:%Y-%m-%d %H:%M:%S}".format(datetime.utcnow()))

The output is as follows:

Figure 6.15: Data after being formatted

Compare it with the actual output of datetime.utcnow and you will easily see the power of this expression.

Identify and Clean Outliers
When confronted with real-world data, we often see a specific thing in a set of records: there are some data points that do not fit with the rest of the records. They have values that are too big, too small, or completely missing. These kinds of records are called outliers.

Statistically, there is a proper definition and idea of what an outlier means, and often you need deep domain expertise to understand when to call a particular record an outlier. However, in this present exercise, we will look into some basic techniques that are commonplace for flagging and filtering outliers in real-world data for day-to-day work.

EXERCISE 79: OUTLIERS IN NUMERICAL DATA
In this exercise, we will first construct a notion of an outlier based on numerical data. Imagine a cosine curve. If you remember the math for this from high school, then a cosine curve is a very smooth curve within the limits of [-1, 1]:

1. To construct a cosine curve, execute the following command:

from math import cos, pi

ys = [cos(i*(pi/4)) for i in range(50)]

2. Plot the data by using the following code:

import matplotlib.pyplot as plt

plt.plot(ys)

The output is as follows:

Figure 6.16: Cosine wave

As we can see, it is a very smooth curve, and there are no outliers. We are going to introduce some now.

3. Introduce some outliers by using the following command:

ys[4] = ys[4] + 5.0

ys[20] = ys[20] + 8.0

4. Plot the curve:

plt.plot(ys)

Figure 6.17: Wave with outliers

We can see that we have successfully introduced two values into the curve, which broke the smoothness and hence can be considered outliers.

A good way to detect whether our dataset has an outlier is to create a box plot. A box plot is a way of plotting numerical data based on its central tendency and some buckets (in reality, we call them quartiles). In a box plot, the outliers are usually drawn as separate points. The matplotlib library helps draw box plots out of a series of numerical data, which isn't hard at all. This is how we do it:

plt.boxplot(ys)

Once you execute the preceding code, you will be able to see that there is a nice box plot where the two outliers that we created are clearly shown, just like in the following diagram:

Figure 6.18: Boxplot with outliers


Z-SCORE
A z-score is a measure on a set of data that gives you a value for each data point indicating how far that data point lies from the mean of the dataset, in terms of its standard deviation. We can use the z-score to numerically detect outliers in a set of data. Normally, any data point with a z-score greater than +3 or less than -3 is considered an outlier. We can use this concept, with a bit of help from the excellent SciPy and pandas libraries, to filter out the outliers.

Use SciPy to calculate the z-score by using the following command:

from scipy import stats

cos_arr_z_score = stats.zscore(ys)

cos_arr_z_score

The output is as follows:
Figure 6.19: The z-score values
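
Under the hood, the z-score of each point is simply its distance from the mean divided by the standard deviation. Here is a minimal sketch with NumPy (equivalent to what stats.zscore computes, up to the degrees-of-freedom convention):

import numpy as np

ys_arr = np.array(ys)

# z-score = (value - mean) / standard deviation, computed element-wise
manual_z_score = (ys_arr - ys_arr.mean()) / ys_arr.std()

manual_z_score[:5]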

EXERCISE 80: THE Z-SCORE VALUE TO REMOVE OUTLIERS
In this exercise, we will discuss how to get rid of outliers in a set of data. In the last exercise, we calculated the z-score of each data point. In this exercise, we will use that to remove outliers from our data:

1. Import pandas and create a DataFrame:

import pandas as pd

df_original = pd.DataFrame(ys)

2. Keep only the rows with a z-score less than 3, that is, drop the outliers:

cos_arr_without_outliers = df_original[(cos_arr_z_score < 3)]

3. Use the print function to print the new and old shapes:

print(cos_arr_without_outliers.shape)

print(df_original.shape)

From the two prints ((48, 1) and (50, 1)), it is clear that the derived DataFrame has two rows less. These are our outliers. If we plot the cos_arr_without_outliers DataFrame, then we will see the following output:

Figure 6.20: Cosine wave without outliers

As expected, we got back the smooth curve and got rid of the outliers.

Detecting and getting rid of outliers is an involved and critical process in any data wrangling pipeline. It needs deep domain knowledge, expertise in descriptive statistics, mastery of the programming language (and all the useful libraries), and a lot of caution. We recommend being very careful when performing this operation on a dataset.
EXERCISE 81: FUZZY MATCHING OF STRINGS
In this exercise, we will look into a slightly different problem that, at first glance, may look like an outlier. However, upon careful examination, we will see that it is indeed not, and we will learn about a useful concept that is sometimes referred to as fuzzy matching of strings.

Levenshtein distance is an advanced concept. We can think of it as the minimum number of single-character edits that are needed to convert one string into another. When two strings are identical, the distance between them is 0 – the bigger the difference, the higher the number. We can consider a threshold of distance under which we will consider two strings to be the same. Thus, we can not only rectify human error but also spread a safety net so that we don't pass all the candidates.

Levenshtein distance calculation is an involved process, and we are not going to implement it from scratch here. Thankfully, like a lot of other things, there is a library available for us to do this. It is called python-Levenshtein:

1. Create the load data of a ship on three different dates:

Figure 6.21: Initialized ship_data variable

If you look carefully, you will notice that the name of the ship is spelled differently in all three cases. Let's assume that the actual name of the ship is "Sea Princess". From a normal perspective, it does look like there has been a human error and the data points do describe a single ship. Removing two of them on a strict basis of outliers may not be the best thing to do.

2. Then, we simply need to import the distance function from it and pass two strings to it to calculate the distance between them:

from Levenshtein import distance

name_of_ship = "Sea Princess"

for k, v in ship_data.items():
    print("{} {} {}".format(k, name_of_ship, distance(name_of_ship, k)))

The output is as follows:
Figure 6.22: Distance between the strings

We will notice that the distances between the strings are different. It is 0 when they are identical, and it is a positive integer when they are not. We can use this concept in our data wrangling jobs and say that strings with a distance less than or equal to a certain number are the same string.

Here, again, we need to be cautious about when and how to use this kind of fuzzy string matching. Sometimes it is needed, and at other times it will result in a very bad bug. A small, self-contained sketch of the idea follows.
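
The following is a minimal, self-contained sketch of the fuzzy-matching idea, using a hypothetical ship_data dictionary (the actual values in the book's figure may differ):

from Levenshtein import distance

# Hypothetical data: the same ship logged under slightly different spellings
ship_data = {
    "Sea Princess": {"date": "12/08", "load": 40000},
    "Sea Pincess": {"date": "10/06", "load": 30000},
    "Sea Princes": {"date": "12/04", "load": 30000},
}

name_of_ship = "Sea Princess"
threshold = 2  # treat names within 2 edits as the same ship

for k in ship_data:
    if distance(name_of_ship, k) <= threshold:
        print("Treating '{}' as '{}'".format(k, name_of_ship))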

Activity 8: Handling Outliers and Missing Data
In this activity, we will identify and get rid of outliers. Here, we have a CSV file. The goal is to clean the data by using the knowledge that we have learned so far and come up with a nicely formatted DataFrame. Identify the type of outliers and their effect on the data, and clean the messy data.

The steps that will help you solve this activity are as follows:

1. Read the visit_data.csv file.

2. Check for duplicates.

3. Check if any essential column contains NaN.

4. Get rid of the outliers.

5. Report the size difference.

6. Create a box plot to check for outliers.

7. Get rid of any outliers.
Note

The solution for this activity can be found on page 312.

Summary
In this chapter, we learned about interesting ways to deal with list data by using generator expressions. They are easy and elegant and, once mastered, they give us a powerful trick that we can use repeatedly to simplify several common data wrangling tasks. We also examined different ways to format data. Formatting data is not only useful for preparing beautiful reports – it is often very important to guarantee data integrity for the downstream systems.

We ended the chapter by checking out some methods to identify and remove outliers. This is important for us because we want our data to be properly prepared and ready for all our fancy downstream analysis jobs. We also observed how important it is to take time and use domain expertise to set up rules for identifying outliers, as doing this incorrectly can do more harm than good.

In the next chapter, we will cover how to read web pages, XML files, and APIs.
Chapter 7
Advanced Web Scraping and Data Gathering
Learning Objectives
By the end of this chapter, you will be able to:

Make use of requests and BeautifulSoup to read various web pages and gather data from them

Perform read operations on XML files and the web using an Application Program Interface (API)

Make use of regex techniques to scrape useful information from a large and messy text corpus

In this chapter, you will learn how to gather data from web pages, XML files, and APIs.

Introduction
The previous chapter covered how to create a successful data wrangling pipeline. In this chapter, we will build a real-life web scraper using all of the techniques that we have learned so far. This chapter builds on the foundation of BeautifulSoup and introduces various methods for scraping a web page and using an API to gather data.

The Basics of Web Scraping and the Beautiful Soup Library
In today's connected world, one of the most valued and widely used skills for a data wrangling professional is the ability to extract and read data from web pages and databases hosted on the web. Most organizations host data on the cloud (public or private), and the majority of web microservices these days provide some kind of API for external users to access data:

Figure 7.1: Data wrangling HTTP request and an XML/JSON reply

It is necessary that, as a data wrangling engineer, you know about the structure of web pages and the Python libraries that enable you to extract data from a web page. The World Wide Web is an ever-growing, ever-changing universe, in which different data exchange protocols and formats are used. A few of these are widely used and have become standards.

LIBRARIES IN PYTHON
Python comes equipped with built-in modules, such as urllib, that can place HTTP requests over the internet and receive data from the cloud. However, these modules operate at a lower level and require deeper knowledge of HTTP protocols, encoding, and requests.

We will take advantage of two Python libraries in this chapter: Requests and BeautifulSoup. To avoid dealing with HTTP methods at a lower level, we will use the Requests library. It is an API built on top of pure Python web utility libraries, which makes placing HTTP requests easy and intuitive.

BeautifulSoup is one of the most popular HTML parser packages. It parses the HTML content you pass on and builds a detailed tree of all the tags and markup within the page for easy and intuitive traversal. This tree can be used by a programmer to look for certain markup elements (for example, a table, a hyperlink, or a blob of text within a particular div ID) to scrape useful data.

EXERCISE 81: USING THE REQUESTS LIBRARY TO GET A RESPONSE FROM THE WIKIPEDIA HOME PAGE
The Wikipedia home page consists of many elements and scripts, all of which are a mix of HTML, CSS, and JavaScript code blocks. To read the home page of Wikipedia and extract some useful textual information, we need to move step by step, as we are not interested in all of the code or markup tags; only some selected portions of the text.

In this exercise, we will peel off the layers of HTML/CSS/JavaScript to pry away the information we are interested in.
1. Import the requests library:

import requests

2. Assign the home page URL to a variable, wiki_home:

# First assign the URL of the Wikipedia home page to a string
wiki_home = "https://en.wikipedia.org/wiki/Main_Page"

3. Use the get method from the requests library to get a response from this page:

response = requests.get(wiki_home)

4. To get information about the response object, enter the following code:

type(response)

The output is as follows:

requests.models.Response

It is a model data structure that's defined in the requests library.

The web is an extremely dynamic place. It is possible that the home page of Wikipedia will have changed by the time somebody uses your code, or that a particular web server will be down and your request will essentially fail. If you proceed to write more complex and elaborate code without checking the status of your request, then all that subsequent work will be fruitless.

A web page request generally comes back with various codes. Here are some of the common codes you may encounter:

Figure 7.2: Web requests and their description

So, we write a function to check the code and print out messages as needed. These kinds of small helper/utility functions are incredibly useful for complex projects.

EXERCISE 82: CHECKING THE STATUS OF THE WEB REQUEST
Next, we will write a small utility function to check the status of the response.

We will start by getting into the habit of writing small functions to accomplish small modular tasks, instead of writing long scripts, which are hard to debug and track:

1. Create a status_check function by using the following command:

def status_check(r):
    if r.status_code == 200:
        print("Success!")
        return 1
    else:
        print("Failed!")
        return -1

Note that, along with printing the appropriate message, we are returning either 1 or -1 from this function. This is important.

2. Check the response using the status_check command:

status_check(response)

The output is as follows:

Figure 7.3: The output of status_check

In this chapter, we will not use these returned values, but later, for more complex programming activities, you should proceed only if you get 1 as the return value of this function; that is, you will write a conditional statement to check the return value and then execute the subsequent code based on it.

CHECKING THE ENCODING OF THE WEB PAGE
We can also write a utility function to check the encoding of the web page. Various encodings are possible with any HTML document, although the most popular is UTF-8. Some of the most popular encodings are ASCII, Unicode, and UTF-8. ASCII is the simplest, but it cannot capture the complex symbols used in the various spoken and written languages all over the world, so UTF-8 has become almost the universal standard in web development these days.

When we run this function on the Wikipedia home page, we get back the particular encoding type that's used for that page. This function, like the previous one, takes the requests response object as an argument and returns a value:

def encoding_check(r):
    return (r.encoding)

Check the response:

encoding_check(response)

The output is as follows:

'UTF-8'

Here, UTF-8 denotes the most popular character encoding scheme that's used in the digital medium and on the web today. It employs variable-length encoding with 1-4 bytes, thereby representing all Unicode characters in various languages around the world.
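
The variable-length nature of UTF-8 is easy to verify in Python (a quick illustrative check, not part of the exercise):

# ASCII characters take 1 byte, accented Latin letters 2, many CJK characters 3, emoji 4
for ch in ["a", "é", "中", "🙂"]:
    print(ch, "->", len(ch.encode("utf-8")), "bytes")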

EXERCISE 83: CREATING A FUNCTION TO DECODE THE CONTENTS OF THE RESPONSE AND CHECK ITS LENGTH
The final aim of this series of steps is to get a page's contents as a blob of text or as a string object that Python can process afterward. Over the internet, data streams move in an encoded format. Therefore, we need to decode the content of the response object. For this purpose, we need to perform the following steps:

1. Write a utility function to decode the contents of the response:

def decode_content(r, encoding):
    return (r.content.decode(encoding))

contents = decode_content(response, encoding_check(response))

2. Check the type of the decoded object:

type(contents)

The output is as follows:

str

We finally got a string object by reading the HTML page!

Note

Note that the answer in this chapter and in the exercise in the Jupyter notebook may vary because of updates that have been made to the Wikipedia page.

3. Check the length of the object and try printing some of it:

len(contents)

The output is as follows:

74182

If you print the first 10,000 characters of this string, it will look something similar to this:
Figure 7.4: Output showing a mixed blob of HTML markup tags, text and element names, and properties

Obviously, this is a mixed blob of various HTML markup tags, text, and element names/properties. We cannot hope to extract meaningful information from this without using sophisticated functions or methods. Fortunately, the BeautifulSoup library provides such methods, and we will see how to use them next.

EXERCISE 84: EXTRACTING HUMAN-READABLE TEXT FROM A BEAUTIFULSOUP OBJECT
It turns out that a BeautifulSoup object has a text method, which can be used just to extract text:

1. Import the package and then pass on the whole string (HTML content) to a method for parsing:

from bs4 import BeautifulSoup

soup = BeautifulSoup(contents, 'html.parser')

2. Execute the following code in your notebook:

txt_dump = soup.text

3. Find the type of txt_dump:

type(txt_dump)

The output is as follows:

str

4. Find the length of txt_dump:

len(txt_dump)

The output is as follows:

15326

5. Now, the length of the text dump is much smaller than the raw HTML string's length. This is because bs4 has parsed through the HTML and extracted only the human-readable text for further processing.

6. Print the initial portion of this text:

print(txt_dump[10000:11000])

You will see something similar to the following:
Figure 7.5: Output showing the initial portion of text


EXTRACTING TEXT FROM A SECTION
Now, let's move on to a more exciting data wrangling task. If you open the Wikipedia home page, you are likely to see a section called From today's featured article. This is an excerpt from the day's prominent article, which is randomly selected and promoted on the home page. In fact, this article can also change throughout the day:

Figure 7.6: Sample Wikipedia page highlighting the "From today's featured article" section

You need to extract the text from this section. There are a number of ways to accomplish this task. We will go through a simple and intuitive method for doing so here.

First, we try to identify two indices – the start index and end index of the string, which demarcate the start and end of the text we are interested in. In the next screenshot, the indices are shown:
Figure 7.7: Wikipedia page highlighting the text to be extracted

The following code accomplishes the extraction:

idx1 = txt_dump.find("From today's featured article")

idx2 = txt_dump.find("Recently featured")

print(txt_dump[idx1+len("From today's featured article"):idx2])

Note that we have to add the length of the "From today's featured article" string to idx1 and then pass that as the starting index. This is because idx1 finds where the "From today's featured article" string starts, not where it ends.

It prints out something like this (this is a sample output):

Figure 7.8: The extracted text

EXTRACTING IMPORTANT HISTORICAL EVENTS THAT HAPPENED ON TODAY'S DATE
Next, we will try to extract the text corresponding to the important historical events that happened on today's date. This can generally be found at the bottom-right corner, as shown in the following screenshot:

Figure 7.9: Wikipedia page highlighting the "On this day" section

So, can we apply the same technique as we did for "From today's featured article"? Apparently not, because there is text just below where we want our extraction to end, and it is not fixed, unlike in the previous case. Note that, in the previous exercise, the fixed string "Recently featured" occurs at the exact place where we wanted the extraction to stop, so we could use it in our code. However, we cannot do that in this case, and the reason for this is illustrated in the following screenshot:
Figure 7.10: Wikipedia page highlighting the text to be extracted

So, in this section, we just want to find out what the text looks like around the main content we are interested in. For that, we must find the start of the string "On this day" and print out the next 1,000 characters, using the following command:

idx3 = txt_dump.find("On this day")

print(txt_dump[idx3+len("On this day"):idx3+len("On this day")+1000])

This looks as follows:

Figure 7.11: Output of the "On this day" section from Wikipedia

To address this issue, we need to think differently and use some other methods from BeautifulSoup (and write another utility function).

EXERCISE 85: USING ADVANCED BS4 TECHNIQUES TO EXTRACT RELEVANT TEXT
HTML pages are made of many markup tags, such as <div>, which denotes a division of text/images, or <ul>, which denotes lists. We can take advantage of this structure and look at the element that contains the text we are interested in. In the Mozilla Firefox browser, we can easily do this by right-clicking and selecting the "Inspect Element" option:

Figure 7.12: Inspecting elements on Wikipedia

As you hover over this with the mouse, you will see different portions of the page being highlighted. By doing this, it is easy to discover the precise block of markup text that is responsible for the textual information we are interested in. Here, we can see that a certain <ul> block contains the text:

Figure 7.13: Identifying the HTML block that contains text

Now, it is prudent to find the <div> tag that contains this <ul> block within it. By looking around the same screen as before, we find the <div> and also its ID:

Figure 7.14: The <ul> tag containing the text

1. Use the find_all method from BeautifulSoup, which scans all the tags of the HTML page (and their sub-elements), to find and extract the text associated with this particular <div> element.

Note

Note how we are utilizing the 'mp-otd' ID of the <div> to identify it among tens of other <div> elements.

The find_all method returns a list of matching tags, each of which has a useful text method associated with it for extraction.

2. To put these ideas together, we will create an empty list and append the text from the matching tags to this list as we traverse the page:

text_list = []  # Empty list

for d in soup.find_all('div'):
    if (d.get('id') == 'mp-otd'):
        for i in d.find_all('ul'):
            text_list.append(i.text)

3. Now, if we examine the text_list list, we will see that it has three elements. If we print the elements, separated by a marker, we will see that the text we are interested in appears as the first element!

for i in text_list:
    print(i)
    print('-'*100)

Note

In this example, it is the first element of the list that we are interested in. However, the exact position will depend on the web page.

The output is as follows:
Figure 7.15: The text highlighted

EXERCISE 86: CREATING A COMPACT FUNCTION TO EXTRACT THE "ON THIS DAY" TEXT FROM THE WIKIPEDIA HOME PAGE
As we discussed before, it is always good to try to functionalize specific tasks, particularly in a web scraping application:

1. Create a function whose only job is to take the URL (as a string) and return the text corresponding to the On this day section. The benefit of such a functional approach is that you can call this function from any Python script and use it anywhere in another program as a standalone module. Start with the function signature and its docstring:

def wiki_on_this_day(url="https://en.wikipedia.org/wiki/Main_Page"):
    """
    Extract the text from the "On this day" section on the
    Wikipedia home page. Accepts the Wikipedia home page URL
    as a string. A default URL is provided.
    """

2. Write the function body: place the request, check its status, decode the contents (using the decode_content and encoding_check functions we wrote earlier), and parse them with BeautifulSoup to extract the first "On this day" block:

    import requests
    from bs4 import BeautifulSoup

    wiki_home = str(url)
    response = requests.get(wiki_home)

    def status_check(r):
        if r.status_code == 200:
            return 1
        else:
            return -1

    status = status_check(response)
    if status == 1:
        contents = decode_content(response, encoding_check(response))
    else:
        print("Sorry could not reach the web page!")
        return -1

    soup = BeautifulSoup(contents, 'html.parser')

    text_list = []
    for d in soup.find_all('div'):
        if (d.get('id') == 'mp-otd'):
            for i in d.find_all('ul'):
                text_list.append(i.text)

    return (text_list[0])

3. Note how this function utilizes the status check and prints out an error message if the request failed. When we test this function with an intentionally incorrect URL, it behaves as expected:

print(wiki_on_this_day("https://en.wikipedia.org/wiki/Main_Page1"))

Sorry could not reach the web page!

Reading Data from XML
XML, or Extensible Markup Language, is a web markup language that's similar to HTML but with significant flexibility (on the part of the user) built in, such as the ability to define your own tags. It was one of the most hyped technologies of the 1990s and early 2000s. It is a meta-language, that is, a language that allows us to define other languages using its mechanics, such as RSS, MathML (a mathematical markup language widely used for web publication and the display of math-heavy technical information), and so on. XML is also heavily used in regular data exchanges over the web, and as a data wrangling professional, you should have enough familiarity with its basic features to tap into the data flow pipeline whenever you need to extract data for your project.

EXERCISE 87: CREATING AN XML FILE AND READING XML ELEMENT OBJECTS
Let's create some random data to understand the XML data format better. Type in the following code snippets:

1. Create an XML string using the following command:

data = '''
<person>
<name>Dave</name>
<surname>Piccardo</surname>
<phone type="intl">
+1 742 101 4456
</phone>
<email hide="yes">
dave.p@gmail.com</email>
</person>'''

2. This is a triple-quoted string or multiline string. If you print this object, you will get the following output. This is an XML-formatted data string in a tree structure, as we will see soon, when we parse the structure and tease apart the individual parts:

Figure 7.16: The XML file output

3. To process and wrangle the data, we have to read it as an Element object using the Python XML parser engine:

import xml.etree.ElementTree as ET

tree = ET.fromstring(data)

type(tree)

The output is as follows:

xml.etree.ElementTree.Element

EXERCISE 88: FINDING VARIOUS ELEMENTS OF DATA WITHIN A TREE (ELEMENT)
We can use the find method to search for various pieces of useful data within an XML Element object and print them (or use them in whatever processing code we want) using the text method. We can also use the get method to extract the specific attribute we want:

1. Use the find method to find the name:

# Print the name of the person
print('Name:', tree.find('name').text)

The output is as follows:

Name: Dave

2. Use the find method to find the surname:

# Print the surname
print('Surname:', tree.find('surname').text)

The output is as follows:

Surname: Piccardo

3. Use the find method to find the phone number. Note the use of the strip method to strip away any trailing spaces/blanks:

# Print the phone number
print('Phone:', tree.find('phone').text.strip())

The output will be as follows:

Phone: +1 742 101 4456

4. Use the find method to find the email status and the actual email. Note the use of the get method to extract the status:

# Print email status and the actual email
print('Email hidden:', tree.find('email').get('hide'))

print('Email:', tree.find('email').text.strip())

The output will be as follows:

Email hidden: yes

Email: dave.p@gmail.com

READING FROM A LOCAL XML FILE INTO AN ELEMENTTREE OBJECT
We can also read from an XML file (saved locally on disk).

This is a fairly common situation, where a frontend web scraping module has already downloaded a lot of XML files by reading a table of data on the web, and now the data wrangler needs to parse through these XML files to extract meaningful pieces of numerical and textual data.

We have a file associated with this chapter, called "xml1.xml". Please make sure you have the file in the same directory that you are running your Jupyter Notebook from:

tree2 = ET.parse('xml1.xml')

type(tree2)

The output will be as follows:

xml.etree.ElementTree.ElementTree

Note how we use the parse method to read this XML file. This is slightly different from the fromstring method used in the previous exercise, where we were directly reading from a string object. This produces an ElementTree object instead of a simple Element.

The idea of building a tree-like object is the same as in the domains of computer science and programming (a short traversal sketch follows the list below):

There is a root

There are children objects attached to the root

There could be multiple levels, that is, children of children, recursively going down

All of the nodes of the tree (root and children alike) have attributes attached to them that contain data

Tree traversal algorithms can be used to search for a particular attribute

If provided, special methods can be used to probe a node deeper
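
Here is a small sketch of what a generic traversal looks like with ElementTree, using the tree2 object from above (the printed tags will of course depend on the contents of xml1.xml):

def walk(element, level=0):
    # Recursively visit every node, printing its tag, attributes, and depth
    print("  " * level, element.tag, element.attrib)
    for child in element:
        walk(child, level + 1)

walk(tree2.getroot())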

EXERCISE 89: TRAVERSING THE TREE, FINDING THE ROOT, AND EXPLORING ALL CHILD NODES AND THEIR TAGS AND ATTRIBUTES
Every node in the XML tree has tags and attributes. The idea is as follows:

Figure 7.17: Finding the root and child nodes of an XML tag

1. Explore these tags and attributes using the following code:

root = tree2.getroot()

for child in root:
    print("Child:", child.tag, "| Child attribute:", child.attrib)

The output will be as follows:

Figure 7.18: The output showing the extracted XML tags

Note

Remember that every XML data file could follow a different naming or structural format, but using an element tree approach puts the data into a somewhat structured flow that can be explored systematically. Still, it is best to examine the raw XML file structure once and understand (even if at a high level) the data format before attempting automatic extractions.

EXERCISE 90: USING THE TEXT METHOD TO EXTRACT MEANINGFUL DATA
We can almost think of the XML tree as a list of lists and index it accordingly:

1. Access the element root[0][2] by using the following code:

root[0][2]

The output will be as follows:

<Element 'gdppc' at 0x00000000051FF278>

So, this points to the 'gdppc' piece of data. Here, 'gdppc' is the tag, and the actual GDP-per-capita data is attached to this tag.

2. Use the text method to access the data:

root[0][2].text

The output will be as follows:

'70617'

3. Use the tag method to access gdppc:

root[0][2].tag

The output will be as follows:

'gdppc'

4. Check root[0]:

root[0]

The output will be as follows:

<Element 'country1' at 0x00000000050298B8>

5. Check the tag:

root[0].tag

The output will be as follows:

'country1'

We can use the attrib method to access its attributes:

root[0].attrib

The output will be as follows:

{'name': 'Norway'}

So, root[0] is again an element, but it has a different set of tags and attributes than root[0][2]. This is expected because they are all part of the tree as nodes, but each is associated with a different level of data.

This last piece of code output is interesting because it returns a dictionary object. Therefore, we can just index it by its keys. We will do that in the next exercise.

EXTRACTING AND PRINTING THE GDP/PER CAPITA INFORMATION USING A LOOP
Now that we know how to read the GDP/per capita data and how to get a dictionary back from the tree, we can easily construct a simple dataset by running a loop over the tree:

for c in root:
    country_name = c.attrib['name']
    gdppc = int(c[2].text)
    print("{}: {}".format(country_name, gdppc))

The output is as follows:

Norway: 70617

Austria: 44857

Israel: 38788

We can put these into a DataFrame or CSV file for saving to a local disk or further processing, such as a simple plot! A short sketch of doing exactly that follows.
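
Here is a minimal sketch of collecting the same values into a pandas DataFrame (our own illustration; the column names and output filename are arbitrary, and it assumes, as in the exercise, that the gdppc node is each country's third child):

import pandas as pd

records = []
for c in root:
    # Each record pairs the country's name attribute with its gdppc child node
    records.append({"country": c.attrib['name'], "gdppc": int(c[2].text)})

df_gdp = pd.DataFrame(records)
df_gdp.to_csv("gdp_per_capita.csv", index=False)  # hypothetical output filename
df_gdp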

EXERCISE 91: FINDING ALL THE NEIGHBORING COUNTRIES FOR EACH COUNTRY AND PRINTING THEM
As we mentioned before, there are efficient search algorithms for tree structures, and one such method for XML trees is findall. We can use this, in this example, to find all the neighbors a country has and print them out.

Why do we need to use findall instead of find? Well, because not all the countries have an equal number of neighbors, and findall searches for all the data with that tag associated with a particular node, and we want to traverse all of them:

for c in root:
    ne = c.findall('neighbor')  # Find all the neighbors
    print("Neighbors\n" + "-"*25)
    for i in ne:  # Iterate over the neighbors and print their 'name' attribute
        print(i.attrib['name'])
    print('\n')

The output looks something like this:

Figure 7.19: The output that's generated by using findall

EXERCISE 92: A SIMPLE DEMO OF USING XML DATA OBTAINED BY WEB SCRAPING
In the last topic of this chapter, we learned about simple web scraping using the requests library. So far, we have worked with static XML data, that is, data from a local file or a string object we've scripted. Now, it is time to combine our learning and read XML data directly over the internet (as you are expected to do almost all the time):

1. We will try to read a cooking recipe from a website called http://www.recipepuppy.com/, which aggregates links to various other sites with the recipe:

import urllib.request, urllib.parse, urllib.error

serviceurl = 'http://www.recipepuppy.com/api/?'

item = str(input('Enter the name of a food item (enter \'quit\' to quit): '))

url = serviceurl + urllib.parse.urlencode({'q':item}) + '&p=1&format=xml'

uh = urllib.request.urlopen(url)

data = uh.read().decode()

print('Retrieved', len(data), 'characters')

tree3 = ET.fromstring(data)

2. This code will ask the user for input. You have to enter the name of a food item, for example, 'chicken tikka':

Figure 7.20: Demo of scraping from XML data

3. We get back data in XML format and read and decode it before creating an XML tree out of it:

data = uh.read().decode()

print('Retrieved', len(data), 'characters')

tree3 = ET.fromstring(data)

4. Now, we can use another useful method, called iter, which basically iterates over the nodes under an element. If we traverse the tree and extract the text, we get the following output:

for elem in tree3.iter():
    print(elem.text)

The output is as follows:

Figure 7.21: The output that's generated by using iter

5. We can use the find method to search for the appropriate attribute and extract its content. This is the reason it is important to scan through the XML data manually and check what attributes are used. Remember, this means scanning the raw string data, not the tree structure.

6. Print the raw string data:

Figure 7.22: The output showing the extracted href tags
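The command that produced this figure is not reproduced in the text; a minimal sketch, assuming data holds the decoded XML string from step 1, is simply:

# Inspect only the first part of the raw XML string to spot the tags in use
print(data[:2000])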

Now we know what tags to search for.

7. Print all the hyperlinks in the XML data:

for e in tree3.iter():
    h = e.find('href')
    t = e.find('title')
    if h != None and t != None:
        print("Recipe Link for:", t.text)
        print(h.text)
        print("-"*100)

Note the use of h != None and t != None. Such cases are difficult to anticipate when you first run this kind of code. You may get an error because some of the tags may return a None object, that is, they were empty for some reason in this XML data stream. This kind of situation is fairly common and cannot be anticipated beforehand. You have to use your Python knowledge and programming intuition to get around it if you receive such an error. Here, we are just checking the type of the object and, if it is not None, then we extract the text associated with it.

The final output is as follows:


Figure 7.23: The output showing the final output

Reading Data from an API

Fundamentally, an API or Application Programming Interface is some kind of interface to a computing resource (for example, an operating system or database table), which has a set of exposed methods (function calls) that allow a programmer to access particular data or internal features of that resource.

A web API is, as the name suggests, an API over the web. Note that it is not a specific technology or programming framework, but an architectural concept. Think of an API like a fast food restaurant's customer service center. Internally, there are many food items, raw materials, cooking resources, and recipe management systems, but all you see are fixed menu items on the board and you can only interact through those items. It is like a port that can be accessed using an HTTP protocol and is able to deliver data and services if used properly.

Web APIs are extremely popular these days for all kinds of data services. In the very first chapter, we talked about how UC San Diego's data science team pulls data from Twitter feeds to analyze the occurrence of forest fires. For this, they do not go to twitter.com and scrape the data by looking at HTML pages and text. Instead, they use the Twitter API, which sends this data continuously in a streaming format.

Therefore, it is very important for a data wrangling professional to understand the basics of data extraction from a web API, as you are extremely likely to find yourself in a situation where large quantities of data must be read through an API interface for processing and wrangling. These days, most APIs stream data out in JSON format. In this chapter, we will use a free API to read some information about various countries around the world in JSON format and process it.

We will use Python's built-in urllib module for this topic, along with pandas to make a DataFrame. So, we can import them now. We will also import Python's JSON module:

import urllib.request, urllib.parse

from urllib.error import HTTPError, URLError

import json

import pandas as pd

DEFINING THE BASE URL (OR API ENDPOINT)

First, we need to set the base URL. When we are dealing with API microservices, this is often called the API endpoint. Therefore, look for such a phrase in the web service portal you are interested in and use the endpoint URL they give you:

serviceurl = 'https://restcountries.eu/rest/v2/name/'

API-based microservices are extremely dynamic in nature in terms of what and how they offer their service and data. It can change at any time. At the time of planning this chapter, we found this particular API to be a nice choice for extracting data easily and without using authorization keys (login or special API keys).

For most APIs, however, you need to have your own API key. You get that by registering with their service. Basic usage (up to a fixed number of requests or a data flow limit) is often free, but after that you will be charged. To register for an API key, you often need to enter credit card information.

We wanted to avoid all that hassle to teach you the basics and that's why we chose this example, which does not require such authorization. But, depending on what kind of data you will encounter in your work, please be prepared to learn about using an API key.
EXERCISE 93: DEFINING AND TESTING A FUNCTION TO PULL COUNTRY DATA FROM AN API

This particular API serves basic information about countries around the world:

1. Define a function to pull out data when we pass the name of a country as an argument. The crux of the operation is contained in the following two lines of code:

url = serviceurl + country_name

uh = urllib.request.urlopen(url)

2. The first line of code appends the country name as a string to the base URL and the second line sends a GET request to the API endpoint. If all goes well, we get back the data, decode it, and read it as a JSON file. This whole exercise is coded in the following function, along with some error-handling code wrapped around the basic actions we talked about previously:

def get_country_data(country):
    """
    Function to get data about a country from the
    "https://restcountries.eu" API
    """
    country_name = str(country)
    url = serviceurl + country_name
    try:
        uh = urllib.request.urlopen(url)
    except HTTPError as e:
        print("Sorry! Could not retrieve anything on {}".format(country_name))
        return None
    except URLError as e:
        print('Failed to reach a server.')
        print('Reason: ', e.reason)
        return None
    else:
        data = uh.read().decode()
        print("Retrieved data on {}. Total {} characters read.".format(country_name, len(data)))
        return data

3. Test this function by passing some arguments. We pass a correct name and an erroneous name. The response is as follows:

Note

This is an example of rudimentary error handling. You have to think about various possibilities and put in such code to catch and gracefully respond to user input when you are building a real-life web or enterprise application.
Figure 7.24: Input arguments

USING THE BUILT-IN JSON LIBRARY TO READ AND EXAMINE DATA

As we have already mentioned, JSON looks a lot like a Python dictionary.

In this exercise, we will use Python's json module to read raw data in that format and see what we can process further:

x=json.loads(data)

y=x[0]

type(y)

The output will be as follows:

dict

So, we get a list back when we use the loads method from the json module. It reads the string datatype into a list of dictionaries. In this case, we get only one element in the list, so we extract that and check its type to make sure it is a dictionary.

We can quickly check the keys of the dictionary, that is, the JSON data (note that a full screenshot is not shown here). We can see the relevant country data, such as calling codes, population, area, time zones, borders, and so on:
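The exact command is not reproduced here; a minimal sketch is just asking the dictionary for its keys:

y.keys()  # Lists the top-level keys of the country record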

Figure 7.25: The output of dict_keys

PRINTING ALL THE DATA ELEMENTS

This task is extremely simple given that we have a dictionary at our disposal! All we have to do is iterate over the dictionary and print the key/value pairs one by one:

for k, v in y.items():
    print("{}: {}".format(k, v))

The output is as follows:
Figure 7.26: The output using dict

Note that the items in the dictionary are not of the same type, that is, they are not similar objects. Some are floating-point numbers, such as the area, many are simple strings, but some are lists or even lists of dictionaries!

This is fairly common with JSON data. The internal data structure of JSON can be arbitrarily complex and multilevel, that is, you can have a dictionary of lists of dictionaries of dictionaries of lists of lists… and so on.

Note

It is clear, therefore, that there is no universal method or processing function for the JSON data format, and you have to write custom loops and functions to extract data from such a dictionary object based on your particular needs.

Now, we will write a small loop to extract the languages spoken in Switzerland. First, let's examine the dictionary closely and see where the language data is:

Figure 7.27: The tags

So, the data is embedded inside a list of dictionaries, which is accessed by a particular key of the main dictionary.

We can write a simple two-line piece of code to extract this data:

for lang in y['languages']:
    print(lang['name'])

The output is as follows:

Figure 7.28: The output showing the languages


USING A FUNCTION THAT EXTRACTS A DATAFRAME CONTAINING KEY INFORMATION

Here, we are interested in writing a function that can take a list of countries and return a pandas DataFrame with some key information:

Capital

Region

Sub-region

Population

Latitude/longitude

Area

Gini index

Time zones

Currencies

Languages

Note

This is the kind of wrapper function you are generally expected to write in real-life data wrangling tasks, that is, a utility function that can take a user argument and output a useful data structure (or a mini database-type object) with key information extracted over the internet about the item the user is interested in.

We will show you the whole function first and then discuss some key points about it. It is a slightly complex and long piece of code. However, based on your Python-based data wrangling knowledge, you should be able to examine this function closely and understand what it is doing:

import pandas as pd

import json

def build_country_database(list_country):
    """
    Takes a list of country names.
    Output a DataFrame with key information about those countries.
    """
    # Define an empty dictionary with keys
    country_dict = {'Country': [], 'Capital': [], 'Region': [], 'Sub-region': [], 'Population': [],
                    'Lattitude': [], 'Longitude': [], 'Area': [], 'Gini': [], 'Timezones': [],
                    'Currencies': [], 'Languages': []}

Note

The code has been truncated here. Please find the entire code at the following GitHub link and code bundle folder link: https://github.com/TrainingByPackt/Data-Wrangling-with-Python/blob/master/Chapter07/Exercise93-94/Chapter%207%20Topic%203%20Exercises.ipynb.

Here are some of the key points about this function:

It starts by building an empty dictionary of lists. This is the chosen format for finally passing to the pandas DataFrame method, which can accept such a format and returns a nice DataFrame with column names set to the dictionary keys' names.

We use the previously defined get_country_data function to extract data for each country in the user-defined list. For this, we simply iterate over the list and call this function.

We check the output of the get_country_data function. If, for some reason, it returns a None object, we will know that the API reading was not successful, and we will print out a suitable message. Again, this is an example of an error-handling mechanism and you must have them in your code. Without such small error-checking code, your application won't be robust enough for the occasional incorrect input or API malfunction!

For many data types, we simply extract the data from the main JSON dictionary and append it to the corresponding list in our data dictionary.

However, for special data types, such as time zones, currencies, and languages, we write a special loop to extract the data without error.

We also take care of the fact that these special data types can have a variable length, that is, some countries may have multiple spoken languages, but most will have only one entry. So, we check whether the length of the list is greater than one and handle the data accordingly (a minimal sketch of this idea is shown after this list).
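The full function is in the linked notebook; the following is only a minimal sketch of how such a variable-length field could be handled. The names data_dict (one country's JSON record) and the comma-joining of several names are illustrative assumptions here, not the book's exact code:

# Sketch: collapse a list of language dictionaries into one display string
langs = data_dict.get('languages', [])
if len(langs) > 1:
    # Several entries: join all the names into a single comma-separated string
    country_dict['Languages'].append(', '.join(l['name'] for l in langs))
elif len(langs) == 1:
    country_dict['Languages'].append(langs[0]['name'])
else:
    country_dict['Languages'].append(None)  # Nothing reported for this country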

EXERCISE 94: TESTING THE FUNCTION BY BUILDING A SMALL DATABASE OF COUNTRIES' INFORMATION

Finally, we test this function by passing a list of country names:

1. To test its robustness, we pass in an erroneous name – such as 'Turmeric' in this case!

See the output… it detected that it did not get any data back for the incorrect entry and printed out a suitable message. The key is that, if you do not have the error checking and handling code in your function, then it will stop execution on that entry and will not return the expected mini database. To avoid this behavior, such error-handling code is invaluable:

Figure 7.29: The incorrect entry highlighted

2. Finally, the output is a pandas DataFrame, which is as follows:
Figure 7.30: The data extracted correctly

Fundamentals of Regular Expressions (RegEx)

Regular expressions or regex are used to identify whether a pattern exists in a given sequence of characters (a string) or not. They help in manipulating textual data, which is often a prerequisite for data science projects that involve text mining.
REGEX IN THE CONTEXT OF WEB SCRAPING

Web pages are often full of text and, while there are some methods in BeautifulSoup or the XML parser to extract raw text, there is no method for the intelligent analysis of that text. If, as a data wrangler, you are looking for a particular piece of data (for example, email IDs or phone numbers in a special format), you have to do a lot of string manipulation on a large corpus to extract email IDs or phone numbers. RegEx are very powerful and save a data wrangling professional a lot of time and effort with string manipulation because they can search for complex textual patterns with wildcards of an arbitrary length.

RegEx is like a mini-programming language in itself, and its common ideas are used not only in Python, but in all widely used web app languages like JavaScript, PHP, Perl, and so on. The RegEx module is built into Python, and you can import it by using the following code:

import re

EXERCISE 95: USING THE MATCH METHOD TO CHECK WHETHER A PATTERN MATCHES A STRING/SEQUENCE

One of the most common regex methods is match. This is used to check for an exact or partial match at the beginning of the string (by default):

1. Import the RegEx module:

import re

2. Define a string and a pattern:

string1 = 'Python'

pattern = r"Python"

3. Write a conditional expression to check for a match:

if re.match(pattern, string1):
    print("Matches!")
else:
    print("Doesn't match.")

The preceding code should give an affirmative answer, that is, "Matches!".

4. Test this with a string that only differs in the first letter by making it lowercase:

string2 = 'python'

if re.match(pattern, string2):
    print("Matches!")
else:
    print("Doesn't match.")

The output is as follows:

Doesn't match.

USING THE COMPILE METHOD TO CREATE A REGEX PROGRAM

In a program or module, if we are making heavy use of a particular pattern, then it is better to use the compile method to create a regex program and then call methods on this program.

Here is how you compile a regex program:

prog = re.compile(pattern)

prog.match(string1)

The output is as follows:

<_sre.SRE_Match object; span=(0, 6), match='Python'>

This code produced an SRE.Match object that has a span of (0, 6) and the matched string of 'Python'. The span here simply denotes the start and end indices of the pattern that was matched. These indices may come in handy in a text mining program where the subsequent code uses the indices for further search or decision-making purposes. We will see some examples of that later.
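For instance, a minimal sketch of putting those indices to work (reusing prog and string1 from above) is:

m = prog.match(string1)
start, end = m.span()       # (0, 6) for this match
print(string1[start:end])   # Slices the matched portion back out of the original string: 'Python'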

EXERCISE 96: COMPILING PROGRAMS TO MATCH OBJECTS

Compiled objects act like functions in that they return None if the pattern does not match. Here, we are going to check that by writing a simple conditional. This concept will come in handy later when we write a small utility function to check the type of the returned object from regex-compiled programs and act accordingly. We cannot be sure whether a pattern will match a given string or whether it will appear in a corpus of text (if we are searching for the pattern anywhere within the text). Depending on the situation, we may encounter Match objects or None as the returned value, and we have to handle this gracefully:

#string1 = 'Python'

#string2 = 'python'

#pattern = r"Python"
1. Use the compile function in RegEx:

prog = re.compile(pattern)

2. Match it with the first string:

if prog.match(string1) != None:
    print("Matches!")
else:
    print("Doesn't match.")

The output is as follows:

Matches!

3. Match it with the second string:

if prog.match(string2) != None:
    print("Matches!")
else:
    print("Doesn't match.")

The output is as follows:

Doesn't match.
EXERCISE 97: USING ADDITIONAL PARAMETERS IN MATCH TO CHECK FOR POSITIONAL MATCHING

By default, match looks for pattern matching at the beginning of the given string. But sometimes, we need to check matching at a specific location in the string:

1. Match 'y' at the second position:

prog = re.compile(r'y')

prog.match('Python', pos=1)

The output is as follows:

<_sre.SRE_Match object; span=(1, 2), match='y'>

2. Check for the pattern thon starting from pos=2, that is, the third character:

prog = re.compile(r'thon')

prog.match('Python', pos=2)

The output is as follows:

<_sre.SRE_Match object; span=(2, 6), match='thon'>

3. Find a match in a different string by using the following command:

prog.match('Marathon', pos=4)

The output is as follows:

<_sre.SRE_Match object; span=(4, 8), match='thon'>
FINDING THE NUMBER OF WORDS IN A LIST THAT END WITH "ING"

Suppose we want to find out if a given string has the last three letters 'ing'. This kind of query may come up in a text analytics/text mining program where somebody is interested in finding instances of present continuous tense words, which are highly likely to end with 'ing'. However, other nouns may also end with 'ing' (as we will see in this example):

prog = re.compile(r'ing')

words = ['Spring','Cycling','Ringtone']

Create a for loop to find words ending with 'ing':

for w in words:
    if prog.match(w, pos=len(w)-3) != None:
        print("{} has last three letters 'ing'".format(w))
    else:
        print("{} does not have last three letter as 'ing'".format(w))

The output is as follows:

Spring has last three letters 'ing'

Cycling has last three letters 'ing'

Ringtone does not have last three letter as 'ing'

Note

It looks plain and simple, and you may well wonder what the purpose of using a special regex module for this is. A simple string method should have been sufficient. Yes, it would have been OK for this particular example, but the whole point of using regex is to be able to use very complex string patterns that are not at all obvious when it comes to how they are written using simple string methods. We will see the real power of regex compared to string methods shortly. But before that, let's explore another of the most commonly used methods, called search.

EXERCISE 98: THE SEARCH METHOD IN REGEX

search and match are related concepts and they both return the same Match object. The real difference between them is that match works for only the first match (either at the beginning of the string or at a specified position, as we saw in the previous exercises), whereas search looks for the pattern anywhere in the string and returns the appropriate position if it finds a match:

1. Use the compile method to find matching strings:

prog = re.compile('ing')

if prog.match('Spring') == None:
    print("None")

2. The output is as follows:

None

3. Search the string by using the following command:

prog.search('Spring')

<_sre.SRE_Match object; span=(3, 6), match='ing'>

prog.search('Ringtone')

<_sre.SRE_Match object; span=(1, 4), match='ing'>

As you can see, the match method returns None for the input 'Spring', and we had to write code to print that out explicitly (because in a Jupyter notebook, nothing will show up for a None object). But search returns a Match object with span=(3, 6), as it finds the ing pattern spanning those positions.

Similarly, for the Ringtone string, it finds the correct position of the match and returns span=(1, 4).

EXERCISE 99: USING THE SPAN METHOD OF THE MATCH OBJECT TO LOCATE THE POSITION OF THE MATCHED PATTERN

As you will understand by now, the span contained in the Match object is useful for locating the exact position of the pattern as it appears in the string.

1. Initialize prog with the pattern ing:

prog = re.compile(r'ing')

words = ['Spring', 'Cycling', 'Ringtone']

2. Loop over the words and use span to get the start and end positions of each match:

for w in words:
    mt = prog.search(w)
    # Span returns a tuple of start and end positions of the match
    start_pos = mt.span()[0]  # Starting position of the match
    end_pos = mt.span()[1]    # Ending position of the match

3. Still inside the loop, print each word along with the start and end positions of its 'ing' match:

    print("The word '{}' contains 'ing' in the position {}-{}".format(w, start_pos, end_pos))

The output is as follows:

The word 'Spring' contains 'ing' in the position 3-6

The word 'Cycling' contains 'ing' in the position 4-7

The word 'Ringtone' contains 'ing' in the position 1-4
EXERCISE 100: EXAMPLES OF SINGLE CHARACTER PATTERN MATCHING WITH SEARCH

Now, we will start getting into the real usage of regex with examples of various useful pattern matching. First, we will explore single-character matching. We will also use the group method, which essentially returns the matched pattern in a string format so that we can print and process it easily:

1. Dot (.) matches any single character except a newline character:

prog = re.compile(r'py.')

print(prog.search('pygmy').group())

print(prog.search('Jupyter').group())

The output is as follows:

pyg

pyt

2. \w (lowercase w) matches any single letter, digit, or underscore:

prog = re.compile(r'c\wm')

print(prog.search('comedy').group())

print(prog.search('camera').group())

print(prog.search('pac_man').group())

print(prog.search('pac2man').group())

The output is as follows:

com

cam

c_m

c2m
3. \W (uppercase W) matches anything not covered by \w:

prog = re.compile(r'4\W1')

print(prog.search('4/1 was a wonderful day!').group())

print(prog.search('4-1 was a wonderful day!').group())

print(prog.search('4.1 was a wonderful day!').group())

print(prog.search('Remember the wonderful day 04/1?').group())

The output is as follows:

4/1

4-1

4.1

4/1

4. \s (lowercase s) matches a single whitespace character, such as a space, newline, tab, or return:

prog = re.compile(r'Data\swrangling')

print(prog.search("Data wrangling is cool").group())

print("-"*80)

print("Data\twrangling is the full string")

print(prog.search("Data\twrangling is the full string").group())

print("-"*80)

print("Data\nwrangling is the full string")

print(prog.search("Data\nwrangling").group())

The output is as follows:

Data wrangling

--------------------------------------------------------------------------------

Data    wrangling is the full string

Data    wrangling

--------------------------------------------------------------------------------

Data
wrangling is the full string

Data
wrangling

5. \d matches numerical digits 0–9:

prog = re.compile(r"score was \d\d")

print(prog.search("My score was 67").group())

print(prog.search("Your score was 73").group())

The output is as follows:

score was 67

score was 73

EXERCISE 101: EXAMPLES OF PATTERN MATCHING AT THE START OR END OF A STRING

In this exercise, we will match patterns with strings. The focus is to find out whether the pattern is present at the start or the end of the string:

1. Write a function to handle cases where a match is not found, that is, to handle None objects as returns:

def print_match(s):
    if prog.search(s) == None:
        print("No match")
    else:
        print(prog.search(s).group())

2. Use ^ (caret) to match a pattern at the start of the string:

prog = re.compile(r'^India')

print_match("Russia implemented this law")

print_match("India implemented that law")

print_match("This law was implemented by India")

The output is as follows:

No match

India

No match

3. Use $ (dollar sign) to match a pattern at the end of the string:

prog = re.compile(r'Apple$')

print_match("Patent no 123456 belongs to Apple")

print_match("Patent no 345672 belongs to Samsung")

print_match("Patent no 987654 belongs to Apple")

The output is as follows:

Apple

No match

Apple

EXERCISE 102: EXAMPLES OF PATTERN MATCHING WITH MULTIPLE CHARACTERS

Now, we will turn to more exciting and useful pattern matching with examples of multiple-character matching. You should start seeing and appreciating the real power of regex by now.

Note:

For these examples and exercises, also try to think how you would implement them without regex, that is, by using simple string methods and any other logic that you can think of. Then, compare that solution to the one implemented with regex for brevity and efficiency.

1. Use * to match 0 or more repetitions of the preceding RE:

prog = re.compile(r'ab*')

print_match("a")

print_match("ab")

print_match("abbb")

print_match("b")

print_match("bbab")

print_match("something_abb_something")

The output is as follows:

a

ab

abbb

No match

ab

abb

2. Using + causes the resulting RE to match 1 or more repetitions of the preceding RE:

prog = re.compile(r'ab+')

print_match("a")

print_match("ab")

print_match("abbb")

print_match("b")

print_match("bbab")

print_match("something_abb_something")

The output is as follows:

No match

ab

abbb

No match

ab

abb

3. ? causes the resulting RE to match precisely 0 or 1 repetitions of the preceding RE:

prog = re.compile(r'ab?')

print_match("a")

print_match("ab")

print_match("abbb")

print_match("b")

print_match("bbab")

print_match("something_abb_something")

The output is as follows:

a

ab

ab

No match

ab

ab

EXERCISE 103: GREEDY VERSUS NON-GREEDY MATCHING

The standard (default) mode of pattern matching in regex is greedy, that is, the program tries to match as much as it can. Sometimes, this behavior is natural, but, in some cases, you may want to match minimally:

1. The greedy way of matching a string is as follows:

prog = re.compile(r'<.*>')

print_match('<a> b <c>')

The output is as follows:

<a> b <c>

2. So, the preceding regex found both tags with the < > pattern, but what if we wanted to match the first tag only and stop there? We can use ? by inserting it after any regex expression to make it non-greedy:

prog = re.compile(r'<.*?>')

print_match('<a> b <c>')

The output is as follows:

<a>

EXERCISE 104: CONTROLLING REPETITIONS TO MATCH

In many situations, we want to have precise control over how many repetitions of the pattern we want to match in a text. This can be done in a few ways, which we will show examples of here:
1. {m} specifies exactly m copies of RE to match. Fewer matches cause a non-match and return None:

prog = re.compile(r'A{3}')

print_match("ccAAAdd")

print_match("ccAAAAdd")

print_match("ccAAdd")

The output is as follows:

AAA

AAA

No match

2. {m,n} specifies exactly m to n copies of RE to match:

prog = re.compile(r'A{2,4}B')

print_match("ccAAABdd")

print_match("ccABdd")

print_match("ccAABBBdd")

print_match("ccAAAAAAABdd")

The output is as follows:

AAAB

No match

AAB

AAAAB

3. Omitting m specifies a lower bound of zero:

prog = re.compile(r'A{,3}B')

print_match("ccAAABdd")

print_match("ccABdd")

print_match("ccAABBBdd")

print_match("ccAAAAAAABdd")

The output is as follows:

AAAB

AB

AAB

AAAB

4. Omitting n specifies an infinite upper bound:

prog = re.compile(r'A{3,}B')

print_match("ccAAABdd")

print_match("ccABdd")

print_match("ccAABBBdd")

print_match("ccAAAAAAABdd")

The output is as follows:

AAAB

No match

No match

AAAAAAAB

5. {m,n}? specifies m to n copies of RE to match in a non-greedy fashion:

prog = re.compile(r'A{2,4}')

print_match("AAAAAAA")

prog = re.compile(r'A{2,4}?')

print_match("AAAAAAA")

The output is as follows:

AAAA

AA

EXERCISE 105: SETS OF MATCHING CHARACTERS

To match an arbitrarily complex pattern, we need to be able to include a logical combination of characters together as a bunch. Regex gives us that kind of capability:

1. The following examples demonstrate such uses of regex. [x,y,z] matches x, y, or z:

prog = re.compile(r'[A,B]')

print_match("ccAd")

print_match("ccABd")

print_match("ccXdB")

print_match("ccXdZ")

The output will be as follows:

A

A

B

No match

A range of characters can be matched inside the set using -. This is one of the most widely used regex techniques!

2. Suppose we want to pick out an email address from a text. Email addresses are generally of the form <some name>@<some domain name>.<some domain identifier>:

prog = re.compile(r'[a-zA-Z]+@+[a-zA-Z]+\.com')

print_match("My email is coolguy@xyz.com")

print_match("My email is coolguy12@xyz.com")

The output is as follows:

coolguy@xyz.com

No match
Look at the regex pattern inside the [ … ]. It is 'a-zA-Z'. This covers all alphabets, including lowercase and uppercase! With this one simple regex, you are able to match any (pure) alphabetical string for that part of the email. Now, the next pattern is '@', which is added to the previous regex by a '+' character. This is the way to build up a complex regex: by adding/stacking up individual regex patterns. We also use the same [a-zA-Z] for the email domain name and add a '.com' at the end to complete the pattern as a valid email address. Why \.? Because, by itself, DOT (.) is used as a special modifier in regex, but here we want to use DOT (.) just as DOT (.), not as a modifier. So, we need to precede it with a '\'.

3. So, with this regex, we could extract the first email address perfectly but got 'No match' with the second one.

4. What happened with the second email ID?

5. The regex could not capture it because it had the number '12' in the name! That pattern is not captured by the expression [a-zA-Z].

6. Let's change that and add the digits as well:

prog = re.compile(r'[a-zA-Z0-9]+@+[a-zA-Z]+\.com')

print_match("My email is coolguy12@xyz.com")

print_match("My email is coolguy12@xyz.org")

The output is as follows:

coolguy12@xyz.com

No match

Now, we catch the first email ID perfectly. But what's going on with the second one? Again, we got a mismatch. The reason is that we changed the .com to .org in that email, and in our regex expression, that portion was hardcoded as .com, so it did not find a match.

7. Let's try to address this in the following regex:

prog = re.compile(r'[a-zA-Z0-9]+@+[a-zA-Z]+\.+[a-zA-Z]{2,3}')

print_match("My email is coolguy12@xyz.org")

print_match("My email is coolguy12[AT]xyz[DOT]org")

The output is as follows:

coolguy12@xyz.org

No match

8. In this regex, we used the fact that most domain identifiers have 2 or 3 characters, so we used [a-zA-Z]{2,3} to capture that.

What happened with the second email ID? This is an example of the small tweaks that you can make to stay ahead of telemarketers who want to scrape online forums or any other corpus of text and extract your email ID. If you do not want your email to be found, you can change @ to [AT] and . to [DOT], and hopefully that can beat some regex techniques (but not all)!
EXERCISE 106: THE USE OF OR IN REGEX USING THE OR OPERATOR

Because regex patterns are like complex and compact logical constructors themselves, it makes perfect sense that we want to combine them to construct even more complex programs when needed. We can do that by using the | operator:

1. The following example demonstrates the use of the OR operator:

prog = re.compile(r'[0-9]{10}')

print_match("3124567897")

print_match("312-456-7897")

The output is as follows:

3124567897

No match

So, here, we are trying to extract patterns of 10-digit numbers that could be phone numbers. Note the use of {10} to denote exactly 10-digit numbers in the pattern. But the second number could not be matched for obvious reasons – it had '-' symbols inserted in between groups of numbers.

2. Use multiple smaller regexes and logically combine them by using the following command:

prog = re.compile(r'[0-9]{10}|[0-9]{3}-[0-9]{3}-[0-9]{4}')

print_match("3124567897")

print_match("312-456-7897")

The output is as follows:

3124567897

312-456-7897

Phone numbers are written in a myriad of ways and, if you search on the web, you will see examples of very complex regexes (written not only in Python but in other widely used languages for web apps, such as JavaScript, C++, PHP, Perl, and so on) for capturing phone numbers.

3. Create four pattern strings, combine them with |, and execute print_match on several phone number formats:

p1 = r'[0-9]{10}'

p2 = r'[0-9]{3}-[0-9]{3}-[0-9]{4}'

p3 = r'\([0-9]{3}\)[0-9]{3}-[0-9]{4}'

p4 = r'[0-9]{3}\.[0-9]{3}\.[0-9]{4}'

pattern = p1 + '|' + p2 + '|' + p3 + '|' + p4

prog = re.compile(pattern)

print_match("3124567897")

print_match("312-456-7897")

print_match("(312)456-7897")

print_match("312.456.7897")

The output is as follows:

3124567897

312-456-7897

(312)456-7897

312.456.7897

THE FINDALL METHOD

The last regex method that we will learn in this chapter is findall. Essentially, it is a search-and-aggregate method, that is, it puts all the instances that match the regex pattern in a given text into a list and returns them. This is extremely useful, as we can just count the length of the returned list to count the number of occurrences, or pick and use the returned pattern-matched words one by one as we see fit.

Note that, although we are giving short examples of single sentences in this chapter, you will often deal with a large corpus of text when using a RegEx.

In those cases, you are likely to get many matches from a single regex pattern search. For all of those cases, the findall method is going to be the most useful:

ph_numbers = """Here are some phone numbers.
Pick out the numbers with 312 area code:
312-423-3456, 456-334-6721, 312-5478-9999,
312-Not-a-Number,777.345.2317, 312.331.6789"""

print(ph_numbers)

re.findall('312+[-\.][0-9-\.]+', ph_numbers)

The output is as follows:

Here are some phone numbers.

Pick out the numbers with 312 area code:

312-423-3456, 456-334-6721, 312-5478-9999,

312-Not-a-Number,777.345.2317,
312.331.6789

['312-423-3456', '312-5478-9999', '312.331.6789']
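Since findall returns a plain list, counting the occurrences is just a matter of taking its length; a minimal sketch, reusing the pattern and text above, is:

matches = re.findall('312+[-\.][0-9-\.]+', ph_numbers)
print(len(matches))  # Number of 312-area-code numbers found; 3 for the text above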

ACTIVITY 9: EXTRACTING THE TOP 100 EBOOKS FROM GUTENBERG

Project Gutenberg encourages the creation and distribution of eBooks by encouraging volunteer efforts to digitize and archive cultural works. This activity aims to scrape the URL of Project Gutenberg's Top 100 eBooks to identify the eBooks' links. It uses BeautifulSoup4 to parse the HTML and regular expression code to identify the Top 100 eBook file numbers.

You can use those book ID numbers to download the books onto your local drive if you want.

Head over to the supplied Jupyter notebook (in the GitHub repository) to work on this activity.

These are the steps that will help you solve this activity:

1. Import the necessary libraries, including regex and beautifulsoup.

2. Check the SSL certificate.

3. Read the HTML from the URL.

4. Write a small function to check the status of the web request.

5. Decode the response and pass this on to BeautifulSoup for HTML parsing.

6. Find all the href tags and store them in the list of links. Check what the list looks like – print the first 30 elements.

7. Use a regular expression to find the numeric digits in these links. These are the file numbers for the top 100 eBooks.

8. Initialize an empty list to hold the file numbers, loop over an appropriate range, and use regex to find the numeric digits in the link href strings. Use the findall method.

9. What does the soup object's text look like? Use the .text method and print only the first 2,000 characters (do not print the whole thing, as it is too long).

10. Search in the extracted text (using a regular expression) from the soup object to find the names of the top 100 eBooks (yesterday's ranking).

11. Create a starting index. It should point at the text Top 100 Ebooks yesterday. Use the splitlines method of soup.text. It splits the lines of text of the soup object.

12. Loop 1-100 to add the strings of the next 100 lines to a temporary list. Hint: use the splitlines method.

13. Use a regular expression to extract only the text from the name strings and append it to an empty list. Use match and span to find the indices and use them.

Note

The solution for this activity can be found on page 315.

ACTIVITY 10: BUILDING YOUR OWN MOVIE DATABASE BY READING AN API

In this activity, you will build a complete movie database by communicating and interfacing with a free API. You will learn about obtaining a unique user key that must be used when your program tries to access the API. This activity will teach you general lessons about working with an API, which are fairly common for other highly popular API services such as Google or Twitter. Therefore, after doing this exercise, you will be confident about writing more complex programs to scrape data from such services.

The aims of this activity are as follows:

To retrieve and print basic data about a movie (the title is entered by the user) from the web (OMDb database)

If a poster of the movie can be found, it downloads the file and saves it at a user-specified location

These are the steps that will help you solve this activity:

1. Import urllib.request, urllib.parse, urllib.error, and json.

2. Load the secret API key (you have to get one from the OMDb website and use that; it has a daily limit of 1,000) from a JSON file stored in the same folder into a variable, by using json.loads.

3. Obtain a key and store it in JSON as APIkeys.json.

4. Open the APIkeys.json file.

5. Assign the OMDb portal (http://www.omdbapi.com/?) as a string to a variable.

6. Create a variable called apikey with the last portion of the URL (&apikey=secretapikey), where secretapikey is your own API key.

7. Write a utility function called print_json to print the movie data from a JSON file (which we will get from the portal).

8. Write a utility function to download a poster of the movie based on the information from the JSON dataset and save it in your local folder. Use the os module. The poster data is stored in the JSON key Poster. Use the Python command to open a file and write the poster data. Close the file after you're done. This function will save the poster data as an image file.

9. Write a utility function called search_movie to search for a movie by its name, print the downloaded JSON data, and save the movie poster in the local folder. Use a try-except loop for this. Use the previously created serviceurl and apikey variables. You have to pass a dictionary with a key, t, and the movie name as the corresponding value to the urllib.parse.urlencode() function and then add the serviceurl and apikey to the output of the function to construct the full URL. This URL will be used to access the data. The JSON data has a key called Response. If it is True, that means the read was successful. Check this before processing the data. If it's not successful, then print the JSON key Error, which will contain the appropriate error message returned by the movie database.

10. Test the search_movie function by entering Titanic.

11. Test the search_movie function by entering "Random_error" (obviously, this will not be found, and you should be able to check whether your error-catching code is working properly).

Note:

The solution for this activity can be found on page 320.
Summary

In this chapter, we went through several important concepts and learning modules related to advanced data gathering and web scraping. We started by reading data from web pages using two of the most popular Python libraries – requests and BeautifulSoup. In this task, we utilized the previous chapter's knowledge about the general structure of HTML pages and their interaction with Python code. We extracted meaningful data from the Wikipedia home page during this process.

Then, we learned how to read data from XML and JSON files, two of the most widely used data streaming/exchange formats on the web. For the XML part, we showed you how to traverse the tree-structured data string efficiently to extract key information. For the JSON part, we mixed it with reading data from the web using an API (Application Programming Interface). The API we consumed was RESTful, which is one of the major standards in web APIs.

At the end of this chapter, we went through a detailed exercise of using regex techniques in tricky string-matching problems to scrape useful information from a large and messy text corpus, parsed from HTML. This chapter should come in extremely handy for string and text processing tasks in your data wrangling career.

In the next chapter, we will learn about databases with Python.
Chapter 8
RDBMS and SQL
Learning Objectives
By the end of this chapter, you will be able to:

Apply the basics of RDBMS to query databases using Python

Convert data from SQL into a pandas DataFrame

This chapter explains the concepts of databases, including their creation, manipulation, and control, and transforming tables into pandas DataFrames.

Introduction
This chapter of our data journey is focused on RDBMS (Relational Database Management Systems) and SQL (Structured Query Language). In the previous chapter, we stored and read data from a file. In this chapter, we will read structured data, design access to the data, and create query interfaces for databases.

Data has been stored in RDBMS format for years. The reasons behind this are as follows:

RDBMS is one of the safest ways to store, manage, and retrieve data.

They are backed by a solid mathematical foundation (relational algebra and calculus) and they expose an efficient and intuitive declarative language – SQL – for easy interaction.

Almost every language has a rich set of libraries to interact with different RDBMS, and the tricks and methods of using them are well tested and well understood.

Scaling an RDBMS is a pretty well-understood task and there are a bunch of well-trained, experienced professionals to do this job (DBAs or database administrators).

As we can see in the following chart, the market for DBMS is big. This chart was produced based on market research that was done by Gartner, Inc. in 2016:
Figure 8.1 Commercial database market share in 2016

We will learn and play around with some basic and fundamental concepts of databases and database management systems in this chapter.

Refresher of RDBMS and SQL

An RDBMS is a piece of software that manages data (represented for the end user in a tabular form) on physical hard disks and is built using Codd's relational model. Most of the databases that we encounter today are RDBMS. In recent years, there has been a huge industry shift toward a newer kind of database management system, called NoSQL (MongoDB, CouchDB, Riak, and so on). These systems, although in some aspects they follow some of the rules of RDBMS, in most cases reject or modify them.

HOW IS AN RDBMS STRUCTURED?

The RDBMS structure consists of three main elements, namely the storage engine, query engine, and log management. Here is a diagram that shows the structure of an RDBMS:

Figure 8.2 RDBMS structure

The following are the main concepts of any RDBMS structure:

Storage engine: This is the part of the RDBMS that is responsible for storing the data in an efficient way and also for giving it back, when asked for, in an efficient way. As an end user of the RDBMS system (an application developer is considered an end user of an RDBMS), we will never need to interact with this layer directly.

Query engine: This is the part of the RDBMS that allows us to create data objects (tables, views, and so on), manipulate them (create and delete columns, create/delete/update rows, and so on), and query them (read rows) using a simple yet powerful language.

Log management: This part of the RDBMS is responsible for creating and maintaining the logs. If you are wondering why the log is such an important thing, then you should look into how replication and partitions are handled in a modern RDBMS (such as PostgreSQL) using something called the Write Ahead Log (or WAL for short).

We will focus on the query engine in this chapter.

SQL
Structured Query Language, or SQL (pronounced sequel), as it is commonly known, is a domain-specific language that was originally designed based on E.F. Codd's relational model and is widely used in today's databases to define, insert, manipulate, and retrieve data from them. It can be further sub-divided into four smaller sub-languages, namely DDL (Data Definition Language), DML (Data Manipulation Language), DQL (Data Query Language), and DCL (Data Control Language). There are several advantages of using SQL, with some of them being as follows:

It is based on a solid mathematical framework and thus it is easy to understand.

It is a declarative language, which means that we actually never tell it how to do its job. We almost always tell it what to do. This frees us from a big burden of writing custom code for data management. We can be more focused on the actual query problem we are trying to solve instead of bothering about how to create and maintain a data store.

It gives you a fast and readable way to deal with data.

SQL gives you out-of-the-box ways to get multiple pieces of data with a single query.

The main areas of focus for the following topic will be DDL, DML, and DQL. The DCL part is more for database administrators.

DDL: This is how we define our data structure in SQL. As RDBMS is mainly designed and built with structured data in mind, we have to tell an RDBMS engine beforehand what our data is going to look like. We can update this definition at a later point in time, but an initial one is a must. This is where we will write statements such as CREATE TABLE, DROP TABLE, or ALTER TABLE.

Note

Notice the use of uppercase letters. It is not a specification and you can use lowercase letters, but it is a widely followed convention and we will use that in this book.

DML: DML is the part of SQL that lets us insert, delete, or update a certain data point (a row) in a previously defined data object (a table). This is the part of SQL that contains statements such as INSERT INTO, DELETE FROM, or UPDATE.

DQL: With DQL, we enable ourselves to query the data stored in an RDBMS, which was defined by DDL and inserted using DML. It gives us enormous power and flexibility to not only query data out of a single object (table), but also to extract relevant data from all the related objects using queries. The most frequently used query for retrieving data is the SELECT command. We will also see and use the concepts of the primary key, foreign key, index, joins, and so on (a minimal end-to-end sketch of these three sub-languages, driven from Python, follows this list).
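To make the three sub-languages concrete, here is a minimal sketch using Python's built-in sqlite3 module; the table name, columns, and database file name are illustrative assumptions, not part of the chapter's later exercises:

import sqlite3

with sqlite3.connect("refresher_demo.db") as conn:
    cursor = conn.cursor()
    # DDL: define the structure of the table
    cursor.execute("CREATE TABLE IF NOT EXISTS user (email TEXT, name TEXT, age INTEGER)")
    # DML: insert a row into the table
    cursor.execute("INSERT INTO user VALUES ('bob@example.com', 'Bob', 24)")
    # DQL: query the rows back
    cursor.execute("SELECT * FROM user")
    print(cursor.fetchall())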

Once you define and insert data in a database, it can be represented as follows:
Figure 8.3 Table displaying sample data

Another thing to remember about RDBMS is relations. Generally, in a table, we have one or more columns that will have unique values for each row in the table. We call them primary keys for the table. We should be aware that we will encounter unique values across the rows which are not primary keys. The main difference between them and primary keys is the fact that a primary key cannot be null.

By using the primary key of one table and mentioning it as a foreign key in another table, we can establish relations between two tables. A certain table can be related to any finite number of tables. The relations can be 1:1, which means that each row of the second table is uniquely related to one row of the first table, or 1:N, N:1, or N:M. An example of relations is as follows:
Figure 8.4 Diagram showing relations
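In SQL terms, such a relation could be sketched like this; the table and column names are purely illustrative and not taken from the figure:

import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database for the sketch
cursor = conn.cursor()
# One row per user; email acts as the primary key
cursor.execute("CREATE TABLE user (email TEXT PRIMARY KEY, name TEXT)")
# Each comment row points back to exactly one user row (a 1:N relation)
cursor.execute("""CREATE TABLE comment (
                      email TEXT,
                      text TEXT,
                      FOREIGN KEY (email) REFERENCES user (email))""")
conn.close()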

With this brief refresher, we are now ready to jump into hands-on exercises and write some SQL to store and retrieve data.

Using an RDBMS (MySQL/PostgreSQL/SQLite)

In this topic, we will focus on how to write some basic SQL commands, as well as how to connect to a database from Python and use it effectively within Python. The database we will choose here is SQLite. There are other databases, such as Oracle, MySQL, Postgresql, and DB2. The main tricks that you are going to learn here will not change based on what database you are using. But for different databases, you will need to install different third-party Python libraries (such as Psycopg2 for Postgresql, and so on). The reason they all behave the same way (apart from some small details) is the fact that they all adhere to PEP 249 (commonly known as Python DB API 2).

This is a good standardization and saves us a lot of headaches while porting from one RDBMS to another.
Note
Most of the industry-standard projects that are written in Python and use some kind of RDBMS as the data store most often rely on an ORM, or Object Relational Mapper. An ORM is a high-level library in Python that makes many tasks, while dealing with an RDBMS, easier. It also exposes a more Pythonic API than writing raw SQL inside Python code.
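Because every PEP 249-compliant driver exposes the same connect/cursor/execute/commit pattern, switching databases mostly means changing the connect() call. The following is a minimal sketch of that shared pattern, not part of the exercises that follow; the PostgreSQL credentials shown in the comments are placeholders and assume you have psycopg2 installed and a server running:

# The DB API 2.0 (PEP 249) pattern: connect -> cursor -> execute -> commit
import sqlite3

conn = sqlite3.connect("chapter.db")   # embedded file database, no server needed
cursor = conn.cursor()
cursor.execute("SELECT 1")             # any SQL statement goes through execute()
print(cursor.fetchone())
conn.commit()
conn.close()

# With PostgreSQL, only the connect call would change (hypothetical credentials):
# import psycopg2
# conn = psycopg2.connect(dbname="chapter", user="me", password="secret")
# ...the cursor/execute/commit calls stay the same.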

EXERCISE 107: CONNECTING TO A DATABASE IN SQLITE
In this exercise, we will look into the first step toward using an RDBMS in Python code. All we are going to do is connect to a database and then close the connection. We will also learn about the best way to do this:

1. Import the sqlite3 library of Python by using the following command:

import sqlite3

2. Use the connect function to connect to a database. If you already have some experience with databases, then you will notice that we are not using any server address, user name, password, or other credentials to connect to a database. This is because these fields are not mandatory in sqlite3, unlike in PostgreSQL or MySQL. The main database engine of SQLite is embedded:

conn = sqlite3.connect("chapter.db")

3. Close the connection, as follows:

conn.close()

This conn object is the main connection object, and we will need it to get a second type of object later, once we want to interact with the database. We need to be careful about closing any open connection to our database.

4. Use the same with statement from Python, just like we did for files, and connect to the database, as follows:

with sqlite3.connect("chapter.db") as conn:
    pass

In this exercise, we have connected to a database using Python.

EXERCISE 108: DDL AND DML COMMANDS IN SQLITE
In this exercise, we will look at how we can create a table, and we will also insert data into it.

As the name suggests, DDL (Data Definition Language) is the way to communicate to the database engine in advance to define what the data will look like. The database engine creates a table object based on the definition provided and prepares it.

To create a table in SQL, use the CREATE TABLE SQL clause. This will need the table name and the table definition. The table name is a unique identifier for the database engine to find and use the table for all future transactions. It can be anything (any alphanumeric string), as long as it is unique. We add the table definition in the form of (column_name_1 data_type, column_name_2 data_type, …). For our purpose, we will use the text and integer data types, but usually a standard database engine supports many more data types, such as float, double, date time, Boolean, and so on. We will also need to specify a primary key. A primary key is a unique, non-null identifier that's used to uniquely identify a row in a table. In our case, we use email as the primary key. A primary key can be an integer or text.

The last thing you need to know is that unless you call a commit on the series of operations you just performed (together, we formally call them a transaction), nothing will actually be performed and reflected in the database. This property is called atomicity. In fact, for a database to be industry standard (to be usable in real life), it needs to follow the ACID (Atomicity, Consistency, Isolation, Durability) properties:

1. Use SQLite's connect function to connect to the chapter.db database, as follows:

with sqlite3.connect("chapter.db") as conn:

Note

This code will work once you add the snippet from step 3.

2. Create a cursor object by calling conn.cursor(). The cursor object acts as a medium to communicate with the database. Create a table in Python, as follows:

cursor = conn.cursor()

cursor.execute("CREATE TABLE IF NOT EXISTS user (email text, first_name text, last_name text, address text, age integer, PRIMARY KEY (email))")

3. Insert rows into the table that you created, as follows:

cursor.execute("INSERT INTO user VALUES ('bob@example.com', 'Bob', 'Codd', '123 Fantasy lane, Fantasy City', 31)")

cursor.execute("INSERT INTO user VALUES ('tom@web.com', 'Tom', 'Fake', '456 Fantasy lane, Fantasy City', 39)")

4. Commit to the database:

conn.commit()

This will create the table and write two rows of data to it.
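Since nothing is persisted until commit is called, a half-finished transaction can also be discarded. The following is a small illustrative sketch (not part of the exercise) that assumes the user table created above already exists:

import sqlite3

conn = sqlite3.connect("chapter.db")
cursor = conn.cursor()
# Insert a throw-away row inside the current transaction
cursor.execute("INSERT INTO user VALUES ('temp@example.com', 'Temp', 'User', 'Nowhere', 0)")
# Change our mind: roll back instead of committing
conn.rollback()
# The row was never persisted, so this prints an empty list
print(cursor.execute("SELECT * FROM user WHERE email='temp@example.com'").fetchall())
conn.close()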
READING DATA FROM A DATABASE IN SQLITE
In the preceding exercise, we created a table and stored data in it. Now, we will learn how to read the data that's stored in this database.

The SELECT clause is immensely powerful, and it is really important for a data practitioner to master SELECT and everything related to it (such as conditions, joins, group-by, and so on).

The * after SELECT tells the engine to select all of the columns from the table. It is a useful shorthand. We have not mentioned any condition for the selection (such as above a certain age, first name starting with a certain sequence of letters, and so on). We are practically telling the database engine to select all the rows and all the columns from the table. This is time-consuming and less effective if we have a huge table. Hence, we would want to use the LIMIT clause to limit the number of rows we want.

You can use the SELECT clause in SQL to retrieve data, as follows:

with sqlite3.connect("chapter.db") as conn:
    cursor = conn.cursor()
    rows = cursor.execute('SELECT * FROM user')
    for row in rows:
        print(row)

The output is as follows:

Figure 8.5: Output of the SELECT clause

The syntax to use the SELECT clause with a LIMIT is as follows:

SELECT * FROM <table_name> LIMIT 50;

Note
This syntax is sample code and will not work in a Jupyter notebook.

This will select all the columns, but only the first 50 rows from the table.

EXERCISE 109: SORTING VALUES THAT ARE PRESENT IN THE DATABASE
In this exercise, we will use the ORDER BY clause to sort the rows of the user table with respect to age:

1. Sort the data in chapter.db by age in descending order, as follows:

with sqlite3.connect("chapter.db") as conn:
    cursor = conn.cursor()
    rows = cursor.execute('SELECT * FROM user ORDER BY age DESC')
    for row in rows:
        print(row)

The output is as follows:

Figure 8.6: Output of data displaying age in descending order

2. Sort the data in chapter.db by age in ascending order, as follows:

with sqlite3.connect("chapter.db") as conn:
    cursor = conn.cursor()
    rows = cursor.execute('SELECT * FROM user ORDER BY age')
    for row in rows:
        print(row)

The output is as follows:

Figure 8.7: Output of data displaying age in ascending order

Notice that we don't need to specify the order as ASC to sort in ascending order.
EXERCISE 110: ALTERING THE STRUCTURE OF A TABLE AND UPDATING THE NEW FIELDS
In this exercise, we are going to add a column using ALTER and then UPDATE the values in the newly added column.

The UPDATE command is used to edit/update any row after it has been inserted. Be careful when using it, because using UPDATE without selective clauses (such as WHERE) affects the entire table:

1. Establish the connection with the database by using the following command:

with sqlite3.connect("chapter.db") as conn:
    cursor = conn.cursor()

2. Add another column to the user table and fill it with null values by using the following command:

cursor.execute("ALTER TABLE user ADD COLUMN gender text")

3. Update all of the values of gender so that they are M by using the following command:

cursor.execute("UPDATE user SET gender='M'")
conn.commit()

4. To check the altered table, execute the following command:

rows = cursor.execute('SELECT * FROM user')
for row in rows:
    print(row)

Figure 8.8: Output after altering the table

We have updated the entire table by setting the gender of all the users to M, where M stands for male.

EXERCISE 111: GROUPING VALUES IN TABLES
In this exercise, we will learn about a concept that we have already encountered in pandas: the GROUP BY clause. The GROUP BY clause is a technique that's used to retrieve distinct values from the database and place them in individual buckets.

The following diagram explains how the GROUP BY clause works:

Figure 8.9: Illustration of the GROUP BY clause on a table

In the preceding diagram, we can see that the Col3 column has only two unique values across all rows, A and B.

The command that's used to check the total number of rows belonging to each group is as follows:

SELECT count(*), col3 FROM table1 GROUP BY col3

Add a female user to the table and then group the users by gender:

1. Add a female user to the table:

cursor.execute("INSERT INTO user VALUES ('shelly@www.com', 'Shelly', 'Milar', '123, Ocean View Lane', 39, 'F')")

2. Run the following code to see the count for each gender:

rows = cursor.execute("SELECT COUNT(*), gender FROM user GROUP BY gender")
for row in rows:
    print(row)

The output is as follows:

Figure 8.10: Output of the GROUP BY clause

RELATION MAPPING IN DATABASES
We have been working with a single table, altering it as well as reading back the data. However, the real power of an RDBMS comes from the handling of relationships among different objects (tables). In this section, we are going to create a new table called comments and link it with the user table in a 1:N relationship. This means that one user can have multiple comments. The way we are going to do this is by adding the user table's primary key as a foreign key in the comments table. This will create a 1:N relationship.

When we link two tables, we need to specify to the database engine what should be done if the parent row, which has many children in the other table, is deleted. As we can see in the following diagram, we are asking what happens at the place of the question marks when we delete row 1 of the user table:

Figure 8.11: Illustration of relations

In a non-RDBMS situation, this can quickly become difficult and messy to manage and maintain. However, with an RDBMS, all we have to do is tell the database engine, in a very precise way, what to do when a situation like this occurs. The database engine will do the rest for us. We use ON DELETE to tell the engine what to do with all the rows of a table when the parent row gets deleted. The following code illustrates these concepts:

with sqlite3.connect("chapter.db") as conn:
    cursor = conn.cursor()
    cursor.execute("PRAGMA foreign_keys = 1")
    sql = """
    CREATE TABLE comments (
        user_id text,
        comments text,
        FOREIGN KEY (user_id) REFERENCES user (email)
        ON DELETE CASCADE ON UPDATE NO ACTION
    )
    """
    cursor.execute(sql)
    conn.commit()

The ON DELETE CASCADE line informs the database engine that we want to delete all the child rows when the parent gets deleted. We can also define actions for UPDATE. In this case, there is nothing to do on UPDATE.

The FOREIGN KEY modifier modifies a column definition (user_id, in this case) and marks it as a foreign key, which is related to the primary key (email, in this case) of another table.

You may notice the strange-looking cursor.execute("PRAGMA foreign_keys = 1") line in the code. It is there because SQLite does not enable the normal foreign key features by default. It is this line that enables that feature. It is specific to SQLite and we won't need it for any other databases.
ADDING ROWS IN THE COMMENTS TABLE
We have created a table called comments. In this section, we will dynamically generate an insert query, as follows:

with sqlite3.connect("chapter.db") as conn:
    cursor = conn.cursor()
    cursor.execute("PRAGMA foreign_keys = 1")
    sql = "INSERT INTO comments VALUES ('{}', '{}')"
    rows = cursor.execute('SELECT * FROM user ORDER BY age')
    for row in rows:
        email = row[0]
        print("Going to create rows for {}".format(email))
        name = row[1] + " " + row[2]
        for i in range(10):
            comment = "This is comment {} by {}".format(i, name)
            conn.cursor().execute(sql.format(email, comment))
    conn.commit()

Pay attention to how we dynamically generate the insert query so that we can insert 10 comments for each user.
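As a side note (not part of the preceding code), building SQL strings with format() works here because we control the data ourselves, but the DB API also supports parameterized queries, which are safer when values come from outside sources. A minimal sketch of the same insert using SQLite's ? placeholders, intended as a drop-in replacement for the two lines inside the inner loop above:

sql = "INSERT INTO comments VALUES (?, ?)"
# the driver substitutes and escapes the values for us
conn.cursor().execute(sql, (email, comment))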

JOINS
In this section, we will learn how to exploit the relationship we just built. This means that if we have the primary key from one table, we can recover all the data needed from that table and also all the linked rows from the child table. To achieve this, we will use something called a join.

A join is basically a way to retrieve linked rows from two tables using any kind of primary key-foreign key relation that they have. There are many types of join, such as INNER, LEFT OUTER, RIGHT OUTER, FULL OUTER, and CROSS. They are used in different situations. However, most of the time, in simple 1:N relations, we end up using an INNER join. In Chapter 1, Introduction to Data Wrangling with Python, we learned about sets; we can view an INNER JOIN as an intersection of two sets. The following diagram illustrates the concept:

Figure 8.12: Intersection Join

Here, A represents one table and B represents another. The meaning of having common members is to have a relationship between them. The join takes all of the rows of A and compares them with all of the rows of B to find the matching rows that satisfy the join predicate. This can quickly become a complex and time-consuming operation. Joins can be very expensive operations. Usually, we use some kind of WHERE clause, after we specify the join, to shorten the scope of rows that are fetched from table A or B to perform the matching.

In our case, our first table, user, has three entries, with the primary key being the email. We can make use of this in our query to get comments just from Bob:
with sqlite3.connect("chapter.db") as conn:
    cursor = conn.cursor()
    cursor.execute("PRAGMA foreign_keys = 1")
    sql = """
    SELECT * FROM comments
    JOIN user ON comments.user_id = user.email
    WHERE user.email='bob@example.com'
    """
    rows = cursor.execute(sql)
    for row in rows:
        print(row)

The output is as follows:

('bob@example.com', 'This is comment 0 by Bob Codd', 'bob@example.com', 'Bob', 'Codd', '123 Fantasy lane, Fantasy City', 31, None)
('bob@example.com', 'This is comment 1 by Bob Codd', 'bob@example.com', 'Bob', 'Codd', '123 Fantasy lane, Fantasy City', 31, None)
('bob@example.com', 'This is comment 2 by Bob Codd', 'bob@example.com', 'Bob', 'Codd', '123 Fantasy lane, Fantasy City', 31, None)
('bob@example.com', 'This is comment 3 by Bob Codd', 'bob@example.com', 'Bob', 'Codd', '123 Fantasy lane, Fantasy City', 31, None)
('bob@example.com', 'This is comment 4 by Bob Codd', 'bob@example.com', 'Bob', 'Codd', '123 Fantasy lane, Fantasy City', 31, None)
('bob@example.com', 'This is comment 5 by Bob Codd', 'bob@example.com', 'Bob', 'Codd', '123 Fantasy lane, Fantasy City', 31, None)
('bob@example.com', 'This is comment 6 by Bob Codd', 'bob@example.com', 'Bob', 'Codd', '123 Fantasy lane, Fantasy City', 31, None)
('bob@example.com', 'This is comment 7 by Bob Codd', 'bob@example.com', 'Bob', 'Codd', '123 Fantasy lane, Fantasy City', 31, None)
('bob@example.com', 'This is comment 8 by Bob Codd', 'bob@example.com', 'Bob', 'Codd', '123 Fantasy lane, Fantasy City', 31, None)
('bob@example.com', 'This is comment 9 by Bob Codd', 'bob@example.com', 'Bob', 'Codd', '123 Fantasy lane, Fantasy City', 31, None)

Figure 8.13: Output of the Join query
RETRIEVING SPECIFIC COLUMNS FROM A JOIN QUERY
In the previous section, we saw that we can use a JOIN to fetch the related rows from two tables. However, if we look at the results, we will see that it returned all the columns, thus combining both tables. This is not very concise. What if we only want to see the emails and the related comments, and not all the data?

There is some nice shorthand that lets us do this:

with sqlite3.connect("chapter.db") as conn:
    cursor = conn.cursor()
    cursor.execute("PRAGMA foreign_keys = 1")
    sql = """
    SELECT comments.* FROM comments
    JOIN user ON comments.user_id = user.email
    WHERE user.email='bob@example.com'
    """
    rows = cursor.execute(sql)
    for row in rows:
        print(row)

Just by changing the SELECT statement, we made our final result look as follows:

('bob@example.com', 'This is comment 0 by Bob Codd')
('bob@example.com', 'This is comment 1 by Bob Codd')
('bob@example.com', 'This is comment 2 by Bob Codd')
('bob@example.com', 'This is comment 3 by Bob Codd')
('bob@example.com', 'This is comment 4 by Bob Codd')
('bob@example.com', 'This is comment 5 by Bob Codd')
('bob@example.com', 'This is comment 6 by Bob Codd')
('bob@example.com', 'This is comment 7 by Bob Codd')
('bob@example.com', 'This is comment 8 by Bob Codd')
('bob@example.com', 'This is comment 9 by Bob Codd')
EXERCISE 112: DELETING ROWS
In this exercise, we are going to delete a row from the user table and observe the effects it will have on the comments table. Be very careful when running this command, as it can have a destructive effect on the data. Please keep in mind that it should almost always be run with a WHERE clause, so that we delete just a part of the data and not everything:

1. To delete a row from a table, we use the DELETE clause in SQL. To run delete on the user table, we are going to use the following code:

with sqlite3.connect("chapter.db") as conn:
    cursor = conn.cursor()
    cursor.execute("PRAGMA foreign_keys = 1")
    cursor.execute("DELETE FROM user WHERE email='bob@example.com'")
    conn.commit()

2. Perform the SELECT operation on the user table:

with sqlite3.connect("chapter.db") as conn:
    cursor = conn.cursor()
    cursor.execute("PRAGMA foreign_keys = 1")
    rows = cursor.execute("SELECT * FROM user")
    for row in rows:
        print(row)

Observe that the user Bob has been deleted.

Now, moving on to the comments table, we have to remember that we mentioned ON DELETE CASCADE while creating the table. The database engine knows that if a row is deleted from the parent table (user), all the related rows from the child tables (comments) will have to be deleted.

3. Perform a select operation on the comments table by using the following command:

with sqlite3.connect("chapter.db") as conn:
    cursor = conn.cursor()
    cursor.execute("PRAGMA foreign_keys = 1")
    rows = cursor.execute("SELECT * FROM comments")
    for row in rows:
        print(row)

The output is as follows:

('tom@web.com', 'This is comment 0 by Tom Fake')
('tom@web.com', 'This is comment 1 by Tom Fake')
('tom@web.com', 'This is comment 2 by Tom Fake')
('tom@web.com', 'This is comment 3 by Tom Fake')
('tom@web.com', 'This is comment 4 by Tom Fake')
('tom@web.com', 'This is comment 5 by Tom Fake')
('tom@web.com', 'This is comment 6 by Tom Fake')
('tom@web.com', 'This is comment 7 by Tom Fake')
('tom@web.com', 'This is comment 8 by Tom Fake')
('tom@web.com', 'This is comment 9 by Tom Fake')

We can see that all of the rows related to Bob have been deleted.

UPDATING SPECIFIC VALUES IN A TABLE
In this section, we will see how we can update rows in a table. We have already looked at this in the past but, as we mentioned, only at the table level. Without WHERE, updating is often a bad idea.

Combine UPDATE with WHERE to selectively update the first name of the user with the email address tom@web.com:

with sqlite3.connect("chapter.db") as conn:
    cursor = conn.cursor()
    cursor.execute("PRAGMA foreign_keys = 1")
    cursor.execute("UPDATE user SET first_name='Chris' WHERE email='tom@web.com'")
    conn.commit()
    rows = cursor.execute("SELECT * FROM user")
    for row in rows:
        print(row)

The output is as follows:

Figure 8.14: Output of the update query


EXERCISE 113: RDBMS AND DATAFRAMES
We have looked into many fundamental aspects of storing and querying data from a database, but as data wrangling experts, we need our data to be packed and presented as a DataFrame so that we can perform quick and convenient operations on it:

1. Import pandas using the following code:

import pandas as pd

2. Create a columns list with email, first name, last name, age, gender, and comments as column names. Also, create an empty data list:

columns = ["Email", "First Name", "Last Name", "Age", "Gender", "Comments"]
data = []

3. Connect to chapter.db using SQLite and obtain a cursor, as follows:

with sqlite3.connect("chapter.db") as conn:
    cursor = conn.cursor()

Use the execute method of the cursor to set "PRAGMA foreign_keys = 1":

cursor.execute("PRAGMA foreign_keys = 1")

4. Create a sql variable that will contain the SELECT command and use the JOIN command to join the tables:

sql = """
SELECT user.email, user.first_name, user.last_name, user.age, user.gender, comments.comments FROM comments
JOIN user ON comments.user_id = user.email
WHERE user.email = 'tom@web.com'
"""

5. Use the execute method of the cursor to execute the sql command:

rows = cursor.execute(sql)

6. Append the rows to the data list:

for row in rows:
    data.append(row)

7. Create a DataFrame using the data list:

df = pd.DataFrame(data, columns=columns)

8. We have created the DataFrame using the data list. You can print the values in the DataFrame using df.head().
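As an aside (not part of the exercise), pandas can also build the DataFrame directly from a query with its read_sql_query function, which saves the manual row-appending loop. A minimal sketch, assuming the same chapter.db and the same query as above:

import sqlite3
import pandas as pd

sql = """
SELECT user.email, user.first_name, user.last_name, user.age, user.gender, comments.comments
FROM comments
JOIN user ON comments.user_id = user.email
WHERE user.email = 'tom@web.com'
"""

with sqlite3.connect("chapter.db") as conn:
    # pandas runs the query and packs the result set straight into a DataFrame
    df = pd.read_sql_query(sql, conn)

print(df.head())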

ACTIVITY 11: RETRIEVING DATA CORRECTLY FROM DATABASES
In this activity, we have the persons table:

Figure 8.15: The persons table

We have the pets table:

Figure 8.16: The pets table

As we can see, the id column in the persons table (which is an integer) serves as the primary key for that table and as a foreign key for the pets table, which is linked via the owner_id column.

The persons table has the following columns:

first_name: The first name of the person
last_name: The last name of the person (can be "null")
age: The age of the person
city: The city he/she is from
zip_code: The zip code of the city

The pets table has the following columns:

pet_name: The name of the pet.
pet_type: What type of pet it is, for example, cat, dog, and so on. Due to a lack of further information, we do not know which number represents what, but it is an integer and can be null.
treatment_done: It is also an integer column, and 0 here represents "No", whereas 1 represents "Yes".

The name of the SQLite DB is petsdb and it is supplied along with the Activity notebook.

These steps will help you complete this activity:

1. Connect to petsdb and check whether the connection has been successful.
2. Find the different age groups in the persons database.
3. Find the age group that has the maximum number of people.
4. Find the people who do not have a last name.
5. Find out how many people have more than one pet.
6. Find out how many pets have received treatment.
7. Find out how many pets have received treatment and the type of pet is known.
8. Find out how many pets are from the city called east port.
9. Find out how many pets are from the city called east port and received a treatment.

Note

The solution for this activity can be found on page 324.

Summary
We have come to the end of the database chapter. We have learned how to connect to SQLite using Python. We have brushed up on the basics of relational databases and learned how to open and close a database. We then learned how to export this relational database into Python DataFrames.

In the next chapter, we will be performing data wrangling on real-world datasets.
Chapter 9
Application of Data Wrangling in Real Life
Learning Objectives
By the end of this chapter, you will be able to:

Perform data wrangling on multiple full-fledged datasets from renowned sources

Create a unified dataset that can be passed on to a data science team for machine learning and predictive analytics

Relate data wrangling to version control, containerization, cloud services for data analytics, and big data technologies such as Apache Spark and Hadoop

In this chapter, you will apply your gathered knowledge to real-life datasets and investigate various aspects of them.

Introduction
We learned about databases in the previous chapter, so now it is time to combine the knowledge of data wrangling and Python with a real-world scenario. In the real world, data from one source is often inadequate to perform analysis. Generally, a data wrangler has to distinguish between relevant and non-relevant data and combine data from different sources.

The primary job of a data wrangling expert is to pull data from multiple sources, format and clean it (impute the data if it is missing), and finally combine it in a coherent manner to prepare a dataset for further analysis by data scientists or machine learning engineers.

In this topic, we will try to mimic such a typical task flow by downloading and using two different datasets from reputed web portals. Each of the datasets contains partial data pertaining to the key question that is being asked. Let's examine it more closely.

Applying Your Knowledge to a Real-life Data Wrangling Task
Suppose you are asked this question: In India, did the enrollment in primary/secondary/tertiary education increase with the improvement of per capita GDP in the past 15 years? The actual modeling and analysis will be done by a senior data scientist, who will use machine learning and data visualization for the analysis. As a data wrangling expert, your job will be to acquire and provide a clean dataset that contains educational enrollment and GDP data side by side.

Suppose you have a link for a dataset from the United Nations and you can download the dataset of education (for all the nations around the world). But this dataset has some missing values and, moreover, it does not have any GDP information. Someone has also given you another separate CSV file (downloaded from the World Bank site) which contains GDP data, but in a messy format.

In this activity, we will examine how to handle these two separate sources and clean the data to prepare a simple final dataset with the required data and save it to the local drive as a SQL database file:

Figure 9.1: Pictorial representation of the merging of education and economic data

You are encouraged to follow along with the code and results in the notebook and try to understand and internalize the nature of the data wrangling flow. You are also encouraged to try extracting various data from these files and answering your own questions about nations' socio-economic factors and their inter-relationships.

Note
Coming up with interesting questions about social, economic, technological, and geo-political topics and then answering them using freely available data and a little bit of programming knowledge is one of the most fun ways to learn about any data science topic. You will get a flavor of that process in this chapter.

Data Imputation

Clearly, we are missing some data. Let's say we decide to impute these data points by simple linear interpolation between the available data points. We could take out a pen and paper or a calculator and compute those values and manually create a dataset. But being data wranglers, we will of course take advantage of Python programming and use pandas imputation methods for this task.

But to do that, we first need to create a DataFrame with missing values in it; that is, we need to append another DataFrame with missing values to the current DataFrame.
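The following is a minimal sketch of that idea (with made-up years and enrollment numbers, not the actual UN data): append rows containing NaN for the missing years, sort, and let pandas fill them in by linear interpolation:

import numpy as np
import pandas as pd

# Hypothetical data: enrollment is known for 2003 and 2006, missing in between
df = pd.DataFrame({"Year": [2003, 2006], "Enrollment": [100.0, 130.0]})
missing = pd.DataFrame({"Year": [2004, 2005], "Enrollment": [np.nan, np.nan]})

df = pd.concat([df, missing]).sort_values("Year").reset_index(drop=True)
df["Enrollment"] = df["Enrollment"].interpolate()  # linear interpolation by default
print(df)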

Activity 12: Data Wrangling Task – Fixing UN Data
Suppose the agenda of the data analysis is to find out whether the enrollment in primary, secondary, or tertiary education has increased with the improvement of per capita GDP in the past 15 years. For this task, we will first need to clean or wrangle the two datasets, that is, the Education Enrollment and GDP data.

The UN data is available at https://github.com/TrainingByPackt/Data-Wrangling-with-Python/blob/master/Chapter09/Activity12-15/SYB61_T07_Education.csv.

Note
If you download the CSV file and open it using Excel, then you will see that the Footnotes column sometimes contains useful notes. We may not want to drop it in the beginning. If we are interested in a particular country's data (like we are in this task), then it may well turn out that Footnotes will be NaN, that is, blank. In that case, we can drop it at the end. But for some countries or regions, it may contain information.

These steps will guide you to find the solution:

1. Download the UN dataset from GitHub from the following link: https://github.com/TrainingByPackt/Data-Wrangling-with-Python/blob/master/Chapter09/Activity13/India_World_Bank_Info.csv.

The UN data has missing values. Clean the data to prepare a simple final dataset with the required data and save it to the local drive as a SQL database file.

2. Use the pd.read_csv method of pandas to create a DataFrame.
3. Since the first row does not contain useful information, skip it using the skiprows parameter.
4. Drop the region/country/area and source columns.
5. Assign the following names as columns of the DataFrame: Region/County/Area, Year, Data, Value, and Footnotes.
6. Check how many unique values are present in the Footnotes column.
7. Check the type of the Value column.
8. Create a function to convert the Value column into a floating-point number.
9. Use the apply method to apply this function to the Value column.
10. Print the unique values in the Data column.

Note:

The solution for this activity can be found on page 338.
Activity 13: Data Wrangling Task – Cleaning GDP Data
The GDP data is available on https://data.worldbank.org/ and it is also available on GitHub at https://github.com/TrainingByPackt/Data-Wrangling-with-Python/blob/master/Chapter09/Activity12-15/India_World_Bank_Info.csv.

In this activity, we will clean the GDP data:

1. Create three DataFrames from the original DataFrame using filtering. Create the df_primary, df_secondary, and df_tertiary DataFrames for students enrolled in primary education, secondary education, and tertiary education in thousands, respectively.
2. Plot bar charts of the enrollment of primary students in a low-income country like India and a higher-income country like the USA.
3. Since there is missing data, use pandas imputation methods to impute these data points by simple linear interpolation between data points. To do that, create a DataFrame with missing values inserted and append a new DataFrame with missing values to the current DataFrame.
4. (For India) Append the rows corresponding to the missing years – 2004 – 2009 and 2011 – 2013.
5. Create a dictionary of values with np.nan. Note that there are 9 missing data points, so we need to create a list with identical values repeated 9 times.
6. Create a DataFrame of missing values (from the preceding dictionary) that we can append.
7. Append the DataFrames together.
8. Sort by Year and reset the indices using reset_index. Use inplace=True to execute the changes on the DataFrame itself.
9. Use the interpolate method for linear interpolation. It fills all the NaNs with linearly interpolated values. See the following link for more details about this method: http://pandas.pydata.org/pandas-docs/version/0.17/generated/pandas.DataFrame.interpolate.html.
10. Repeat the same steps for the USA (or other countries).
11. If there are values that are unfilled, use the limit and limit_direction parameters with the interpolate method to fill them in.
12. Plot the final graph using the new data.
13. Read the GDP data using the pandas read_csv method. It will generally throw an error.
14. To avoid errors, try the error_bad_lines = False option.
15. Since there is no delimiter in the file, add the \t delimiter.
16. Use the skiprows parameter to remove rows that are not useful.
17. Examine the dataset. Filter the dataset for information that states that it is similar to the previous education dataset.
18. Reset the index for this new dataset.
19. Drop the rows that are not useful and re-index the dataset.
20. Rename the columns properly. This is necessary for merging the two datasets.
21. We will concentrate only on the data from 2003 to 2016. Eliminate the remaining data.
22. Create a new DataFrame called df_gdp with rows 43 to 56.

Note

The solution for this activity can be found on page 338.

Activity 14: Data Wrangling Task – Merging UN Data and GDP Data
The steps to merge the two databases are as follows:

1. Reset the indexes for merging.
2. Merge the two DataFrames, primary_enrollment_india and df_gdp, on the Year column.
3. Drop the data, footnotes, and region/county/area columns.
4. Rearrange the columns for proper viewing and presentation.

Note

The solution for this activity can be found on page 345.

Activity 15: Data Wrangling Task – Connecting the New Data to the Database
The steps to connect the data to the database are as follows:

1. Import the sqlite3 module of Python and use the connect function to connect to the database. The main database engine is embedded. But for a different database, like PostgreSQL or MySQL, we would need to connect to it using the appropriate credentials. We designate Year as the PRIMARY KEY of this table.
2. Then, run a loop over the dataset rows one by one to insert them into the table.
3. If we look at the current folder, we should see a file called Education_GDP.db, and if we examine it using a database viewer program, we can see that the data has been transferred there.

Note

The solution for this activity can be found on page 347.

In this notebook, we examined a complete data wrangling flow, including reading data from the web and the local drive, filtering, cleaning, quick visualization, imputation, indexing, merging, and writing back to a database table. We also wrote custom functions to transform some of the data and saw how to handle situations where we may get errors when reading the file.

An Extension to Data Wrangling
This is the concluding chapter of our book, where we want to give you a broad overview of some of the exciting technologies and frameworks that you may need to learn beyond data wrangling to work as a full-stack data scientist. Data wrangling is an essential part of the whole data science and analytics pipeline, but it is not the whole enterprise. You have learned invaluable skills and techniques in this book, but it is always good to broaden your horizons and look beyond to see what other tools out there can give you an edge in this competitive and ever-changing world.

ADDITIONAL SKILLS REQUIRED TO BECOME A DATA SCIENTIST
To practice as a fully qualified data scientist/analyst, you should have some basic skills in your repertoire, irrespective of the particular programming language you choose to focus on. These skills and know-hows are language agnostic and can be utilized with any framework that you have to embrace, depending on your organization and business needs. We describe them in brief here:

Git and version control: Git is to version control what an RDBMS is to data storage and querying. It simply means that there is a huge gap between the pre- and post-Git eras of version controlling your code. As you may have noticed, all the notebooks for this book are hosted on GitHub, and this was done to take advantage of the powerful Git VCS. It gives you, out of the box, version control, history, branching facilities for different code, merging of different code branches, and advanced operations like cherry-picking, diff, and so on. It is a very essential tool to master, as you can be almost sure that you will face it at some point in your journey. Packt has a very good book on it; you can check that out for more information.

Linux command line: People coming from a Windows background (or even Mac, if you have not done any development before) are usually not very familiar with the command line. The polished UIs of those OSes hide the low-level details of interacting with the OS using a command line. However, as a data professional, it is important that you know the command line well. The number of operations you can perform simply from the command line is astonishing.

SQL and basic relational database concepts: We dedicated an entire chapter to SQL and RDBMS. However, as we already mentioned there, it was really not enough. This is a vast subject and needs years of study to master. Try to read more about it (including theory and practice) from books and online sources. Do not forget that, despite all the other sources of data being used nowadays, we still have enormous amounts of structured data stored in legacy database systems. You can be sure to come across one, sooner or later.

Docker and containerization: Since its first release in 2013, Docker has changed the way we distribute and deploy software in server-based applications. It gives you a clean and lightweight abstraction over the underlying OS and lets you iterate fast on development without the headache of creating and maintaining a proper environment. It is very useful in both the development and production phases. With virtually no competitor present, it is becoming the industry default very fast. We strongly advise you to explore it in great detail.

BASIC FAMILIARITY WITH BIG DATA AND CLOUD TECHNOLOGIES
Big data and cloud platforms are the latest trend. We will introduce them here with one or two short sentences each, and we encourage you to go ahead and learn about them as much as you can. If you are planning to grow as a data professional, then you can be sure that without these necessary skills it will be hard for you to transition to the next level:

Fundamental characteristics of big data: Big data is simply data that is very big in size. The term size is a bit ambiguous here. It can mean one static chunk of data (like the detailed census data of a big country like India or the US) or data that is dynamically generated as time passes, and each time it is huge. To give an example of the second category, we can think of how much data is generated by Facebook per day: about 500+ terabytes per day. You can easily imagine that we will need specialized tools to deal with that amount of data. There are three different categories of big data, that is, structured, unstructured, and semi-structured. The main features that define big data are Volume, Variety, Velocity, and Variability.

Hadoop ecosystem: Apache Hadoop (and the related ecosystem) is a software framework that aims to use the Map-Reduce programming model to simplify the storage and processing of big data. It has since become one of the backbones of big data processing in the industry. The modules in Hadoop are designed keeping in mind that hardware failures are common occurrences and should be automatically handled by the framework. The four base modules of Hadoop are Common, HDFS, YARN, and MapReduce. The Hadoop ecosystem consists of Apache Pig, Apache Hive, Apache Impala, Apache Zookeeper, Apache HBase, and more. They are very important bricks in many high-demand and cutting-edge data pipelines. We encourage you to study more about them. They are essential in any industry that aims to leverage data.

Apache Spark: Apache Spark is a general-purpose cluster computing framework that was initially developed at the University of California, Berkeley, and released in 2014. It gives you an interface to program an entire cluster of computers with built-in data parallelism and fault tolerance. It contains Spark Core, Spark SQL, Spark Streaming, MLlib (for machine learning), and GraphX. It is now one of the main frameworks used in the industry to process huge amounts of data in real time based on streaming data. We encourage you to read about and master it if you want to go toward real-time data engineering.

Amazon Web Services (AWS): Amazon Web Services (often abbreviated as AWS) is a collection of managed services offered by Amazon, ranging from Infrastructure-as-a-Service, Database-as-a-Service, and Machine-Learning-as-a-Service to caches, load balancers, NoSQL databases, message queues, and several other types. They are very useful for all sorts of applications, from a simple web app to a multi-cluster data pipeline. Many famous companies run their entire infrastructure on AWS (such as Netflix). AWS gives us on-demand provisioning, easy scaling, a managed environment, a slick UI to control everything, and also a very powerful command-line client. It also exposes a rich set of APIs, and we can find an AWS API client in virtually any programming language. The Python one is called Boto3. If you are planning to become a data professional, then it can be said with near certainty that you will end up using many of these services at one point or another.

WHAT GOES WITH DATA WRANGLING?
We learned in Chapter 1, Introduction to Data Wrangling with Python, that the process of data wrangling lies in between data gathering and advanced analytics, including visualization and machine learning. However, the boundaries between these processes may not always be strict and rigid. It depends largely on the organizational culture and team composition.

Therefore, we need to be aware of not only data wrangling but also the other components of the data science platform, in order to wrangle data effectively. Even if you are performing pure data wrangling tasks, having a good grasp of how data is sourced and utilized will give you an edge in coming up with unique and efficient solutions to complex data wrangling problems, and it will enhance the value of those solutions to the machine learning scientist or the business domain expert:

Figure 9.2: Process of data wrangling

Now, we have, in fact, already laid out a solid groundwork in this book for the data platform part, assuming that it is an integral part of the data wrangling workflow. For example, we have covered web scraping, working with RESTful APIs, and database access and manipulation using Python libraries in detail.
We have also touched on basic visualization techniques and plotting functions in Python using matplotlib. However, there are other advanced statistical plotting libraries, such as Seaborn, that you can master for more sophisticated visualization for data science tasks.
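For a taste of what Seaborn adds on top of matplotlib, here is a minimal, illustrative sketch with a small made-up DataFrame (not data from this book):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical data: per-country enrollment figures over two years
df = pd.DataFrame({"country": ["India", "USA", "India", "USA"],
                   "year": [2003, 2003, 2004, 2004],
                   "enrollment": [100, 25, 110, 26]})

# One call produces a grouped bar chart with sensible default styling
sns.barplot(x="year", y="enrollment", hue="country", data=df)
plt.show()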

Business logic and domain expertise is the most varied topic, and it can only be learned on the job; it will come eventually with experience. If you have an academic background and/or work experience in any domain, such as finance, medicine and healthcare, or engineering, that knowledge will come in handy in your data science career.

The fruit of the hard work of data wrangling is realized fully in the domain of machine learning. It is the science and engineering of making machines learn patterns and insights from data for predictive analytics and intelligent, automated decision-making with a deluge of data, which cannot be analyzed efficiently by humans. Machine learning has become one of the most sought-after skills in the modern technology landscape. It has truly become one of the most exciting and promising intellectual fields, with applications ranging from e-commerce to healthcare and virtually everything in between. Data wrangling is intrinsically linked with machine learning, as it prepares the data so that it's suitable for intelligent algorithms to process. Even if you start your career in data wrangling, it could be a natural progression to move on to machine learning.

Packt has published numerous books on this topic that you should explore. In the next section, we will touch upon some approaches to adopt and Python libraries to check out to give you a boost in your learning.

TIPS AND TRICKS FOR MASTERING MACHINE LEARNING
Machine learning is difficult to start with. We have listed some structured approaches and incredible free resources that are available so that you can begin your journey:

Understand the definitions of, and the differences between, the buzzwords — artificial intelligence, machine learning, deep learning, and data science. Cultivate the habit of reading great posts or listening to expert talks on these topics, and understand their true reach and applicability to a given business problem.

Stay updated on recent trends by watching videos, reading books like The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World, reading articles, and following influential blogs such as KDnuggets, Brandon Rohrer's blog, OpenAI's blog about their research, the Towards Data Science publication on Medium, and so on.

As you learn new algorithms or concepts, pause and analyze how you can apply these machine learning concepts or algorithms in your daily work. This is the best method for learning and expanding your knowledge base.

If you choose Python as your preferred language for machine learning tasks, you have a great ML library in scikit-learn. It is the most widely used general machine learning package in the Python ecosystem. scikit-learn has a wide variety of supervised and unsupervised learning algorithms, which are exposed via a stable, consistent interface. Moreover, it is specifically designed to interface seamlessly with other popular data wrangling and numerical libraries, such as NumPy and pandas. A minimal sketch of that interface is shown after this list.

Another hot skill in today's job market is deep learning. Packt has many books on this topic, and there are excellent MOOCs available where you can study deep learning. For Python libraries, you can learn and practice with TensorFlow, Keras, or PyTorch for deep learning.
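The following is a minimal, illustrative sketch of scikit-learn's fit/predict interface, using made-up numbers rather than any dataset from this book:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: X holds per capita GDP values, y holds enrollment figures
X = np.array([[1000], [1500], [2000], [2500]])
y = np.array([100, 115, 128, 140])

model = LinearRegression()
model.fit(X, y)                    # every scikit-learn estimator exposes fit()
print(model.predict([[3000]]))     # ...and predict() for new observations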

Summary
Data is everywhere and it is all around us. In these nine chapters, we have learned how data of different types and from different sources can be cleaned, corrected, and combined. Using the power of Python and the knowledge of data wrangling, and applying the tricks and tips that you have studied in this book, you are ready to be a data wrangler.
Appendix
About
This section is included to assist the students to perform the activities in the book. It includes detailed steps that are to be performed by the students to achieve the objectives of the activities.

SOLUTION OF ACTIVITY 1: HANDLING LISTS
These are the steps to complete this activity:

1. Import the random library:

import random

2. Set the maximum number of random numbers:

LIMIT = 100

3. Use the randint function from the random library to create 100 random numbers. Tip: try getting a list with the least number of duplicates:

random_number_list = [random.randint(0, LIMIT) for x in range(0, LIMIT)]

4. Print random_number_list:

random_number_list

The sample output is as follows:

Figure 1.16: Section of output for random_number_list

5. Create a list_with_divisible_by_3 list from random_number_list, which will contain only numbers that are divisible by 3:

list_with_divisible_by_3 = [a for a in random_number_list if a % 3 == 0]
list_with_divisible_by_3

The sample output is as follows:

Figure 1.17: Section of output for random_number_list divisible by 3

6. Use the len function to measure the length of the first list and the second list, and store them in two different variables, length_of_random_list and length_of_3_divisible_list. Calculate the difference in length in a variable called difference:

length_of_random_list = len(random_number_list)
length_of_3_divisible_list = len(list_with_divisible_by_3)
difference = length_of_random_list - length_of_3_divisible_list
difference

The sample output is as follows:

62

7. Combine the tasks we have performed so far and add a loop to them. Run the loop 10 times and add the values of the difference variable to a list:

NUMBER_OF_EXPERIMENTS = 10
difference_list = []
for i in range(0, NUMBER_OF_EXPERIMENTS):
    random_number_list = [random.randint(0, LIMIT) for x in range(0, LIMIT)]
    list_with_divisible_by_3 = [a for a in random_number_list if a % 3 == 0]
    length_of_random_list = len(random_number_list)
    length_of_3_divisible_list = len(list_with_divisible_by_3)
    difference = length_of_random_list - length_of_3_divisible_list
    difference_list.append(difference)
difference_list

The sample output is as follows:

[64, 61, 67, 60, 73, 66, 66, 75, 70, 61]

8. Then, calculate the arithmetic mean (common average) for the differences in the lengths that you have:

avg_diff = sum(difference_list) / float(len(difference_list))
avg_diff

The sample output is as follows:

66.3

SOLUTION OF ACTIVITY 2: ANALYZE A MULTILINE STRING AND GENERATE THE UNIQUE WORD COUNT
These are the steps to complete this activity:

1. Create a string called multiline_text and copy the text present in the first chapter of Pride and Prejudice into it. Use Ctrl + A to select the entire text, Ctrl + C to copy it, and paste the copied text into the string:

Figure 1.18: Initializing the multiline_text string

2. Find the type of the string using the type function:

type(multiline_text)

The output is as follows:

str

3. Now, find the length of the string using the len function:

len(multiline_text)

The output is as follows:

4475

4. Use string methods to get rid of all the newlines (\n or \r) and symbols. Remove all newlines by replacing them with an empty string:

multiline_text = multiline_text.replace('\n', "")

Then, we will print and check the output:

multiline_text

The output is as follows:

Figure 1.19: The multiline_text string after removing the newlines

5. Remove the special characters and punctuation:

# remove special chars, punctuation etc.
cleaned_multiline_text = ""
for char in multiline_text:
    if char == " ":
        cleaned_multiline_text += char
    elif char.isalnum():  # using the isalnum() method of strings
        cleaned_multiline_text += char
    else:
        cleaned_multiline_text += " "

6. Check the content of cleaned_multiline_text:

cleaned_multiline_text

The output is as follows:

Figure 1.20: The cleaned_multiline_text string

7. Generate a list of all the words from the cleaned string using the following command:

list_of_words = cleaned_multiline_text.split()
list_of_words

The output is as follows:

Figure 1.21: The section of output displaying the list_of_words

8. Find the number of words:

len(list_of_words)

The output is 852.

9. Create a list from the list you just created, which includes only unique words:

unique_words_as_dict = dict.fromkeys(list_of_words)
len(list(unique_words_as_dict.keys()))

The output is 340.

10. Count the number of times each of the unique words appeared in the cleaned text:

for word in list_of_words:
    if unique_words_as_dict[word] is None:
        unique_words_as_dict[word] = 1
    else:
        unique_words_as_dict[word] += 1
unique_words_as_dict

The output is as follows:

Figure 1.22: Section of output showing unique_words_as_dict

You just created, step by step, a unique word counter using all the neat tricks that you just learned.
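As an aside (not part of the original activity), the standard library's collections.Counter implements this same frequency-counting pattern in a single call; a minimal sketch would be:

from collections import Counter

word_counts = Counter(list_of_words)   # dict-like mapping of word -> count
word_counts.most_common(25)            # the 25 most frequent words, equivalent to the sorted() call in the next step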

11. Find the top 25 words from unique_words_as_dict:

top_words = sorted(unique_words_as_dict.items(), key=lambda key_val_tuple: key_val_tuple[1], reverse=True)
top_words[:25]

The output is as follows:

Figure 1.23: Top 25 unique words from multiline_text

SOLUTION OF ACTIVITY 3: PERMUTATION, ITERATOR, LAMBDA, LIST
These are the steps to solve this activity:

1. Look up the definitions of permutations and dropwhile from itertools. There is a way to look up the definition of a function inside Jupyter itself. Just type the function name, followed by ?, and press Shift + Enter:

from itertools import permutations, dropwhile
permutations?
dropwhile?

You will see a long list of definitions after each ?. We will skip it here.

2. Write an expression to generate all the possible three-digit numbers using 0, 1, and 2:

permutations(range(3))

The output is as follows:

<itertools.permutations at 0x7f6c6c077af0>

3. Loop over the iterator expression you generated before. Use print to print each element returned by the iterator. Use assert and isinstance to make sure that the elements are tuples:

for number_tuple in permutations(range(3)):
    print(number_tuple)
    assert isinstance(number_tuple, tuple)

The output is as follows:

(0, 1, 2)
(0, 2, 1)
(1, 0, 2)
(1, 2, 0)
(2, 0, 1)
(2, 1, 0)

4. Write the loop again. But this time, use dropwhile with a lambda expression to drop any leading zeros from the tuples. As an example, (0, 1, 2) will become [1, 2]. Also, cast the output of dropwhile to a list. An extra task can be to check the actual type that dropwhile returns without casting:

for number_tuple in permutations(range(3)):
    print(list(dropwhile(lambda x: x <= 0, number_tuple)))

The output is as follows:

[1, 2]
[2, 1]
[1, 0, 2]
[1, 2, 0]
[2, 0, 1]
[2, 1, 0]

5. Write all the logic you wrote before, but this time write a separate function where you will be passing the list generated from dropwhile, and the function will return the whole number contained in the list. As an example, if you pass [1, 2] to the function, it will return 12. Make sure that the return type is indeed a number and not a string. Although this task can be achieved using other tricks, we require that you treat the incoming list as a stack in the function and generate the number there:

import math

def convert_to_number(number_stack):
    final_number = 0
    for i in range(0, len(number_stack)):
        final_number += (number_stack.pop() * (math.pow(10, i)))
    return final_number

for number_tuple in permutations(range(3)):
    number_stack = list(dropwhile(lambda x: x <= 0, number_tuple))
    print(convert_to_number(number_stack))

The output is as follows:

12.0
21.0
102.0
120.0
201.0
210.0
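The numbers print as 12.0, 21.0, and so on because math.pow always returns a float. If you would rather have plain integers, a minimal alternative sketch (not the book's solution) replaces math.pow with integer arithmetic:

def convert_to_int(number_stack):
    final_number = 0
    for i in range(len(number_stack)):
        final_number += number_stack.pop() * (10 ** i)   # integer power keeps the result an int
    return final_number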

SOLUTION OF ACTIVITY 4: DESIGN YOUR OWN CSV PARSER
These are the steps to complete this activity:

1. Import zip_longest from itertools:

from itertools import zip_longest

2. Define the return_dict_from_csv_line function so that it zips header and line (with fillvalue set to None) and packs them into a dict:

def return_dict_from_csv_line(header, line):
    # Zip them
    zipped_line = zip_longest(header, line, fillvalue=None)
    # Use dict comprehension to generate the final dict
    ret_dict = {kv[0]: kv[1] for kv in zipped_line}
    return ret_dict

3. Open the accompanying sales_record.csv file using r mode inside a with block, with open("sales_record.csv", "r") as fd. First, check that it is opened, read the first line, and use string methods to generate a list of all the column names. When you read each line, pass that line to a function along with the list of the headers. The work of the function is to construct a dict out of these two and fill up the key:values. Keep in mind that a missing value should result in None:

with open("sales_record.csv", "r") as fd:
    first_line = fd.readline()
    header = first_line.replace("\n", "").split(",")
    for i, line in enumerate(fd):
        line = line.replace("\n", "").split(",")
        d = return_dict_from_csv_line(header, line)
        print(d)
        if i > 10:
            break

The output is as follows:

Figure 2.10: Section of output

SOLUTION OF ACTIVITY 5: GENERATING STATISTICS FROM A CSV FILE
These are the steps to complete this activity:

1. Load the necessary libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

2. Read in the Boston housing dataset (given as a .csv file) from the local directory:

# Hint: The Pandas function for reading a CSV file is 'read_csv'.
# Don't forget that all functions in Pandas can be accessed by syntax like pd.{function_name}
df=pd.read_csv("Boston_housing.csv")

3. Check the first 10 records:

df.head(10)

The output is as follows:

Figure 3.23: Output displaying the first 10 records

4. Find the total number of records:

df.shape

The output is as follows:

(506, 14)

5. Create a smaller DataFrame with columns that do not include CHAS, NOX, B, and LSTAT:

df1=df[['CRIM','ZN','INDUS','RM','AGE','DIS','RAD','TAX','PTRATIO','PRICE']]

6. Check the last 7 records of the new DataFrame you just created:

df1.tail(7)

The output is as follows:

Figure 3.24: Last seven records of the DataFrame

7. Plot histograms of all the variables (columns) in the new DataFrame by using a for loop:

for c in df1.columns:
    plt.title("Plot of "+c,fontsize=15)
    plt.hist(df1[c],bins=20)
    plt.show()

The output is as follows:

Figure 3.25: Plot of all variables using a for loop

8. Crime rate could be an indicator of house price (people don't want to live in high-crime areas). Create a scatter plot of crime rate versus price:

plt.scatter(df1['CRIM'],df1['PRICE'])
plt.show()

The output is as follows:

Figure 3.26: Scatter plot of crime rate versus price

We can understand the relationship better if we plot log10(crime) versus price.

9. Create that plot of log10(crime) versus price:

plt.scatter(np.log10(df1['CRIM']),df1['PRICE'],c='red')
plt.title("Crime rate (Log) vs. Price plot", fontsize=18)
plt.xlabel("Log of Crime rate",fontsize=15)
plt.ylabel("Price",fontsize=15)
plt.grid(True)
plt.show()

The output is as follows:

Figure 3.27: Scatter plot of crime rate (Log) versus price

10. Calculate the mean rooms per dwelling:

df1['RM'].mean()

The output is 6.284634387351788.

11. Calculate the median age:

df1['AGE'].median()

The output is 77.5.

12. Calculate the average (mean) distances to five Boston employment centers:

df1['DIS'].mean()

The output is 3.795042687747034.

13. Calculate the percentage of houses with a low price (< $20,000):

# Create a Pandas series and directly compare it with 20
# You can do this because a Pandas series is basically a NumPy array and you have seen how to filter a NumPy array
low_price=df1['PRICE']<20
# This creates a Boolean array of True, False
print(low_price)
# True = 1, False = 0, so now if you take an average of this NumPy array, you will know how many 1's are there.
# That many houses are priced below 20,000. So that is the answer.
# You can convert that into a percentage by multiplying by 100
pcnt=low_price.mean()*100
print("\nPercentage of house with <20,000 price is: ",pcnt)

The output is as follows:

0      False
1      False
2      False
3      False
4      False
5      False
6      False
7      False
8       True
9       True
10      True
...
500     True
501    False
502    False
503    False
504    False
505     True
Name: PRICE, Length: 506, dtype: bool

Percentage of house with <20,000 price is:  41.50197628458498

SOLUTION OF ACTIVITY 6: WORKING WITH THE ADULT INCOME DATASET (UCI)
These are the steps to complete this activity:

1. Load the necessary libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

2. Read in the adult income dataset (given as a .csv file) from the local directory and check the first 5 records:

df = pd.read_csv("adult_income_data.csv")
df.head()

The output is as follows:

Figure 4.61: DataFrame displaying the first five records from the .csv file

3. Create a script that will read a text file line by line and extract the first line, which is the header of the .csv file:

names = []
with open('adult_income_names.txt','r') as f:
    for line in f:
        f.readline()
        var=line.split(":")[0]
        names.append(var)
names

The output is as follows:

Figure 4.62: Names of the columns in the database

4. Add a name of Income for the response variable (last column) to the dataset by using the append command:

names.append('Income')

5. Read the new file again using the following command:

df = pd.read_csv("adult_income_data.csv",names=names)
df.head()

The output is as follows:

Figure 4.63: DataFrame with the income column added

6. Use the describe command to get the statistical summary of the dataset:

df.describe()

The output is as follows:

Figure 4.64: Statistical summary of the dataset

Note that only a small number of columns are included. Many variables in the dataset have multiple factors or classes.

7. Make a list of all the variables with classes by using the following command:

# Make a list of all variables with classes
vars_class = ['workclass','education','marital-status',
              'occupation','relationship','sex','native-country']

8. Create a loop to count and print them by using the following command:

for v in vars_class:
    classes=df[v].unique()
    num_classes = df[v].nunique()
    print("There are {} classes in the \"{}\" column. They are: {}".format(num_classes,v,classes))
    print("-"*100)

The output is as follows:

Figure 4.65: Output of different factors or classes

9. Find the missing values by using the following command:

df.isnull().sum()

The output is as follows:

Figure 4.66: Finding the missing values

10. Create a DataFrame with only age, education, and occupation by using subsetting:

df_subset = df[['age','education','occupation']]
df_subset.head()

The output is as follows:

Figure 4.67: Subset DataFrame

11. Plot a histogram of age with a bin size of 20:

df_subset['age'].hist(bins=20)

The output is as follows:

<matplotlib.axes._subplots.AxesSubplot at 0x19dea8d0>

Figure 4.68: Histogram of age with a bin size of 20

12. Plot boxplots for age grouped by education (use a long figure size of 25x10 and make the x-tick font size 15):

df_subset.boxplot(column='age',by='education',figsize=(25,10))
plt.xticks(fontsize=15)
plt.xlabel("Education",fontsize=20)
plt.show()

The output is as follows:

Figure 4.69: Boxplot of age grouped by education

Before doing any further operations, we need to use the apply method we learned in this chapter. It turns out that when reading the dataset from the CSV file, all the strings came with a whitespace character in front. So, we need to remove that whitespace from all the strings.

13. Create a function to strip the whitespace characters:

def strip_whitespace(s):
    return s.strip()

14. Use the apply method to apply this function to all the columns with string values, create a new column, copy the values from this new column to the old column, and drop the new column. This is the preferred method so that you don't accidentally delete valuable data. Most of the time, you will need to create a new column with a desired operation and then copy it back to the old column if necessary. Ignore any warning messages that are printed:

# Education column
df_subset['education_stripped']=df['education'].apply(strip_whitespace)
df_subset['education']=df_subset['education_stripped']
df_subset.drop(labels=['education_stripped'],axis=1,inplace=True)

# Occupation column
df_subset['occupation_stripped']=df['occupation'].apply(strip_whitespace)
df_subset['occupation']=df_subset['occupation_stripped']
df_subset.drop(labels=['occupation_stripped'],axis=1,inplace=True)

This is the sample warning message, which you should ignore:

Figure 4.70: Warning message to be ignored
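As an aside (not part of the book's solution), pandas also exposes vectorized string methods, so the same cleanup can be done without a custom function; a minimal sketch would be:

# Equivalent cleanup using pandas' built-in vectorized string accessor
df_subset['education'] = df['education'].str.strip()
df_subset['occupation'] = df['occupation'].str.strip()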

15. Find the number of people who are aged between 30 and 50 (inclusive) by using the following command:

# Conditional clauses joined by & (AND)
df_filtered=df_subset[(df_subset['age']>=30) & (df_subset['age']<=50)]

Check the contents of the new dataset:

df_filtered.head()

The output is as follows:

Figure 4.71: Contents of new DataFrame

16. Find the shape of the filtered DataFrame and specify the index of the tuple as 0 to return the first element:

answer_1=df_filtered.shape[0]
answer_1

The output is as follows:

1630

17. Print the number of people aged between 30 and 50 using the following command:

print("There are {} people of age between 30 and 50 in this dataset.".format(answer_1))

The output is as follows:

There are 1630 people of age between 30 and 50 in this dataset.

18. Group the records based on occupation to find how the mean age is distributed:

df_subset.groupby('occupation').describe()['age']

The output is as follows:

Figure 4.72: DataFrame with data grouped by age and education

The code returns 79 rows × 1 columns.

19. Group by occupation and show the summary statistics of age. Find which profession has the oldest workers on average and which profession has the largest share of its workforce above the 75th percentile:

df_subset.groupby('occupation').describe()['age']

The output is as follows:

Figure 4.73: DataFrame showing summary statistics of age

Is there a particular occupation group that has very low representation? Perhaps we should remove those pieces of data because, with very little data, the group won't be useful in analysis. Actually, just by looking at the preceding table, you should be able to see that the Armed-Forces group has a count of only 9, that is, 9 data points. But how can we detect this? By plotting the count column in a bar chart. Note how the first argument to the barh function is the index of the DataFrame, which is the summary stats of the occupation groups. We can see that the Armed-Forces group has almost no data. This exercise teaches you that, sometimes, the outlier is not just a value, but can be a whole group. The data of this group is fine, but it is too small to be useful for any analysis. So, it can be treated as an outlier in this case. But always use your business knowledge and engineering judgement for such outlier detection and for deciding how to process it.

20. Use subset and groupby to find the outliers:

occupation_stats= df_subset.groupby('occupation').describe()['age']

21. Plot the values on a bar chart:

plt.figure(figsize=(15,8))
plt.barh(y=occupation_stats.index, width=occupation_stats['count'])
plt.yticks(fontsize=13)
plt.show()

The output is as follows:

Figure 4.74: Bar chart displaying occupation statistics

22. Practice merging by common keys. Suppose you are given two datasets where the common key is occupation. First, create two such disjoint datasets by taking random samples from the full dataset and then try merging. Include at least two other columns, along with the common key column, for each dataset. Notice how the resulting dataset, after merging, may have more data points than either of the two starting datasets if your common key is not unique:

df_1 = df[['age', 'workclass', 'occupation']].sample(5,random_state=101)
df_1.head()

The output is as follows:

Figure 4.75: Output after merging the common keys

The second dataset is as follows:

df_2 = df[['education', 'occupation']].sample(5,random_state=101)
df_2.head()

The output is as follows:

Figure 4.76: Output after merging the common keys

Merging the two datasets together:

df_merged = pd.merge(df_1,df_2, on='occupation', how='inner').drop_duplicates()
df_merged

The output is as follows:

Figure 4.77: Output of distinct occupation values

SOLUTION OF ACTIVITY 7: READING TABULAR DATA FROM A WEB PAGE AND CREATING DATAFRAMES
These are the steps to complete this activity:

1. Import BeautifulSoup and pandas by using the following command:

from bs4 import BeautifulSoup
import pandas as pd

2. Open the Wikipedia file by using the following command:

fd = open("List of countries by GDP (nominal) - Wikipedia.htm", "r")
soup = BeautifulSoup(fd)
fd.close()

3. Count the tables by using the following command:

all_tables = soup.find_all("table")
print("Total number of tables are {} ".format(len(all_tables)))

There are 9 tables in total.

4. Find the right table using the class attribute by using the following command:

data_table = soup.find("table", {"class": '"wikitable"|}'})
print(type(data_table))

The output is as follows:

<class 'bs4.element.Tag'>

5. Separate the sources and the actual data by using the following command:

sources = data_table.tbody.findAll('tr', recursive=False)[0]
sources_list = [td for td in sources.findAll('td')]
print(len(sources_list))

The output is as follows:

3

6. Use the findAll function to find the data from the data_table's body tag, using the following command:

data = data_table.tbody.findAll('tr', recursive=False)[1].findAll('td', recursive=False)

7. Use the findAll function to find the data from the data_table td tags by using the following command:

data_tables = []
for td in data:
    data_tables.append(td.findAll('table'))

8. Find the length of data_tables by using the following command:

len(data_tables)

The output is as follows:

3

9. Check how to get the source names by using the following command:

source_names = [source.findAll('a')[0].getText() for source in sources_list]
print(source_names)

The output is as follows:

['International Monetary Fund', 'World Bank', 'United Nations']

10. Separate the header and data for the first source:

header1 = [th.getText().strip() for th in data_tables[0][0].findAll('thead')[0].findAll('th')]
header1

The output is as follows:

['Rank', 'Country', 'GDP(US$MM)']

11. Find the rows from data_tables using findAll:

rows1 = data_tables[0][0].findAll('tbody')[0].findAll('tr')[1:]

12. Find the data from rows1 using the strip function for each td tag:

data_rows1 = [[td.get_text().strip() for td in tr.findAll('td')] for tr in rows1]

13. Create the DataFrame:

df1 = pd.DataFrame(data_rows1, columns=header1)
df1.head()

The output is as follows:

Figure 5.35: DataFrame created from the web page

14. Do the same for the other two sources by using the following command:

header2 = [th.getText().strip() for th in data_tables[1][0].findAll('thead')[0].findAll('th')]
header2

The output is as follows:

['Rank', 'Country', 'GDP(US$MM)']

15. Find the rows from data_tables using findAll by using the following command:

rows2 = data_tables[1][0].findAll('tbody')[0].findAll('tr')[1:]

16. Define find_right_text using the strip function by using the following command:

def find_right_text(i, td):
    if i == 0:
        return td.getText().strip()
    elif i == 1:
        return td.getText().strip()
    else:
        index = td.text.find("♠")
        return td.text[index+1:].strip()

17. Find the rows from data_rows using find_right_text by using the following command:

data_rows2 = [[find_right_text(i, td) for i, td in enumerate(tr.findAll('td'))] for tr in rows2]

18. Calculate the df2 DataFrame by using the following command:

df2 = pd.DataFrame(data_rows2, columns=header2)
df2.head()

The output is as follows:

Figure 5.36: Output of the DataFrame

19. Now, perform the same operations for the third DataFrame by using the following command:

header3 = [th.getText().strip() for th in data_tables[2][0].findAll('thead')[0].findAll('th')]
header3

The output is as follows:

['Rank', 'Country', 'GDP(US$MM)']

20. Find the rows from data_tables using findAll by using the following command:

rows3 = data_tables[2][0].findAll('tbody')[0].findAll('tr')[1:]

21. Find the rows from data_rows3 by using find_right_text:

data_rows3 = [[find_right_text(i, td) for i, td in enumerate(tr.findAll('td'))] for tr in rows3]

22. Calculate the df3 DataFrame by using the following command:

df3 = pd.DataFrame(data_rows3, columns=header3)
df3.head()

The output is as follows:

Figure 5.37: The third DataFrame

SOLUTION OF ACTIVITY 8: HANDLING OUTLIERS AND MISSING DATA
These are the steps to complete this activity:

1. Load the necessary libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

2. Read the .csv file:

df = pd.read_csv("visit_data.csv")

3. Print the data from the DataFrame:

df.head()

The output is as follows:

Figure 6.10: The contents of the CSV file

As we can see, there is data where some values are missing, and if we examine this, we will see some outliers.

4. Check for duplicates by using the following command:

print("First name is duplicated - {}".format(any(df.first_name.duplicated())))
print("Last name is duplicated - {}".format(any(df.last_name.duplicated())))
print("Email is duplicated - {}".format(any(df.email.duplicated())))

The output is as follows:

First name is duplicated - True
Last name is duplicated - True
Email is duplicated - False

There are duplicates in both the first and last names, which is normal. However, as we can see, there are no duplicates in email. That's good.

5. Check if any essential column contains NaN:

# Notice that we have different ways to format boolean values for the % operator
print("The column Email contains NaN - %r " % df.email.isnull().values.any())
print("The column IP Address contains NaN - %s " % df.ip_address.isnull().values.any())
print("The column Visit contains NaN - %s " % df.visit.isnull().values.any())

The output is as follows:

The column Email contains NaN - False
The column IP Address contains NaN - False
The column Visit contains NaN - True

The column visit contains some None values. Given that the final task at hand will probably be predicting the number of visits, we cannot do anything with rows that do not have that information. They are a type of outlier. Let's get rid of them.

6. Get rid of the outliers:

# There are various ways to do this. This is just one way. We encourage you to explore other ways.
# But before that we need to store the previous size of the dataset and we will compare it with the new size
size_prev = df.shape
df = df[np.isfinite(df['visit'])]  # After this operation, the original DataFrame is lost.
size_after = df.shape
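As an aside (not part of the book's solution), the same rows could also be dropped with pandas' own missing-data helper; a minimal sketch would be:

# Equivalent filtering using dropna on the 'visit' column
df = df.dropna(subset=['visit'])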

7. Report the size difference:

# Notice how a parameterized format is used and the indexing is done inside the quote marks
print("The size of previous data was - {prev[0]} rows and the size of the new one is - {after[0]} rows".
      format(prev=size_prev, after=size_after))

The output is as follows:

The size of previous data was - 1000 rows and the size of the new one is - 974 rows

8. Plot a boxplot to find out if the data has outliers:

plt.boxplot(df.visit, notch=True)

The output is as follows:

{'whiskers': [<matplotlib.lines.Line2D at 0x7fa04cc08668>,
  <matplotlib.lines.Line2D at 0x7fa04cc08b00>],
 'caps': [<matplotlib.lines.Line2D at 0x7fa04cc08f28>,
  <matplotlib.lines.Line2D at 0x7fa04cc11390>],
 'boxes': [<matplotlib.lines.Line2D at 0x7fa04cc08518>],
 'medians': [<matplotlib.lines.Line2D at 0x7fa04cc117b8>],
 'fliers': [<matplotlib.lines.Line2D at 0x7fa04cc11be0>],
 'means': []}

The boxplot is as follows:

Figure 6.43: Boxplot using the data

As we can see, we have data in this column in the interval (0, 3000). However, the main concentration of the data is between ~700 and ~2300.

9. Get rid of values beyond 2900 and below 100 – these are outliers for us. We need to get rid of them:

df1 = df[(df['visit'] <= 2900) & (df['visit'] >= 100)]  # Notice the powerful & operator

# Here we abuse the fact that the number of variables can be greater than the number of replacement targets
print("After getting rid of outliers the new size of the data is - {}".format(*df1.shape))

After getting rid of the outliers, the new size of the data is 923.

This is the end of the activity for this chapter.

SOLUTION OF ACTIVITY 9: EXTRACTING THE TOP 100 EBOOKS FROM GUTENBERG
These are the steps to complete this activity:

1. Import the necessary libraries, including regex and beautifulsoup:

import urllib.request, urllib.parse, urllib.error
import requests
from bs4 import BeautifulSoup
import ssl
import re

2. Check the SSL certificate:

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

3. Read the HTML from the URL:

# Read the HTML from the URL and pass on to BeautifulSoup
top100url = 'https://www.gutenberg.org/browse/scores/top'
response = requests.get(top100url)

4. Write a small function to check the status of the web request:

def status_check(r):
    if r.status_code==200:
        print("Success!")
        return 1
    else:
        print("Failed!")
        return -1

5. Check the status of response:

status_check(response)

The output is as follows:

Success!

6. Decode the response and pass it on to BeautifulSoup for HTML parsing:

contents = response.content.decode(response.encoding)
soup = BeautifulSoup(contents, 'html.parser')

7. Find all the href tags and store them in the list of links. Check what the list looks like – print the first 30 elements:

# Empty list to hold all the http links in the HTML page
lst_links=[]

# Find all the href tags and store them in the list of links
for link in soup.find_all('a'):
    #print(link.get('href'))
    lst_links.append(link.get('href'))

8. Print the links by using the following command:

lst_links[:30]

The output is as follows:

['/wiki/Main_Page', '/catalog/', '/ebooks/', '/browse/recent/last1', '/browse/scores/top',
 '/wiki/Gutenberg:Offline_Catalogs', '/catalog/world/mybookmarks', '/wiki/Main_Page',
 'https://www.paypal.com/xclick/business=donate%40gutenberg.org&item_name=Donation+to+Project+Gutenberg',
 '/wiki/Gutenberg:Project_Gutenberg_Needs_Your_Donation', 'http://www.ibiblio.org', 'http://www.pgdp.net/',
 'pretty-pictures', '#books-last1', '#authors-last1', '#books-last7', '#authors-last7',
 '#books-last30', '#authors-last30', '/ebooks/1342', '/ebooks/84', '/ebooks/1080', '/ebooks/46',
 '/ebooks/219', '/ebooks/2542', '/ebooks/98', '/ebooks/345', '/ebooks/2701', '/ebooks/844', '/ebooks/11']

9. Use a regular expression to find the numeric digits in these links. These are the file numbers for the top 100 books. Initialize the empty list to hold the file numbers:

booknum=[]

10. Numbers 19 to 118 in the original list of links have the top 100 eBooks' numbers. Loop over the appropriate range and use a regex to find the numeric digits in the link (href) string. Use the findall() method:

for i in range(19,119):
    link=lst_links[i]
    link=link.strip()
    # Regular expression to find the numeric digits in the link (href) string
    n=re.findall('[0-9]+',link)
    if len(n)==1:
        # Append the file number cast as an integer
        booknum.append(int(n[0]))

11. Print the file numbers:

print ("\nThe file numbers for the top 100 ebooks on Gutenberg are shown below\n"+"-"*70)
print(booknum)

The output is as follows:

The file numbers for the top 100 ebooks on Gutenberg are shown below
----------------------------------------------------------------------

[1342, 84, 1080, 46, 219, 2542, 98, 345, 2701, 844, 11, 5200, 43, 16328, 76, 74, 1952, 6130, 2591,
1661, 41, 174, 23, 1260, 1497, 408, 3207, 1400, 30254, 58271, 1232, 25344, 58269, 158, 44881, 1322,
205, 2554, 1184, 2600, 120, 16, 58276, 5740, 34901, 28054, 829, 33, 2814, 4300, 100, 55, 160, 1404,
786, 58267, 3600, 19942, 8800, 514, 244, 2500, 2852, 135, 768, 58263, 1251, 3825, 779, 58262, 203,
730, 20203, 35, 1250, 45, 161, 30360, 7370, 58274, 209, 27827, 58256, 33283, 4363, 375, 996, 58270,
521, 58268, 36, 815, 1934, 3296, 58279, 105, 2148, 932, 1064, 13415]

12. What does the soup object's text look like? Use the .text method and print only the first 2,000 characters (do not print the whole thing, as it is too long).

You will notice a lot of empty spaces/blanks here and there. Ignore them. They are part of the HTML page's markup and its whimsical nature:

print(soup.text[:2000])

The output is as follows:

if (top != self) {
    top.location.replace (http://www.gutenberg.org);
    alert ('Project Gutenberg is a FREE service with NO membership required. If you paid somebody else to get here, make them give you your money back!');

Top 100 - Project Gutenberg
Online Book Catalog
Book Search
-- Recent Books
-- Top 100
-- Offline Catalogs
-- My Bookmarks
Main Page
Pretty Pictures
Top 100 EBooks yesterday —
Top 100 Authors yesterday —
Top 100 EBooks last 7 days —
Top 100 Authors last 7 days —
Top 100 EBooks last 30 days —
Top 100 Authors last 30 days
Top 100 EBooks yesterday
Pride and Prejudice by Jane Austen (1826)
Frankenstein; Or, The Modern Prometheus by Mary Wollstonecraft Shelley (1367)
A Modest Proposal by Jonathan Swift (1020)
A Christmas Carol in Prose; Being a Ghost Story of Christmas by Charles Dickens (953)
Heart of Darkness by Joseph Conrad (887)
Et dukkehjem. English by Henrik Ibsen (761)
A Tale of Two Cities by Charles Dickens (741)
Dracula by Bram Stoker (732)
Moby Dick; Or, The Whale by Herman Melville (651)
The Importance of Being Earnest: A Trivial Comedy for Serious People by Oscar Wilde (646)
Alice's Adventures in Wonderland by Lewis Carrol

13. Search the extracted text (using a regular expression) from the soup object to find the names of the top 100 eBooks (yesterday's rank):

# Temp empty list of Ebook names
lst_titles_temp=[]

14. Create a starting index. It should point at the text Top 100 EBooks yesterday. Use the splitlines method of soup.text. It splits the lines of the text of the soup object:

start_idx=soup.text.splitlines().index('Top 100 EBooks yesterday')

15. Loop 1-100 to add the strings of the next 100 lines to this temporary list. Hint: use the splitlines method:

for i in range(100):
    lst_titles_temp.append(soup.text.splitlines()[start_idx+2+i])

16. Use a regular expression to extract only text from the name strings and append them to an empty list. Use match and span to find the indices and use them:

lst_titles=[]
for i in range(100):
    id1,id2=re.match('^[a-zA-Z ]*',lst_titles_temp[i]).span()
    lst_titles.append(lst_titles_temp[i][id1:id2])

17. Print the list of titles:

for l in lst_titles:
    print(l)

The output is as follows:

Pride and Prejudice by Jane Austen
Frankenstein
A Modest Proposal by Jonathan Swift
A Christmas Carol in Prose
Heart of Darkness by Joseph Conrad
Et dukkehjem
A Tale of Two Cities by Charles Dickens
Dracula by Bram Stoker
Moby Dick
The Importance of Being Earnest
Alice
Metamorphosis by Franz Kafka
The Strange Case of Dr
Beowulf
The Russian Army and the Japanese War
Calculus Made Easy by Silvanus P
Beyond Good and Evil by Friedrich Wilhelm Nietzsche
An Occurrence at Owl Creek Bridge by Ambrose Bierce
Don Quixote by Miguel de Cervantes Saavedra
Blue Jackets by Edward Greey
The Life and Adventures of Robinson Crusoe by Daniel Defoe
The Waterloo Campaign
The War of the Worlds by H
Democracy in America
Songs of Innocence
The Confessions of St
Modern French Masters by Marie Van Vorst
Persuasion by Jane Austen
The Works of Edgar Allan Poe
The Fall of the House of Usher by Edgar Allan Poe
The Masque of the Red Death by Edgar Allan Poe
The Lady with the Dog and Other Stories by Anton Pavlovich Chekhov

SOLUTION OF ACTIVITY 10: EXTRACTING THE TOP 100 EBOOKS FROM GUTENBERG.ORG
These are the steps to complete this activity:

1. Import urllib.request, urllib.parse, urllib.error, and json:

import urllib.request, urllib.parse, urllib.error
import json

2. Load the secret API key (you have to get one from the OMDB website and use that; it has a 1,000 daily limit) from a JSON file, stored in the same folder, into a variable, by using json.loads():

Note

The following cell will not be executed in the solution notebook because the author cannot give out their private API key.

3. The students/users/instructors will need to obtain a key and store it in a JSON file. We are calling this file APIkeys.json.
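As an illustrative sketch (the key value shown is only a placeholder, and the key name OMDBapi matches the one read in the next step), the APIkeys.json file could be created once like this:

# Run once to create APIkeys.json with your own key (the value below is a placeholder)
import json

with open('APIkeys.json', 'w') as f:
    json.dump({'OMDBapi': 'your_secret_api_key_here'}, f)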

4. Open the APIkeys.json file by using the following command:

with open('APIkeys.json') as f:
    keys = json.load(f)
    omdbapi = keys['OMDBapi']

The final URL to be passed should look like this: http://www.omdbapi.com/?t=movie_name&apikey=secretapikey.

5. Assign the OMDB portal (http://www.omdbapi.com/?) as a string to a variable called serviceurl by using the following command:

serviceurl = 'http://www.omdbapi.com/?'

6. Create a variable called apikey with the last portion of the URL (&apikey=secretapikey), where secretapikey is your own API key. The movie name portion is t=movie_name, and will be addressed later:

apikey = '&apikey='+omdbapi

7. Write a utility function called print_json to print the movie data from a JSON file (which we will get from the portal). Here are the keys of a JSON file: 'Title', 'Year', 'Rated', 'Released', 'Runtime', 'Genre', 'Director', 'Writer', 'Actors', 'Plot', 'Language', 'Country', 'Awards', 'Ratings', 'Metascore', 'imdbRating', 'imdbVotes', and 'imdbID':

def print_json(json_data):
    list_keys=['Title', 'Year', 'Rated', 'Released', 'Runtime', 'Genre', 'Director', 'Writer',
               'Actors', 'Plot', 'Language', 'Country', 'Awards', 'Ratings', 'Metascore',
               'imdbRating', 'imdbVotes', 'imdbID']
    print("-"*50)
    for k in list_keys:
        if k in list(json_data.keys()):
            print(f"{k}: {json_data[k]}")
    print("-"*50)

8. Write a utility function to download a poster of the movie based on the information from the JSON dataset and save it in your local folder. Use the os module. The poster data is stored in the JSON key Poster. You may want to split the name of the Poster file and extract the file extension only. Let's say that the extension is jpg. We would later join this extension to the movie name and create a filename such as movie.jpg. Use the Python command open to open a file and write the poster data. Close the file after you're done. This function may not return anything. It just saves the poster data as an image file:

def save_poster(json_data):
    import os
    title = json_data['Title']
    poster_url = json_data['Poster']
    # Splits the poster url by '.' and picks up the last string as file extension
    poster_file_extension=poster_url.split('.')[-1]
    # Reads the image file from web
    poster_data = urllib.request.urlopen(poster_url).read()
    savelocation=os.getcwd()+'\\'+'Posters'+'\\'
    # Creates new directory if the directory does not exist. Otherwise, just use the existing path.
    if not os.path.isdir(savelocation):
        os.mkdir(savelocation)
    filename=savelocation+str(title)+'.'+poster_file_extension
    f=open(filename,'wb')
    f.write(poster_data)
    f.close()

9. Write a utility function called search_movie to search for a movie by its name, print the downloaded JSON data (use the print_json function for this), and save the movie poster in the local folder (use the save_poster function for this). Use a try-except block for this, that is, try to connect to the web portal. If successful, proceed, but if not (that is, if an exception is raised), then just print an error message. Use the previously created variables serviceurl and apikey. You have to pass on a dictionary with a key, t, and the movie name as the corresponding value to the urllib.parse.urlencode function and then add the serviceurl and apikey to the output of the function to construct the full URL. This URL will be used for accessing the data. The JSON data has a key called Response. If it is True, that means that the read was successful. Check this before processing the data. If it was not successful, then print the JSON key Error, which will contain the appropriate error message that's returned by the movie database:

def search_movie(title):
    try:
        url = serviceurl + urllib.parse.urlencode({'t': str(title)})+apikey
        print(f'Retrieving the data of "{title}" now... ')
        print(url)
        uh = urllib.request.urlopen(url)
        data = uh.read()
        json_data=json.loads(data)
        if json_data['Response']=='True':
            print_json(json_data)
            # Downloads the poster of the movie if one is available
            if json_data['Poster']!='N/A':
                save_poster(json_data)
        else:
            print("Error encountered: ",json_data['Error'])
    except urllib.error.URLError as e:
        print(f"ERROR: {e.reason}")

10. Test the search_movie function by entering Titanic:

search_movie("Titanic")

The following is the retrieved data for Titanic:

http://www.omdbapi.com/?t=Titanic&apikey=17cdc959
--------------------------------------------------
Title: Titanic
Year: 1997
Rated: PG-13
Released: 19 Dec 1997
Runtime: 194 min
Genre: Drama, Romance
Director: James Cameron
Writer: James Cameron
Actors: Leonardo DiCaprio, Kate Winslet, Billy Zane, Kathy Bates
Plot: A seventeen-year-old aristocrat falls in love with a kind but poor artist aboard the luxurious, ill-fated R.M.S. Titanic.
Language: English, Swedish
Country: USA
Awards: Won 11 Oscars. Another 111 wins & 77 nominations.
Ratings: [{'Source': 'Internet Movie Database', 'Value': '7.8/10'}, {'Source': 'Rotten Tomatoes', 'Value': '89%'}, {'Source': 'Metacritic', 'Value': '75/100'}]
Metascore: 75
imdbRating: 7.8
imdbVotes: 913,780
imdbID: tt0120338
--------------------------------------------------

11. Test the search_movie function by entering "Random_error" (obviously, this will not be found, and you should be able to check whether your error-catching code is working properly):

search_movie("Random_error")

The retrieved data for "Random_error" is as follows:

http://www.omdbapi.com/?t=Random_error&apikey=17cdc959
Error encountered:  Movie not found!

Look for a folder called Posters in the same directory you are working in. It should contain a file called Titanic.jpg. Check the file.

SOLUTION OF ACTIVITY 11: RETRIEVING DATA CORRECTLY FROM DATABASES
These are the steps to complete this activity:

1. Connect to the supplied petsDB database:

import sqlite3
conn = sqlite3.connect("petsdb")

2. Write a function to check whether the connection has been successful:

# a tiny function to make sure the connection is successful
def is_opened(conn):
    try:
        conn.execute("SELECT * FROM persons LIMIT 1")
        return True
    except sqlite3.ProgrammingError as e:
        print("Connection closed {}".format(e))
        return False

print(is_opened(conn))

The output is as follows:

True

3. Close the connection:

conn.close()

4. Check whether the connection is open or closed:

print(is_opened(conn))

The output is as follows:

False

5. Find out the different age groups in the persons database. Connect to the supplied petsDB database:

conn = sqlite3.connect("petsdb")
c = conn.cursor()

6. Execute the following command:

for ppl, age in c.execute("SELECT count(*), age FROM persons GROUP BY age"):
    print("We have {} people aged {}".format(ppl, age))

The output is as follows:

Figure 8.17: Section of output grouped by age

7. To find out which age group has the highest number of people, execute the following command:

for ppl, age in c.execute(
    "SELECT count(*), age FROM persons GROUP BY age ORDER BY count(*) DESC"):
    print("Highest number of people is {} and came from {} age group".format(ppl, age))
    break

The output is as follows:

Highest number of people is 5 and came from 73 age group

8. To find out how many people do not have a full name (the last name is blank/null), execute the following command:

res = c.execute("SELECT count(*) FROM persons WHERE last_name IS null")
for row in res:
    print(row)

The output is as follows:

(60,)

9. To find out how many people have more than one pet, execute the following command:

res = c.execute("SELECT count(*) FROM (SELECT count(owner_id) FROM pets GROUP BY owner_id HAVING count(owner_id) >1)")
for row in res:
    print("{} people have more than one pet".format(row[0]))

The output is as follows:

43 people have more than one pet

10. To find out how many pets have received treatment, execute the following command:

res = c.execute("SELECT count(*) FROM pets WHERE treatment_done=1")
for row in res:
    print(row)

The output is as follows:

(36,)

11. To find out how many pets have received treatment and whose pet type is known, execute the following command:

res = c.execute("SELECT count(*) FROM pets WHERE treatment_done=1 AND pet_type IS NOT null")
for row in res:
    print(row)

The output is as follows:

(16,)

12. To find out how many pets are from the city called "east port", execute the following command:

res = c.execute("SELECT count(*) FROM pets JOIN persons ON pets.owner_id = persons.id WHERE persons.city='east port'")
for row in res:
    print(row)

The output is as follows:

(49,)

13. To find out how many pets are from the city called "east port" and received treatment, execute the following command:

res = c.execute("SELECT count(*) FROM pets JOIN persons ON pets.owner_id = persons.id WHERE persons.city='east port' AND pets.treatment_done=1")
for row in res:
    print(row)

The output is as follows:

(11,)

SOLUTION OF ACTIVITY 12: DATA WRANGLING TASK – FIXING UN DATA
These are the steps to complete this activity:

1. Import the required libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

2. Save the URL of the dataset and use the pandas read_csv method to directly pass this link and create a DataFrame:

education_data_link="http://data.un.org/_Docs/SYB/CSV/SYB61_T07_Education.csv"
df1 = pd.read_csv(education_data_link)

3. Print the data in the DataFrame:

df1.head()

The output is as follows:

Figure 9.3: DataFrame from the UN data

4. As the first row does not contain useful information, use the skiprows parameter to remove the first row:

df1 = pd.read_csv(education_data_link,skiprows=1)

5. Print the data in the DataFrame:

df1.head()

The output is as follows:

Figure 9.4: DataFrame after removing the first row

6. Drop the columns Region/Country/Area and Source, as they will not be very helpful:

df2 = df1.drop(['Region/Country/Area','Source'],axis=1)

7. Assign the following names as the columns of the DataFrame: ['Region/Country/Area', 'Year', 'Data', 'Enrollments (Thousands)', 'Footnotes']:

df2.columns=['Region/Country/Area','Year','Data','Enrollments (Thousands)','Footnotes']

8. Print the data in the DataFrame:

df2.head()

The output is as follows:

Figure 9.5: DataFrame after dropping the Region/Country/Area and Source columns

9. Check how many unique values the Footnotes column contains:

df2['Footnotes'].unique()

The output is as follows:

Figure 9.6: Unique values of the Footnotes column

10. Check the type of the Enrollments (Thousands) column's data; it will need to be converted into a numeric type for further processing:

type(df2['Enrollments (Thousands)'][0])

The output is as follows:

str

11. Create a utility function to convert the strings in the Enrollments (Thousands) column into floating-point numbers:

def to_numeric(val):
    """
    Converts a given string (with one or more commas) to a numeric value
    """
    if ',' not in str(val):
        result = float(val)
    else:
        val=str(val)
        val=''.join(str(val).split(','))
        result=float(val)
    return result
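As an aside (not part of the book's solution), pandas can perform the same comma-stripping conversion in a vectorized way; a minimal sketch would be:

# Equivalent vectorized conversion: remove the commas, then cast to numeric
df2['Enrollments (Thousands)'] = pd.to_numeric(
    df2['Enrollments (Thousands)'].astype(str).str.replace(',', ''))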

1 2 . Use th e apply m eth od to


apply th is fu nction to th e
Value colu m n data:

df2['Enrollments
(Thousands)']=df2['Enr
ollments
(Thousands)'].apply(to
_numeric)
1 3 . Pr int th e u niqu e ty pes of
data in th e Data colu m n:

df2['Data'].unique()

Th e ou tpu t is as follow s:

Figure 9.7:Unique values in a column

1 4 . Cr eate th r ee DataFr am es by
filter ing and selecting th em
fr om th e or iginal
DataFr am e:
1 . df_primary :
Only stu dents
enr olled in
pr im ar y
edu cation
(th ou sands)

2 . df_secondary :
Only stu dents
enr olled in
secondar y
edu cation
(th ou sands)

3 . df_t ert iary :


Only stu dents
enr olled in
ter tiar y
edu cation
(th ou sands):

df_primary =
df2[df2['Data
']=='Students
enrolled in
primary
education
(thousands)']
df_secondary
=
df2[df2['Data
']=='Students
enrolled in
secondary
education
(thousands)']

df_tertiary =
df2[df2['Data
']=='Students
enrolled in
tertiary
education
(thousands)']

1 5. Com par e th em u sing bar


ch ar ts of th e pr im ar y
stu dents' enr ollm ent of a low -
incom e cou ntr y and a h igh -
incom e cou ntr y :

primary_enrollment_ind
ia =
df_primary[df_primary[
'Region/Country/Area']
=='India']

primary_enrollment_USA
=
df_primary[df_primary[
'Region/Country/Area']
=='United States of
America']

1 6 . Pr int th e
primary_enrollment_india
data:

primary_enrollment_ind
ia

Th e ou tpu t is as follow s:
Figure 9.8: Data for the enrollment in
primary education in India

1 7 . Pr int th e
primary_enrollment_USA
data:

primary_enrollment_USA

Th e ou tpu t is as follow s:

Figure 9.9: Data for the enrollment in


primary education in USA

1 8. Plot th e data for India:

plt.figure(figsize=
(8,4))

plt.bar(primary_enroll
ment_india['Year'],pri
mary_enrollment_india[
'Enrollments
(Thousands)'])

plt.title("Enrollment
in primary
education\nin India
(in
thousands)",fontsize=1
6)

plt.grid(True)

plt.xticks(fontsize=14
)
plt.yticks(fontsize=14
)

plt.xlabel("Year",
fontsize=15)

plt.show()

Th e ou tpu t is as follow s:

Figure 9.10: Bar plot for the enrollment in


primary education in India

1 9 . Plot th e data for th e USA :

plt.figure(figsize=
(8,4))

plt.bar(primary_enroll
ment_USA['Year'],prima
ry_enrollment_USA['Enr
ollments
(Thousands)'])

plt.title("Enrollment
in primary
education\nin the
United States of
America (in
thousands)",fontsize=1
6)

plt.grid(True)

plt.xticks(fontsize=14
)
plt.yticks(fontsize=14
)

plt.xlabel("Year",
fontsize=15)

plt.show()

Th e ou tpu t is as follow s:

Figure 9.11: Bar plot for the enrollment in


primary education in the USA

Data im pu tation: Clear ly ,


w e ar e m issing som e data.
Let's say w e decide to im pu te
th ese data points by sim ple
linear inter polation betw een
th e av ailable data points. We
can take ou t a pen and paper
or a calcu lator and com pu te
th ose v alu es and m anu ally
cr eate a dataset som eh ow .
Bu t being a data w r angler ,
w e w ill of cou r se take
adv antage of Py th on
pr ogr am m ing, and u se
pandas im pu tation m eth ods
for th is task. Bu t to do th at,
w e fir st need to cr eate a
DataFr am e w ith m issing
v alu es inser ted – th at is, w e
need to append anoth er
DataFr am e w ith m issing
v alu es to th e cu r r ent
DataFr am e.

(For India) Append t he


rows corresponding t o
missing t he y ears – 2004
- 2009, 2011 – 2013.

2 0. Find th e m issing y ear s:

missing_years = [y for
y in
range(2004,2010)]+[y
for y in
range(2011,2014)]

2 1 . Pr int th e v alu e in th e
missing_years variable:

missing_years

Th e ou tpu t is as follow s:

[2004, 2005, 2006,


2007, 2008, 2009,
2011, 2012, 2013]

2 2 . Cr eate a dictionar y of v alu es


w ith np.nan. Note th at th er e
ar e 9 m issing data points, so
w e need to cr eate a list w ith
identical v alu es r epeated 9
tim es:

dict_missing =
{'Region/Country/Area'
:
['India']*9,'Year':mis
sing_years,

'Data':'Students
enrolled in primary
education
(thousands)'*9,

'Enrollments
(Thousands)':
[np.nan]*9,'Footnotes'
:[np.nan]*9}

2 3 . Cr eate a DataFr am e of
m issing v alu es (fr om th e
pr eceding dictionar y ) th at
w e can append:
df_missing =
pd.DataFrame(data=dict
_missing)

2 4 . A ppend th e new DataFr am es


to pr ev iou sly existing ones:

primary_enrollment_ind
ia=primary_enrollment_
india.append(df_missin
g,ignore_index=True,so
rt=True)

2 5. Pr int th e data in
primary_enrollment_india
:

primary_enrollment_ind
ia

Th e ou tpu t is as follow s:

Figure 9.12: Data for the enrollment in


primary education in India a er
appending the data

2 6 . Sor t by year and r eset th e


indices u sing reset_index.
Use inplace=True to execu te
th e ch anges on th e
DataFr am e itself:

primary_enrollment_ind
ia.sort_values(by='Yea
r',inplace=True)

primary_enrollment_ind
ia.reset_index(inplace
=True,drop=True)

2 7 . Pr int th e data in
primary_enrollment_india
:

primary_enrollment_ind
ia

Th e ou tpu t is as follow s:

Figure 9.13: Data for the enrollment in


primary education in India a er sorting
the data

2 8. Use th e interpolate m eth od


for linear inter polation. It
fills all th e NaN by linear ly
inter polated v alu es. Ch eck
ou t th is link for m or e details
abou t th is m eth od:
h ttp://pandas.py data.or g/p
andas-
docs/v er sion/0.1 7 /gener ate
d/pandas.DataFr am e.inter p
olate.h tm l:

primary_enrollment_ind
ia.interpolate(inplace
=True)

2 9 . Pr int th e data in
primary_enrollment_india
:

primary_enrollment_ind
ia

Th e ou tpu t is as follow s:

Figure 9.14: Data for the enrollment in


primary education in India a er
interpolating the data

3 0. Plot th e data:

plt.figure(figsize=
(8,4))
plt.bar(primary_enroll
ment_india['Year'],pri
mary_enrollment_india[
'Enrollments
(Thousands)'])

plt.title("Enrollment
in primary
education\nin India
(in
thousands)",fontsize=1
6)

plt.grid(True)

plt.xticks(fontsize=14
)

plt.yticks(fontsize=14
)

plt.xlabel("Year",
fontsize=15)

plt.show()

Th e ou tpu t is as follow s:

Figure 9.15: Bar plot for the enrollment in


primary education in India

3 1 . Repeat th e sam e steps for th e


USA :

missing_years =
[2004]+[y for y in
range(2006,2010)]+[y
for y in
range(2011,2014)]+
[2016]

3 2 . Pr int th e v alu e in
missing_years.

missing_years

Th e ou tpu t is as follow s:

[2004, 2006, 2007,


2008, 2009, 2011,
2012, 2013, 2016]

3 3 . Cr eate dict_missing, as
follow s:

dict_missing =
{'Region/Country/Area'
:['United States of
America']*9,'Year':mis
sing_years,
'Data':'Students
enrolled in primary
education
(thousands)'*9,
'Value':
[np.nan]*9,'Footnotes'
:[np.nan]*9}

3 4 . Cr eate th e DataFr am e fpr


df_missing, as follow s:

df_missing =
pd.DataFrame(data=dict
_missing)

3 5. A ppend th is to th e
primary_enrollment_USA
v ar iable, as follow s:

primary_enrollment_USA
=primary_enrollment_US
A.append(df_missing,ig
nore_index=True,sort=T
rue)

3 6 . Sor t th e v alu es in th e
primary_enrollment_USA
v ar iable, as follow s:

primary_enrollment_USA
.sort_values(by='Year'
,inplace=True)
3 7 . Reset th e index of th e
primary_enrollment_USA
v ar iable, as follow s:

primary_enrollment_USA
.reset_index(inplace=T
rue,drop=True)

3 8. Inter polate th e
primary_enrollment_USA
v ar iable, as follow s:

primary_enrollment_USA
.interpolate(inplace=T
rue)

3 9 . Pr int th e
primary_enrollment_USA
v ar iable:

primary_enrollment_USA

Th e ou tpu t is as follow s:

Figure 9.16: Data for the enrollment in


primary education in USA a er all
operations have been completed

40. Still, the first value is unfilled. We can use the limit and limit_direction parameters of the interpolate method to fill it. How did we know this? By searching on Google and finding the relevant StackOverflow page. Always search for the solution to your problem, look at what has already been done, and try to implement it (a small standalone sketch follows this step's output):

primary_enrollment_USA.interpolate(method='linear', limit_direction='backward', limit=1)

The output is as follows:

Figure 9.17: Data for the enrollment in primary education in the USA after limiting the data
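
As a standalone illustration of what limit_direction='backward' with limit=1 does, consider this hypothetical Series, with values invented for the example:

import numpy as np
import pandas as pd

# The first entry is missing, just like the 2003 value in the USA data
s = pd.Series([np.nan, 24000.0, np.nan, 26000.0])

# limit_direction='backward' lets the gap at the start be filled from the
# value that follows it; limit=1 fills at most one consecutive NaN per gap
print(s.interpolate(method='linear', limit_direction='backward', limit=1))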

41. Print the data in primary_enrollment_USA:

primary_enrollment_USA

The output is as follows:

Figure 9.18: Data for the enrollment in primary education in the USA

42. Plot the data:

plt.figure(figsize=(8,4))

plt.bar(primary_enrollment_USA['Year'], primary_enrollment_USA['Enrollments (Thousands)'])

plt.title("Enrollment in primary education\nin the United States of America (in thousands)", fontsize=16)

plt.grid(True)

plt.xticks(fontsize=14)

plt.yticks(fontsize=14)

plt.xlabel("Year", fontsize=15)

plt.show()

The output is as follows:

Figure 9.19: Bar plot for the enrollment in primary education in the USA

ACTIVITY 13: DATA WRANGLING TASK – CLEANING GDP DATA

These are the steps to complete this activity:

1. GDP data for India: We will try to read the GDP data for India from a CSV file that was found on a World Bank portal. It is given to you and is also hosted on the Packt GitHub repository. However, the pandas read_csv method will throw an error if we try to read it normally. Let's look at a step-by-step guide on how we can read useful information from it:

df3=pd.read_csv("India_World_Bank_Info.csv")

The output is as follows:

---------------------------------------------------------------------------
ParserError Traceback (most recent call last)
<ipython-input-45-9239cae67df7> in <module>()
…..
ParserError: Error tokenizing data. C error: Expected 1 fields in line 6, saw 3

We can try and use the error_bad_lines=False option in this kind of situation.

2. Read the India World Bank Information .csv file:

df3=pd.read_csv("India_World_Bank_Info.csv", error_bad_lines=False)

df3.head(10)

The output is as follows:

Figure 9.20: DataFrame from the India World Bank Information

Note:

At times, the output may not be as expected, because some lines contain three fields instead of the expected one and are therefore skipped.
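
If you are on a recent pandas release (1.3 or later), note that error_bad_lines is deprecated and has since been removed in favor of the on_bad_lines parameter; an equivalent call on those versions would look like this:

import pandas as pd

# Equivalent on pandas >= 1.3: skip lines that have too many fields
df3 = pd.read_csv("India_World_Bank_Info.csv", on_bad_lines='skip')

df3.head(10)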

3. Clearly, the delimiter in this file is a tab (\t):

df3=pd.read_csv("India_World_Bank_Info.csv", error_bad_lines=False, delimiter='\t')

df3.head(10)

The output is as follows:

Figure 9.21: DataFrame from the India World Bank Information after using a delimiter

4. Use the skiprows parameter to skip the first 4 rows:

df3=pd.read_csv("India_World_Bank_Info.csv", error_bad_lines=False, delimiter='\t', skiprows=4)

df3.head(10)

The output is as follows:

Figure 9.22: DataFrame from the India World Bank Information after using skiprows

5. Closely examine the dataset. In this file, the columns are the yearly data and the rows are the various types of information. Upon examining the file with Excel, we find that the column Indicator Name holds the name of each particular data type. We filter the dataset down to the information we are interested in and also transpose it (the rows and columns are interchanged) to give it a format similar to our previous education dataset:

df4=df3[df3['Indicator Name']=='GDP per capita (current US$)'].T

df4.head(10)

The output is as follows:

Figure 9.23: DataFrame focusing on GDP per capita

6. There is no index, so let's use reset_index again:

df4.reset_index(inplace=True)

df4.head(10)

The output is as follows:

Figure 9.24: DataFrame from the India World Bank Information using reset_index

7. The first 3 rows aren't useful. We can redefine the DataFrame without them and then re-index again:

df4.drop([0,1,2], inplace=True)

df4.reset_index(inplace=True, drop=True)

df4.head(10)

The output is as follows:

Figure 9.25: DataFrame from the India World Bank Information after dropping and resetting the index

8. Let's rename the columns properly (this is necessary for merging, which we will look at shortly):

df4.columns=['Year','GDP']

df4.head(10)

The output is as follows:

Figure 9.26: DataFrame focusing on Year and GDP

9. It looks like we have GDP data from 1960 onward, but we are interested in 2003-2016. Let's examine the last 20 rows:

df4.tail(20)

The output is as follows:

Figure 9.27: DataFrame from the India World Bank Information

10. So, we should be good with rows 43-56. Let's create a DataFrame called df_gdp:

df_gdp=df4.iloc[[i for i in range(43,57)]]

df_gdp

The output is as follows:

Figure 9.28: DataFrame from the India World Bank Information
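
As an aside, the list comprehension inside iloc in the preceding step is not strictly needed; a plain positional slice selects the same rows. This is an optional simplification, not part of the original solution:

# Rows 43 to 56 inclusive (the stop index, 57, is exclusive);
# .copy() avoids a SettingWithCopyWarning when df_gdp is modified later
df_gdp = df4.iloc[43:57].copy()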

11. We need to reset the index again (for merging):

df_gdp.reset_index(inplace=True, drop=True)

df_gdp

The output is as follows:

Figure 9.29: DataFrame from the India World Bank Information

12. The Year column in this DataFrame is not of the int type, so it will cause problems when merging with the education DataFrame:

df_gdp['Year']

The output is as follows:

Figure 9.30: DataFrame focusing on Year

13. Use the apply method with Python's built-in int function. Ignore any warnings that are thrown:

df_gdp['Year']=df_gdp['Year'].apply(int)
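
An equivalent, arguably more idiomatic, conversion uses astype or pd.to_numeric instead of apply. This is an optional alternative to the call above:

import pandas as pd

# Vectorized conversion of the whole column to integers
df_gdp['Year'] = df_gdp['Year'].astype(int)

# Or, if some values might not parse cleanly, coerce them to NaN first
df_gdp['Year'] = pd.to_numeric(df_gdp['Year'], errors='coerce').astype('Int64')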

SOLUTION OF ACTIVITY 14: DATA WRANGLING TASK – MERGING UN DATA AND GDP DATA

These are the steps to complete this activity:

1. Now, merge the two DataFrames, that is, primary_enrollment_india and df_gdp, on the Year column:

primary_enrollment_with_gdp=primary_enrollment_india.merge(df_gdp, on='Year')

primary_enrollment_with_gdp

The output is as follows:

Figure 9.31: Merged data
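
By default, merge performs an inner join, so only the years that appear in both DataFrames survive. If you wanted to keep every year from the enrollment data even when no GDP figure is available, a left join would do that; this is an optional variation, not used in the rest of this solution:

# Keep all enrollment years; GDP becomes NaN where no matching year exists
primary_enrollment_with_gdp = primary_enrollment_india.merge(df_gdp, on='Year', how='left')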


2. Now, we can drop the Data, Footnotes, and Region/Country/Area columns:

primary_enrollment_with_gdp.drop(['Data','Footnotes','Region/Country/Area'], axis=1, inplace=True)

primary_enrollment_with_gdp

The output is as follows:

Figure 9.32: Merged data after dropping the Data, Footnotes, and Region/Country/Area columns

3. Rearrange the columns for proper viewing and presentation to a data scientist:

primary_enrollment_with_gdp = primary_enrollment_with_gdp[['Year','Enrollments (Thousands)','GDP']]

primary_enrollment_with_gdp

The output is as follows:

Figure 9.33: Merged data after rearranging the columns

4. Plot the data:

plt.figure(figsize=(8,5))

plt.title("India's GDP per capita vs primary education enrollment", fontsize=16)

plt.scatter(primary_enrollment_with_gdp['GDP'],
            primary_enrollment_with_gdp['Enrollments (Thousands)'],
            edgecolor='k', color='orange', s=200)

plt.xlabel("GDP per capita (US $)", fontsize=15)

plt.ylabel("Primary enrollment (thousands)", fontsize=15)

plt.xticks(fontsize=14)

plt.yticks(fontsize=14)

plt.grid(True)

plt.show()

The output is as follows:

Figure 9.34: Scatter plot of merged data

ACTIVITY 15: DATA WRANGLING TASK – CONNECTING THE NEW DATA TO A DATABASE

These are the steps to complete this activity:

1. Connect to a database and write values to it. We start by importing the sqlite3 module of Python and then use the connect function to connect to a database. Designate Year as the PRIMARY KEY of this table:

import sqlite3

with sqlite3.connect("Education_GDP.db") as conn:
    cursor = conn.cursor()
    cursor.execute("CREATE TABLE IF NOT EXISTS \
                    education_gdp(Year INT, Enrollment FLOAT, GDP FLOAT, PRIMARY KEY (Year))")
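
If you want to confirm that the table was actually created, a quick query against SQLite's internal sqlite_master catalog lists the tables in the file (an optional check, not part of the original steps):

import sqlite3

with sqlite3.connect("Education_GDP.db") as conn:
    cursor = conn.cursor()
    # sqlite_master holds one row per table, index, and view in the database
    cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
    print(cursor.fetchall())   # should include ('education_gdp',)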

2. Run a loop over the dataset rows one by one to insert them into the table:

with sqlite3.connect("Education_GDP.db") as conn:
    cursor = conn.cursor()
    for i in range(14):
        year = int(primary_enrollment_with_gdp.iloc[i]['Year'])
        enrollment = primary_enrollment_with_gdp.iloc[i]['Enrollments (Thousands)']
        gdp = primary_enrollment_with_gdp.iloc[i]['GDP']
        #print(year,enrollment,gdp)
        cursor.execute("INSERT INTO education_gdp (Year,Enrollment,GDP) VALUES(?,?,?)",
                       (year,enrollment,gdp))

If we look at the current folder, we should see a file called Education_GDP.db, and if we examine it using a database viewer program, we can see that the data has been transferred there. An optional bulk-insert alternative is sketched below.
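
The row-by-row loop can also be replaced by a single bulk write using pandas' to_sql method. This is only an optional sketch: note that if_exists='replace' drops and recreates the table, so the PRIMARY KEY constraint defined in step 1 is lost.

# Bulk-write the DataFrame in one call instead of looping row by row
with sqlite3.connect("Education_GDP.db") as conn:
    primary_enrollment_with_gdp.rename(
        columns={'Enrollments (Thousands)': 'Enrollment'}
    ).to_sql('education_gdp', conn, if_exists='replace', index=False)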

In these activities, we have examined a complete data wrangling flow, including reading data from the web and a local drive, filtering, cleaning, quick visualization, imputation, indexing, merging, and writing back to a database table. We also wrote custom functions to transform some of the data and saw how to handle situations where we may get errors upon reading the file.
