
Python by Example

Book 2

(Data Manipulation and Analysis)


(First Draft)

Compiled & Edited


by
Muhammad Nadeem Khokhar
(mnkhokhar@gmail.com)

August 2023
Disclaimer
This book has been created using various tools, including AI tools,
development tools, and other services. While the book's development
has involved the utilization of these tools, it is important to note that
the content has been planned and organized by the author.

Efforts have been made to ensure the accuracy and completeness
of the information presented. However, absolute correctness or
suitability for specific purposes cannot be guaranteed. Readers are
advised to exercise their own judgment and discretion when applying
the information contained in this book and are requested to share
their comments and suggestions with the author through email.

Thank you for your understanding and support.


Contents
Chapter 1: Introduction to Data Manipulation ............................................. 1
1.1 Understanding Data Manipulation and Its Importance ...................... 1
1.2 Introducing Python's Data Structures ................................................. 2
1.3 Accessing and Modifying Data Elements ............................................ 6
Chapter 2: Data Processing (Loops and Comprehensions) ......................... 12
2.1 Using Loops for Data Iteration .......................................................... 12
2.2 List Comprehensions for Efficient Data Transformations ................. 16
2.3 Dictionary Comprehensions and Set Comprehensions..................... 22
Chapter 3: NumPy: Foundation for Numerical Computing......................... 29
3.1 Introduction to NumPy and Its Key Features .................................... 29
3.2 Creating NumPy Arrays ..................................................................... 31
3.3 Array Indexing and Slicing ................................................................. 32
3.4 Array Operations (Element-wise and Broadcasting)......................... 35
Chapter 4: Data Analysis with Pandas ........................................................ 37
4.1 Getting Started with Pandas Series and DataFrames ....................... 38
4.2 Data Indexing and Selection in Pandas ............................................. 39
4.3 Data Cleaning and Handling Missing Values ..................................... 41
4.4 Data Aggregation and Grouping ....................................................... 44
Chapter 5: Data Visualization with Matplotlib............................................ 46
5.1 Introduction to Data Visualization .................................................... 46
5.2 Creating Basic Plots with Matplotlib ................................................. 48
5.3 Customizing Plots: Labels, Titles, Colors, and Styles ......................... 51
5.4 Plotting Data from NumPy Arrays and Pandas DataFrames ............. 54
Chapter 6: Advanced Data Manipulation Techniques ................................ 57
6.1 Data Merging and Joining in Pandas ................................................. 58
6.2 Reshaping Data: Pivoting, Melting, and Stack/Unstack .................... 60
6.3 Combining DataFrames with Concatenation and Appending ........... 62
Chapter 7: Working with Time Series Data ................................................. 64
7.1 Handling Time and Date Data in Python ........................................... 64
7.2 Time Series Indexing and Slicing with Pandas................................... 67
7.3 Resampling and Frequency Conversion ............................................ 68
Chapter 8: Data Analysis Case Study........................................................... 71
8.1 Analyzing Real-World Datasets with Python..................................... 71
8.2 Extracting Insights and Patterns ....................................................... 73
8.3 Presenting Findings with Visualizations ............................................ 75
Chapter 9: Large Datasets and Performance Optimization ........................ 78
9.1 Strategies for Handling Large Datasets ............................................. 78
9.2 Efficient Data Processing Techniques ............................................... 81
9.3 Performance Optimization with NumPy and Pandas ....................... 83
Chapter 10: Data Manipulation Best Practices ........................................... 85
10.1 Writing Clean and Efficient Data Manipulation Code ..................... 86
10.2 Using Pythonic Idioms and Best Practices ...................................... 89
10.3 Tips for Error Handling and Debugging ........................................... 92
Case Study: Book Library Analysis............................................................... 94
Code ........................................................................................................ 97
Step by Step Description ......................................................................... 99
Python by Example (Book 2: Data Manipulation and Analysis)

Chapter 1: Introduction to Data Manipulation


In this chapter, we'll explore the significance of data manipulation and its crucial
role in various data-driven tasks. We'll dive into Python's data structures, learn to
access and modify data elements efficiently, and acquire essential skills to reshape
datasets effectively. Whether you're new to Python or already familiar with its
basics, this chapter will equip you with the necessary tools to tackle real-world data
challenges and make informed data-driven decisions.

1.1 Understanding Data Manipulation and Its Importance

Data manipulation is the process of transforming raw data into a more structured
and usable format, making it easier to extract meaningful insights and derive
valuable information. It encompasses a wide range of operations, including cleaning,
filtering, sorting, aggregating, and transforming data. It involves modifying the
structure or content of data to meet specific requirements, making it suitable for
analysis and interpretation. Data manipulation plays a vital role in the entire data
analysis workflow, from data preprocessing and cleaning to advanced analytics and
modeling.

Data manipulation is essential for several reasons:

 Data Cleaning: Real-world datasets are often noisy and may contain missing
or inconsistent values. Data manipulation allows us to clean and preprocess
the data, ensuring its accuracy and reliability.
 Data Integration: In many scenarios, data is collected from multiple sources.
Data manipulation helps in integrating and merging data from different
sources to create a unified dataset for analysis.
 Feature Engineering: Data manipulation allows us to create new features
from existing data, which can significantly improve the performance of
machine learning models.
Compiled & Edited by Muhammad Nadeem Khokhar (mnkhokhar@gmail.com) 1|P a g e

 Data Transformation: By transforming data into a suitable format, we can
gain valuable insights, detect patterns, and make data more amenable to
statistical analysis.
 Data Aggregation: Aggregating data enables us to summarize large datasets
and extract key statistics, facilitating quick and informed decision-making.
 Data Visualization: Well-structured and manipulated data can be effectively
visualized, aiding in the communication of insights to stakeholders.
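As a small illustration of the first and fifth of these points, cleaning and aggregation can be sketched in a few lines of plain Python. The sensor-reading data below is hypothetical, with missing values recorded as None:

```python
# Hypothetical raw readings; None marks a missing value
readings = [21.5, None, 22.0, 23.1, None, 22.4]

# Cleaning: filter out the missing values
clean = [r for r in readings if r is not None]

# Aggregation: summarize the cleaned data
average = sum(clean) / len(clean)
print(clean)    # [21.5, 22.0, 23.1, 22.4]
print(average)  # 22.25
```

Later chapters perform the same steps at scale with NumPy and Pandas; the underlying idea is unchanged.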

1.2 Introducing Python's Data Structures

One of the key reasons behind Python's popularity is its rich set of data
structures, which enable efficient and organized data manipulation. In this section,
we will explore Python's essential data structures, including lists, tuples, sets, and
dictionaries. Through coding examples, we will demonstrate the versatility and
power of these data structures in various scenarios.

Lists: Lists are one of the most fundamental data structures in Python, allowing
us to store collections of items in a sequential order. Lists are versatile, as they can
hold elements of different data types and can be modified after creation.

# Creating a list of numbers
numbers = [1, 2, 3, 4, 5]

# Accessing elements in the list
print(numbers[0]) # Output: 1
print(numbers[-1]) # Output: 5

# Modifying elements in the list
numbers[2] = 10
print(numbers) # Output: [1, 2, 10, 4, 5]

# Adding elements to the list
numbers.append(6)
print(numbers) # Output: [1, 2, 10, 4, 5, 6]


# List slicing
subset = numbers[1:4]
print(subset) # Output: [2, 10, 4]

Brief description:

1. Creating a List: The list "numbers" is created using square brackets,
containing the elements 1, 2, 3, 4, and 5.
2. Accessing Elements: The code uses indexing to access specific elements
within the list. It shows how to access the first element using `numbers[0]`
(which gives the output 1) and the last element using `numbers[-1]` (which
gives the output 5).
3. Modifying Elements: The code demonstrates how to modify an element in
the list by assigning a new value to it. In this case, the element at index 2 is
changed from 3 to 10 using `numbers[2] = 10`. After modification, the list
becomes [1, 2, 10, 4, 5].
4. Adding Elements: The code appends the value 6 to the list using the
`append()` method. After the addition, the list becomes [1, 2, 10, 4, 5, 6].
5. List Slicing: List slicing is showcased by extracting a subset of elements from
the list. The code uses slicing with `numbers[1:4]` to obtain a subset of
elements from index 1 (inclusive) to index 4 (exclusive), resulting in the
output [2, 10, 4].
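Slices also accept an optional third value, a step, which the example above does not use. A quick sketch, continuing with the list as it stands after the edits above:

```python
numbers = [1, 2, 10, 4, 5, 6]

print(numbers[::2])    # every second element: [1, 10, 5]
print(numbers[::-1])   # a reversed copy: [6, 5, 4, 10, 2, 1]
print(numbers[1:5:2])  # index 1 to 4, step 2: [2, 4]
```

Note that slicing always returns a new list; the original is left untouched.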

Tuples: Tuples are similar to lists, but they are immutable, meaning their
elements cannot be modified after creation. Tuples are used to represent fixed
collections of items that should not change throughout the program's execution.

# Creating a tuple of colors
colors = ('red', 'green', 'blue')

# Accessing elements in the tuple
print(colors[0]) # Output: 'red'
print(colors[-1]) # Output: 'blue'


# Tuples are immutable (this will raise an error)
colors[0] = 'yellow'

Brief Description:

1. Creating a Tuple: The "colors" tuple is created using parentheses and contains
the elements 'red', 'green', and 'blue'.
2. Accessing Elements: The code demonstrates how to access specific elements
within the tuple using indexing. `colors[0]` is used to access the first element,
which returns the output 'red', and `colors[-1]` accesses the last element,
returning 'blue'.
3. Tuple Immutability: The code showcases the immutability of tuples by
attempting to modify the element at index 0 using `colors[0] = 'yellow'`. Since
tuples cannot be changed after creation, this operation raises a TypeError.
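If a changed value is genuinely needed, the usual idioms are to build a new tuple or to round-trip through a list; a short sketch:

```python
colors = ('red', 'green', 'blue')

# Option 1: build a new tuple from a one-element tuple and a slice
new_colors = ('yellow',) + colors[1:]
print(new_colors)  # ('yellow', 'green', 'blue')

# Option 2: convert to a mutable list, edit, and convert back
as_list = list(colors)
as_list[0] = 'yellow'
print(tuple(as_list))  # ('yellow', 'green', 'blue')
```

In both cases the original tuple remains unchanged; a fresh object is created instead.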

Sets: Sets are unordered collections of unique elements. They are useful for
performing mathematical operations like union, intersection, and difference
efficiently.

# Creating sets
set1 = {1, 2, 3, 4, 5}
set2 = {4, 5, 6, 7, 8}

# Union of sets
union_set = set1.union(set2)
print(union_set) # Output: {1, 2, 3, 4, 5, 6, 7, 8}

# Intersection of sets
intersection_set = set1.intersection(set2)
print(intersection_set) # Output: {4, 5}

# Difference of sets
difference_set = set1.difference(set2)
print(difference_set) # Output: {1, 2, 3}


Brief Description:

1. Creating Sets: Two sets, "set1" and "set2," are created using curly braces
and contain unique elements. "set1" includes elements 1, 2, 3, 4, and 5,
while "set2" includes elements 4, 5, 6, 7, and 8.
2. Union of Sets: The code showcases the union operation using the
`union()` method. The union of "set1" and "set2" combines all unique
elements from both sets, resulting in the output `{1, 2, 3, 4, 5, 6, 7, 8}`.
3. Intersection of Sets: The code demonstrates the intersection operation
using the `intersection()` method. The intersection of "set1" and "set2"
identifies the common elements present in both sets, yielding the output
`{4, 5}`.
4. Difference of Sets: The code showcases the difference operation using the
`difference()` method. The difference of "set1" and "set2" identifies the
elements that are present in "set1" but not in "set2," resulting in the
output `{1, 2, 3}`.
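The same three operations are also available as operators, which give identical results for the sets above (unlike the methods, the operators require both operands to be sets rather than arbitrary iterables):

```python
set1 = {1, 2, 3, 4, 5}
set2 = {4, 5, 6, 7, 8}

print(set1 | set2)  # union: {1, 2, 3, 4, 5, 6, 7, 8}
print(set1 & set2)  # intersection: {4, 5}
print(set1 - set2)  # difference: {1, 2, 3}
```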

Dictionaries: Dictionaries are collections of key-value pairs (insertion-ordered
since Python 3.7). They provide fast access to values based on their corresponding
keys, making them ideal for storing and retrieving data with meaningful labels.

# Creating a dictionary of student information
student = {
    'name': 'John Doe',
    'age': 25,
    'grade': 'A'
}

# Accessing values in the dictionary
print(student['name']) # Output: 'John Doe'
print(student['grade']) # Output: 'A'

# Modifying values in the dictionary
student['age'] = 26


print(student)
# Output: {'name': 'John Doe', 'age': 26, 'grade': 'A'}

# Adding new key-value pairs to the dictionary
student['gender'] = 'Male'
print(student)
# Output: {'name': 'John Doe', 'age': 26, 'grade': 'A',
# 'gender': 'Male'}

Brief Description:

1. Creating a Dictionary: The "student" dictionary is created using curly braces
and contains key-value pairs representing student information. The keys are
'name', 'age', and 'grade', and the corresponding values are 'John Doe', 25,
and 'A', respectively.
2. Accessing Values: The code showcases how to access specific values in the
dictionary using their respective keys. For instance, `student['name']`
retrieves the value 'John Doe', and `student['grade']` retrieves the value 'A'.
3. Modifying Values: The code demonstrates how to modify the value
associated with a particular key in the dictionary. In this case, the value of the
'age' key is updated from 25 to 26 using `student['age'] = 26`. After
modification, the dictionary becomes `{'name': 'John Doe', 'age': 26, 'grade':
'A'}`.
4. Adding New Key-Value Pairs: The code showcases how to add new key-value
pairs to the dictionary. A new key 'gender' with the value 'Male' is added to
the "student" dictionary using `student['gender'] = 'Male'`. After the addition,
the dictionary becomes `{'name': 'John Doe', 'age': 26, 'grade': 'A', 'gender':
'Male'}`.
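Looking up a key that does not exist with square brackets raises a KeyError. The `get()` method offers a safer lookup with an optional default; a short sketch using the "student" dictionary in its final state:

```python
student = {'name': 'John Doe', 'age': 26, 'grade': 'A', 'gender': 'Male'}

print(student.get('grade'))         # 'A'
print(student.get('email'))         # None (missing key, default is None)
print(student.get('email', 'n/a'))  # 'n/a' (explicit default)
```

`get()` never modifies the dictionary; it only changes what happens when the key is absent.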

1.3 Accessing and Modifying Data Elements

Accessing and modifying data elements are fundamental operations in any
programming language, and Python offers a range of intuitive methods to perform
these tasks efficiently. In this section, we will explore how to access and modify data
elements in lists, tuples, sets, and dictionaries.

Accessing and Modifying Elements in Lists: Lists are mutable data structures that
allow us to store a collection of items in a sequential order. Accessing and modifying
elements within a list is a straightforward process in Python.

# Creating a list of fruits
fruits = ['apple', 'banana', 'cherry', 'date']

# Accessing elements in the list
print(fruits[0]) # Output: 'apple'
print(fruits[-1]) # Output: 'date'

# Modifying elements in the list
fruits[1] = 'grape'
print(fruits) # Output: ['apple', 'grape', 'cherry', 'date']

# Adding elements to the list
fruits.append('orange')
print(fruits)
# Output: ['apple', 'grape', 'cherry', 'date', 'orange']

# List slicing
subset = fruits[1:4]
print(subset) # Output: ['grape', 'cherry', 'date']

Brief Description:

1. Creating a List: The "fruits" list is created using square brackets and contains
the elements 'apple', 'banana', 'cherry', and 'date'.
2. Accessing Elements: The code demonstrates how to access specific elements
within the list using indexing. For instance, `fruits[0]` retrieves the first
element, which is 'apple', and `fruits[-1]` retrieves the last element, which is
'date'.


3. Modifying Elements: The code showcases how to modify the value of an
element in the list. In this case, the element at index 1, which is 'banana', is
changed to 'grape' using `fruits[1] = 'grape'`. After the modification, the list
becomes `['apple', 'grape', 'cherry', 'date']`.
4. Adding Elements: The code demonstrates how to add a new element to the
end of the list using the `append()` method. The value 'orange' is appended to
the "fruits" list, resulting in `['apple', 'grape', 'cherry', 'date', 'orange']`.
5. List Slicing: List slicing is showcased by extracting a subset of elements from
the list. The code uses slicing with `fruits[1:4]` to obtain a subset of elements
from index 1 (inclusive) to index 4 (exclusive), resulting in `['grape', 'cherry',
'date']`.

Accessing Elements in Tuples: Tuples, unlike lists, are immutable, meaning their
elements cannot be changed after creation. Accessing elements within a tuple is
similar to accessing elements in a list.

# Creating a tuple of weekdays
weekdays = ('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday')

# Accessing elements in the tuple
print(weekdays[0]) # Output: 'Monday'
print(weekdays[-1]) # Output: 'Friday'

Brief Description:

1. Creating a Tuple: The "weekdays" tuple is created using parentheses and
contains the elements 'Monday', 'Tuesday', 'Wednesday', 'Thursday', and
'Friday'.
2. Accessing Elements: The code showcases how to access specific elements
within the tuple using indexing. For instance, `weekdays[0]` retrieves the first
element, which is 'Monday', and `weekdays[-1]` retrieves the last element,
which is 'Friday'.


Since tuples are immutable, trying to modify elements in a tuple will raise a
TypeError.

Accessing and Modifying Elements in Sets: Sets are unordered collections of
unique elements. Due to their nature, indexing is not supported in sets, but we can
perform operations like adding and removing elements.

# Creating a set of prime numbers
prime_numbers = {2, 3, 5, 7, 11}

# Adding elements to the set
prime_numbers.add(13)
print(prime_numbers) # Output: {2, 3, 5, 7, 11, 13}

# Removing elements from the set
prime_numbers.remove(5)
print(prime_numbers) # Output: {2, 3, 7, 11, 13}

Brief Description:

1. Creating a Set: The "prime_numbers" set is created using curly braces and
contains the elements 2, 3, 5, 7, and 11. Since sets only store unique
elements, duplicate values are automatically removed.
2. Adding Elements: The code demonstrates how to add a new element to the
set using the `add()` method. The value 13 is added to the "prime_numbers"
set, resulting in `{2, 3, 5, 7, 11, 13}`.
3. Removing Elements: The code showcases how to remove a specific element
from the set using the `remove()` method. In this case, the element 5 is
removed from the "prime_numbers" set, resulting in `{2, 3, 7, 11, 13}`.
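`remove()` raises a KeyError if the element is absent. Its companion `discard()` removes the element when present and silently does nothing otherwise, which is often more convenient:

```python
prime_numbers = {2, 3, 7, 11, 13}

prime_numbers.discard(7)    # present: removed
prime_numbers.discard(100)  # absent: no error raised
print(prime_numbers)        # {2, 3, 11, 13}

try:
    prime_numbers.remove(100)  # absent: raises KeyError
except KeyError:
    print("100 is not in the set")
```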

Accessing and Modifying Elements in Dictionaries: Dictionaries are collections of
key-value pairs. Accessing elements in a dictionary is done using their respective
keys, and modifying the values associated with keys is a straightforward process.


# Creating a dictionary of student scores
student_scores = {
    'Alice': 85,
    'Bob': 90,
    'Charlie': 78,
    'David': 92
}

# Accessing values in the dictionary
print(student_scores['Bob']) # Output: 90

# Modifying values in the dictionary
student_scores['Charlie'] = 82
print(student_scores)
# Output: {'Alice': 85, 'Bob': 90, 'Charlie': 82, 'David': 92}

# Adding new key-value pairs to the dictionary
student_scores['Eve'] = 88
print(student_scores)
# Output: {'Alice': 85, 'Bob': 90, 'Charlie': 82, 'David': 92,
# 'Eve': 88}

Brief Description:

1. Creating a Dictionary: The "student_scores" dictionary is created using curly
braces and contains key-value pairs representing student names and their
respective scores. For example, 'Alice' scored 85, 'Bob' scored 90, 'Charlie'
scored 78, and 'David' scored 92.
2. Accessing Values: The code showcases how to access specific values in the
dictionary using their respective keys. For instance, `student_scores['Bob']`
retrieves the score of Bob, which is 90.
3. Modifying Values: The code demonstrates how to modify the value
associated with a particular key in the dictionary. In this case, Charlie's score
is updated from 78 to 82 using `student_scores['Charlie'] = 82`. After the
modification, the dictionary becomes `{'Alice': 85, 'Bob': 90, 'Charlie': 82,
'David': 92}`.
4. Adding New Key-Value Pairs: The code showcases how to add a new key-
value pair to the dictionary. A new key 'Eve' with the value 88 is added to the
"student_scores" dictionary using `student_scores['Eve'] = 88`. After the
addition, the dictionary becomes `{'Alice': 85, 'Bob': 90, 'Charlie': 82, 'David':
92, 'Eve': 88}`.


Chapter 2: Data Processing (Loops and Comprehensions)


Loops are powerful constructs that allow us to iterate over data collections, while
comprehensions offer a concise and expressive way to transform data. Whether
you're dealing with lists, dictionaries, or sets, mastering loops and comprehensions
will significantly enhance your ability to handle data efficiently and tackle complex
tasks with elegance.

2.1 Using Loops for Data Iteration

Data iteration is a fundamental operation in data processing, enabling us to
traverse through data collections and perform various tasks efficiently. In Python,
loops are powerful constructs that facilitate data iteration, allowing us to repetitively
execute a block of code for each item in a data collection.

The "for" Loop: The "for" loop is commonly used for iterating over elements in
data structures like lists, tuples, sets, and dictionaries. It iterates through each item
in the collection and executes the associated block of code until all items have been
processed.

Example: Iterating over a List

# Creating a list of numbers
numbers = [1, 2, 3, 4, 5]

# Using the "for" loop to iterate over the list
for num in numbers:
    print(num)

Brief Description:

1. Creating a List: The "numbers" list is created using square brackets and
contains the elements 1, 2, 3, 4, and 5.


2. Iterating with "for" Loop: The code uses a "for" loop to iterate over each
element in the "numbers" list. The "for" loop syntax is as follows: `for
element in list`. In this case, the loop iterates through the "numbers" list, and
the variable "num" takes on the value of each element during each iteration.
3. Printing the Elements: Inside the "for" loop, the code uses the `print()`
function to output each element to the console. The output will be the
numbers 1, 2, 3, 4, and 5, each printed on a new line.
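When the position of each element is needed alongside its value, the built-in `enumerate()` is the idiomatic choice, avoiding a manually maintained counter:

```python
numbers = [1, 2, 3, 4, 5]

# enumerate() yields (index, value) pairs, counting from 0
for index, num in enumerate(numbers):
    print(f"Index {index}: {num}")
```

`enumerate()` also accepts a `start` argument, e.g. `enumerate(numbers, start=1)`, for 1-based numbering.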

Example: Iterating over a Dictionary

# Creating a dictionary of students and their scores
student_scores = {'Alice': 85, 'Bob': 90, 'Charlie': 78}

# Using the "for" loop to iterate over the dictionary
for name, score in student_scores.items():
    print(f"{name} scored {score}")

Brief Description:

1. Creating a Dictionary: The "student_scores" dictionary is created using curly
braces and contains key-value pairs representing student names as keys and
their corresponding scores as values. For example, 'Alice' scored 85, 'Bob'
scored 90, and 'Charlie' scored 78.
2. Iterating with "for" Loop: The code uses a "for" loop with the `.items()`
method to iterate over each key-value pair in the "student_scores"
dictionary. The "for" loop syntax is as follows: `for key, value in
dictionary.items()`. In this case, during each iteration, the "name" variable
takes on the key (student name), and the "score" variable takes on the value
(student's score).
3. Printing the Information: Inside the "for" loop, the code uses string
formatting with an "f-string" to print each student's name and score to the
console. The output will display each student's name along with their
corresponding score, like "Alice scored 85," "Bob scored 90," and "Charlie
scored 78."
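Besides `.items()`, dictionaries also expose `.keys()` and `.values()` views for iterating over just one side of the mapping:

```python
student_scores = {'Alice': 85, 'Bob': 90, 'Charlie': 78}

# Iterating over keys only (also the default when looping over a dict directly)
for name in student_scores.keys():
    print(name)

# Iterating over values only
for score in student_scores.values():
    print(score)
```

Writing `for name in student_scores:` is equivalent to iterating over `.keys()`.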

The "while" Loop: The "while" loop executes a block of code repeatedly as long
as a specified condition is true. It is useful when the number of iterations is
uncertain, and the loop continues until the condition becomes false.

Example: Using a "while" Loop to Find Even Numbers

# Finding the first 5 even numbers
even_numbers = []
num = 0

while len(even_numbers) < 5:
    if num % 2 == 0:
        even_numbers.append(num)
    num += 1

print(even_numbers)

Brief Description:

1. Initializing Variables: The code creates an empty list named "even_numbers" to
store the found even numbers. Additionally, it initializes the variable "num" to 0,
which will be used to iterate through numbers to find the even ones.
2. While Loop: The code uses a while loop to keep iterating until there are 5 even
numbers in the "even_numbers" list. The condition `len(even_numbers) < 5`
checks the length of the list to determine if there are fewer than 5 even numbers.
3. Finding Even Numbers: Inside the while loop, the code checks if the current value
of "num" is even using the condition `num % 2 == 0`. If "num" is even, it is
appended to the "even_numbers" list using `even_numbers.append(num)`.
4. Incrementing "num": After each iteration of the loop, "num" is incremented by 1
using `num += 1`, allowing the while loop to check the next number for evenness.


5. Printing the Result: Once the while loop exits (when 5 even numbers are found),
the "even_numbers" list is printed, displaying the first 5 even numbers.

Loop Control Statements: Python provides loop control statements like "break"
and "continue" to alter the flow of loops. "break" is used to exit the loop
prematurely, while "continue" skips the current iteration and moves to the next.

Example: Using "break" to Find a Target Value

# Searching for a target value in a list
numbers = [10, 25, 5, 18, 30, 12]
target = 30

for num in numbers:
    if num == target:
        print(f"Target value {target} found!")
        break
else:
    print("Target value not found.")

Brief Description:

1. List and Target Value: The code creates a list named "numbers" containing
elements 10, 25, 5, 18, 30, and 12. It also sets the variable "target" to 30,
representing the value we want to find in the list.
2. "for" Loop: The code uses a "for" loop to iterate through each element in the
"numbers" list. During each iteration, the variable "num" takes on the value
of the current element in the list.
3. Comparing with Target: Inside the "for" loop, the code compares the value of
"num" with the "target" value using the condition `if num == target`. If a
match is found (the target value is equal to an element in the list), the code
prints a message indicating that the "target value" has been found and then
exits the loop using `break`.

4. "else" Block: If the "for" loop completes without finding the target value, the
code executes the "else" block, which prints the message "Target value not
found."
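"continue", by contrast, skips the rest of the current iteration and moves straight to the next one. A short sketch that sums only the non-negative values in a list:

```python
# Summing only the non-negative values in a list
numbers = [10, -3, 25, -8, 5]
total = 0

for num in numbers:
    if num < 0:
        continue  # skip negative values entirely
    total += num

print(total)  # Output: 40
```

The `continue` statement jumps back to the loop header, so `total += num` is never executed for the negative elements.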

2.2 List Comprehensions for Efficient Data Transformations

When it comes to data transformations, efficiency is of paramount importance,
especially when dealing with large datasets. In Python, list comprehensions provide
a concise and powerful way to perform data transformations on lists. List
comprehensions allow us to create new lists by applying operations to each element
of an existing list, making it an essential tool for data processing tasks.

Understanding List Comprehensions: List comprehensions are a compact and
expressive way to generate new lists based on existing ones. The syntax for list
comprehensions follows the pattern `[expression for item in list if condition]`. The
"expression" represents the operation to be applied to each "item" in the "list," and
the optional "condition" filters elements based on specified criteria.

Example: Squaring Elements in a List Using a for loop

# Using a for loop to square elements in a list
numbers = [1, 2, 3, 4, 5]
squared_numbers = []

for num in numbers:
    squared_numbers.append(num ** 2)

print(squared_numbers) # Output: [1, 4, 9, 16, 25]

Brief Description:

1. List of Numbers: The code creates a list named "numbers" containing
elements 1, 2, 3, 4, and 5.

2. Initializing an Empty List: The code creates an empty list named
"squared_numbers" to store the squared values of the elements from the
"numbers" list.
3. "for" Loop: The code uses a "for" loop to iterate through each element in the
"numbers" list. During each iteration, the variable "num" takes on the value
of the current element in the list.
4. Squaring the Elements: Inside the "for" loop, the code squares each element
(num) using the exponentiation operator (`num ** 2`) and appends the
squared value to the "squared_numbers" list using
`squared_numbers.append(num ** 2)`.
5. Printing the Result: After the "for" loop completes, the "squared_numbers"
list is printed, displaying the squared values of the original elements from the
"numbers" list.

The same transformation can be achieved more concisely using a list
comprehension:

# Using a list comprehension to square elements in a list
numbers = [1, 2, 3, 4, 5]
squared_numbers = [num ** 2 for num in numbers]

print(squared_numbers) # Output: [1, 4, 9, 16, 25]

Brief Description:

1. List of Numbers: The code creates a list named "numbers" containing
elements 1, 2, 3, 4, and 5.
2. List Comprehension: The code uses a list comprehension, which is an elegant
and concise way to create a new list based on an existing list (or any iterable).
The list comprehension syntax is `[expression for item in iterable]`, where the
expression is evaluated for each item in the iterable. In this case, the
expression is `num ** 2`, which squares each element "num" in the
"numbers" list.
3. Squaring the Elements: The list comprehension iterates through each
element in the "numbers" list, squares it using `num ** 2`, and creates a new
list called "squared_numbers" with the squared values.
4. Printing the Result: After the list comprehension completes, the
"squared_numbers" list is printed, displaying the squared values of the
original elements from the "numbers" list.

Applying Conditions in List Comprehensions: List comprehensions can include
optional conditions to filter elements based on specific criteria. The condition is
specified at the end of the expression, and elements that meet the condition are
included in the new list.

Example: Selecting Even Numbers Using a for loop

# Using a for loop to select even numbers in a list
numbers = [1, 2, 3, 4, 5]
even_numbers = []

for num in numbers:
    if num % 2 == 0:
        even_numbers.append(num)

print(even_numbers) # Output: [2, 4]

Brief Description:

1. List of Numbers: The code creates a list named "numbers" containing elements 1, 2, 3, 4, and 5.
2. Initializing an Empty List: The code creates an empty list named
"even_numbers" to store the even numbers selected from the "numbers" list.

3. "for" Loop: The code uses a "for" loop to iterate through each element in the
"numbers" list. During each iteration, the variable "num" takes on the value
of the current element in the list.
4. Checking for Even Numbers: Inside the "for" loop, the code checks if the
current element (num) is even using the condition `if num % 2 == 0`. If the
number is even (the remainder of the division by 2 is 0), it is appended to the
"even_numbers" list using `even_numbers.append(num)`.
5. Printing the Result: After the "for" loop completes, the "even_numbers" list is
printed, displaying the even numbers selected from the original "numbers"
list.

With a list comprehension, the same transformation can be achieved more succinctly:

# Using a list comprehension to select even numbers in a list
numbers = [1, 2, 3, 4, 5]
even_numbers = [num for num in numbers if num % 2 == 0]

print(even_numbers) # Output: [2, 4]

Brief Description:

1. List of Numbers: The code creates a list named "numbers" containing elements 1, 2, 3, 4, and 5.
2. List Comprehension: The code uses a list comprehension to create a new list
called "even_numbers" based on the elements of the "numbers" list. The list
comprehension syntax is `[expression for item in iterable if condition]`, where
the expression is evaluated for each item in the iterable if it satisfies the
specified condition. In this case, the expression is `num`, which selects the
element "num" from the "numbers" list, and the condition is `if num % 2 ==
0`, which checks if the element is even.

3. Selecting Even Numbers: The list comprehension iterates through each element in the "numbers" list, and for each element "num", it checks if the
number is even (i.e., the remainder of the division by 2 is 0) based on the
condition `if num % 2 == 0`. If the condition is true, the element "num" is
included in the new list "even_numbers".
4. Printing the Result: After the list comprehension completes, the
"even_numbers" list is printed, displaying the even numbers selected from
the original "numbers" list.
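Beyond filtering, a list comprehension's expression may itself be a conditional expression of the form `value_if_true if condition else value_if_false`, which transforms every element instead of dropping some. The following short sketch (an additional illustration, not part of the preceding example) labels each number as "even" or "odd":

```python
# Using a conditional expression inside a list comprehension
numbers = [1, 2, 3, 4, 5]
labels = ['even' if num % 2 == 0 else 'odd' for num in numbers]

print(labels)  # Output: ['odd', 'even', 'odd', 'even', 'odd']
```

Note that this `if ... else` appears before the `for` clause because it is part of the expression, whereas a filtering `if` appears after the `for` clause.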

Nested List Comprehensions: List comprehensions can also be nested to perform more complex data transformations. Nested list comprehensions are particularly useful when working with multi-dimensional lists.

Example: Flattening a 2D List Using a for loop

# Nested list
matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Using a for loop to flatten the 2D list
flattened_list = []

for row in matrix:
    for num in row:
        flattened_list.append(num)

print(flattened_list) # Output: [1, 2, 3, 4, 5, 6, 7, 8, 9]

Brief Description:

1. Nested List (2D List): The code creates a nested list named "matrix"
containing three sublists, each representing a row in the 2D matrix. Each
sublist contains three elements, forming a 3x3 matrix.

2. Initializing an Empty List: The code creates an empty list named "flattened_list" to store the flattened (1D) version of the elements in the "matrix."
3. Nested "for" Loop: The code uses two nested "for" loops to iterate through
the elements in the nested "matrix" list. The outer "for" loop iterates through
each row (sublist) in the "matrix," and the inner "for" loop iterates through
each element "num" in each row.
4. Flattening the List: Inside the nested "for" loops, the code appends each
element "num" from the 2D "matrix" to the "flattened_list" using
`flattened_list.append(num)`.
5. Printing the Result: After the nested "for" loops complete, the "flattened_list"
is printed, displaying the flattened 1D list containing all the elements from
the original 2D "matrix."

With a nested list comprehension, the same operation can be performed more
succinctly:

# Using a nested list comprehension to flatten the 2D list
matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
flattened_list = [num for row in matrix for num in row]

print(flattened_list) # Output: [1, 2, 3, 4, 5, 6, 7, 8, 9]

Brief Description:

1. Nested List (2D List): The code creates a nested list named "matrix"
containing three sublists, each representing a row in the 2D matrix.
2. List Comprehension: The code uses a nested list comprehension to create the
"flattened_list." The list comprehension syntax is `[expression for item in
iterable for item2 in iterable2]`, where the "expression" is evaluated for each
combination of "item" and "item2" from the specified iterables. In this case,
the "expression" is simply `num`, which represents each element in the

Compiled & Edited by Muhammad Nadeem Khokhar (mnkhokhar@gmail.com) 21 | P a g e


Python by Example (Book 2: Data Manipulation and Analysis)

nested "matrix," and the nested for loops iterate through each "row" in the
"matrix" and each "num" in the "row."
3. Flattening the List: The nested list comprehension iterates through each row
of the "matrix" using the first "for" loop (`for row in matrix`), and for each
"row," it iterates through each element "num" using the second "for" loop
(`for num in row`). The "num" variable represents each individual element in
the "matrix," and these elements are directly included in the "flattened_list."
4. Printing the Result: After the nested list comprehension is complete, the
"flattened_list" is printed, displaying the flattened 1D list containing all the
elements from the original 2D "matrix."
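Nested list comprehensions can also produce nested results. As a further illustrative sketch (separate from the flattening example), the comprehension below transposes the same 3x3 "matrix", turning its rows into columns:

```python
# Using a nested list comprehension to transpose a 2D list
matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
transposed = [[row[i] for row in matrix] for i in range(len(matrix[0]))]

print(transposed)  # Output: [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
```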

2.3 Dictionary Comprehensions and Set Comprehensions

In addition to list comprehensions, Python provides two more powerful tools for
data transformations: dictionary comprehensions and set comprehensions.
Dictionary comprehensions allow us to create dictionaries with concise syntax, while
set comprehensions enable the creation of sets with unique elements effortlessly.

Dictionary Comprehensions: Dictionary comprehensions are a compact way to create dictionaries based on existing sequences like lists or other dictionaries. The
syntax for dictionary comprehensions follows the pattern `{key_expression:
value_expression for item in sequence if condition}`.

Example: Creating a Dictionary with Squared Values Using a for loop

# Using a for loop to create a dictionary with squared values
numbers = [1, 2, 3, 4, 5]
squared_dict = {}

for num in numbers:
    squared_dict[num] = num ** 2

print(squared_dict) # Output: {1: 1, 2: 4, 3: 9, 4: 16, 5: 25}

Brief Description:

1. List of Numbers: The code creates a list named "numbers" containing elements 1, 2, 3, 4, and 5.
2. Initializing an Empty Dictionary: The code creates an empty dictionary named
"squared_dict" to store the squared values with their corresponding keys.
3. "for" Loop: The code uses a "for" loop to iterate through each element in the
"numbers" list. During each iteration, the variable "num" takes on the value
of the current element in the list.
4. Creating the Dictionary: Inside the "for" loop, the code creates key-value
pairs in the "squared_dict" dictionary. The key is the current element "num"
from the "numbers" list, and the value is the square of that element,
calculated as `num ** 2`.
5. Printing the Result: After the "for" loop completes, the "squared_dict"
dictionary is printed, displaying the keys (numbers) and their corresponding
squared values.

With a dictionary comprehension, the same transformation can be achieved more succinctly:

# Using a dictionary comprehension to create a dictionary
# with squared values
numbers = [1, 2, 3, 4, 5]
squared_dict = {num: num ** 2 for num in numbers}

print(squared_dict) # Output: {1: 1, 2: 4, 3: 9, 4: 16, 5: 25}

Brief Description:

1. List of Numbers: The code creates a list named "numbers" containing elements 1, 2, 3, 4, and 5.

2. Dictionary Comprehension: The code uses a dictionary comprehension to create the "squared_dict." The dictionary comprehension syntax is
`{key_expression: value_expression for item in iterable}`, where the
key_expression and value_expression are evaluated for each "item" in the
specified iterable. In this case, the key_expression is `num`, which represents
each element in the "numbers" list, and the value_expression is `num ** 2`,
which calculates the squared value of each element.
3. Creating the Dictionary: The dictionary comprehension iterates through each
element in the "numbers" list, and for each "num", it creates a key-value pair
in the "squared_dict" dictionary. The "num" variable represents each
individual element in the "numbers" list, and the squared value `num ** 2` is
assigned as the value corresponding to the key "num" in the dictionary.
4. Printing the Result: After the dictionary comprehension is complete, the
"squared_dict" is printed, displaying the keys (numbers) and their
corresponding squared values.
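A dictionary comprehension can also iterate over an existing dictionary's `items()`. As a small additional sketch (assuming the values are unique, so no keys collide), the following swaps the keys and values of "squared_dict":

```python
# Using a dictionary comprehension to invert a dictionary
squared_dict = {1: 1, 2: 4, 3: 9, 4: 16, 5: 25}
inverted_dict = {value: key for key, value in squared_dict.items()}

print(inverted_dict)  # Output: {1: 1, 4: 2, 9: 3, 16: 4, 25: 5}
```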

Set Comprehensions: Set comprehensions provide a concise way to create sets with unique elements from sequences like lists, tuples, or other sets. The syntax for set comprehensions is similar to that of list comprehensions, with the only difference being the use of curly braces `{}` instead of square brackets `[]`.

Example: Creating a Set of Squared Values Using a for loop

# Using a for loop to create a set with squared values
numbers = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5]
squared_set = set()

for num in numbers:
    squared_set.add(num ** 2)

print(squared_set) # Output: {1, 4, 9, 16, 25}

Brief Description:

1. List of Numbers: The code creates a list named "numbers" containing elements 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5.
2. Initializing an Empty Set: The code creates an empty set named
"squared_set" to store the unique squared values.
3. "for" Loop: The code uses a "for" loop to iterate through each element in the
"numbers" list. During each iteration, the variable "num" takes on the value
of the current element in the list.
4. Creating the Set: Inside the "for" loop, the code calculates the square of the
current element "num" using `num ** 2` and adds it to the "squared_set"
using `squared_set.add(num ** 2)`.
5. Unique Values: Since sets do not allow duplicate elements, the "squared_set"
only contains unique squared values of the elements in the "numbers" list.
6. Printing the Result: After the "for" loop completes, the "squared_set" is
printed, displaying the unique squared values from the original "numbers"
list.

With a set comprehension, the same transformation can be achieved more concisely:

# Using a set comprehension to create a set with squared values
numbers = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5]
squared_set = {num ** 2 for num in numbers}

print(squared_set) # Output: {1, 4, 9, 16, 25}

Brief Description:

1. List of Numbers: The code creates a list named "numbers" containing elements 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5.

2. Set Comprehension: The code uses a set comprehension to create the "squared_set." The set comprehension syntax is `{expression for item in
iterable}`, where the "expression" is evaluated for each "item" in the
specified iterable. In this case, the expression is `num ** 2`, which calculates
the squared value of each element "num" from the "numbers" list.
3. Creating the Set: The set comprehension iterates through each element in
the "numbers" list, and for each "num", it calculates the square of the
element using `num ** 2` and adds it to the "squared_set" automatically.
4. Unique Values: Since sets do not allow duplicate elements, the set
comprehension ensures that only unique squared values are included in the
"squared_set."
5. Printing the Result: After the set comprehension is complete, the
"squared_set" is printed, displaying the unique squared values from the
original "numbers" list.

Conditional Dictionary and Set Comprehensions: Similar to list comprehensions, both dictionary and set comprehensions can include optional conditions to filter elements based on specific criteria.

Example: Filtering Even Squared Values Using a for loop

# Using a for loop to create a dictionary with squared values
# of even numbers
numbers = [1, 2, 3, 4, 5]
even_squared_dict = {}

for num in numbers:
    if num % 2 == 0:
        even_squared_dict[num] = num ** 2

print(even_squared_dict) # Output: {2: 4, 4: 16}

Brief Description:

1. List of Numbers: The code creates a list named "numbers" containing elements 1, 2, 3, 4, and 5.
2. Initializing an Empty Dictionary: The code creates an empty dictionary named
"even_squared_dict" to store the squared values of even numbers as key-
value pairs.
3. "for" Loop: The code uses a "for" loop to iterate through each element in the
"numbers" list. During each iteration, the variable "num" takes on the value
of the current element in the list.
4. Checking for Even Numbers: Inside the "for" loop, the code checks if the
current element "num" is even by using the condition `if num % 2 == 0`. If the
number is even, it proceeds to the next step; otherwise, it skips the current
iteration.
5. Creating the Dictionary: If the current element "num" is even (i.e., it satisfies
the condition `num % 2 == 0`), the code creates a key-value pair in the
"even_squared_dict" dictionary. The key is the even number "num," and the
value is the square of that number, calculated as `num ** 2`.
6. Printing the Result: After the "for" loop completes, the "even_squared_dict"
dictionary is printed, displaying the keys (even numbers) and their
corresponding squared values.

With a dictionary comprehension, the same transformation can be achieved more succinctly:

# Using a dictionary comprehension to create a dictionary
# with squared values of even numbers
numbers = [1, 2, 3, 4, 5]

even_squared_dict = {
    num: num ** 2
    for num in numbers
    if num % 2 == 0
}

print(even_squared_dict) # Output: {2: 4, 4: 16}

Brief Description:

1. List of Numbers: The code creates a list named "numbers" containing elements 1, 2, 3, 4, and 5.
2. Dictionary Comprehension: The code uses a dictionary comprehension to
create the "even_squared_dict." The dictionary comprehension syntax is
`{key_expression: value_expression for item in iterable if condition}`, where
the key_expression and value_expression are evaluated for each "item" in
the specified iterable if the "condition" is met. In this case, the
key_expression is `num`, which represents each element in the "numbers"
list, and the value_expression is `num ** 2`, which calculates the squared
value of each element. The condition `if num % 2 == 0` ensures that only even
numbers are included in the dictionary.
3. Creating the Dictionary: The dictionary comprehension iterates through each
element in the "numbers" list, and for each "num", it checks if it is even by
using the condition `if num % 2 == 0`. If the number is even, it creates a key-
value pair in the "even_squared_dict" dictionary. The key is the even number
"num," and the value is the square of that number, calculated as `num ** 2`.
4. Printing the Result: After the dictionary comprehension is complete, the
"even_squared_dict" is printed, displaying the keys (even numbers) and their
corresponding squared values.
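The same optional condition works in a set comprehension. As a short closing sketch, the following builds a set containing the squares of only the even numbers (printed sorted, since sets are unordered):

```python
# Using a set comprehension with a condition
numbers = [1, 2, 3, 4, 5, 6]
even_squared_set = {num ** 2 for num in numbers if num % 2 == 0}

print(sorted(even_squared_set))  # Output: [4, 16, 36]
```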

Chapter 3: NumPy: Foundation for Numerical Computing


NumPy empowers us to handle large datasets, perform complex mathematical
computations, and manipulate multi-dimensional arrays with ease. We will explore
creating arrays, accessing and modifying elements, performing operations,
understanding broadcasting, and extracting valuable insights from data.

3.1 Introduction to NumPy and Its Key Features

NumPy (Numerical Python) is a fundamental library for numerical computing in Python, widely used in various scientific, engineering, and data-related fields. It
provides a powerful and efficient way to handle large datasets and perform complex
mathematical operations, making it an essential tool for data analysis, machine
learning, image processing, signal processing, and more.

Key Features of NumPy:

Multi-dimensional Arrays: One of the primary features of NumPy is its support for multi-dimensional arrays. These arrays, known as "NumPy arrays" or "ndarrays,"
are similar to Python lists but offer much more functionality and efficiency for
numerical computations. NumPy arrays can have one or more dimensions, making
them versatile for representing data in various forms, such as vectors, matrices, and
tensors. The ability to work with multi-dimensional data allows for faster and more
convenient mathematical operations and data manipulations.

Fast and Efficient Operations: NumPy is built on top of highly optimized C and
Fortran libraries, enabling it to perform array operations much faster than standard
Python lists. These operations are implemented as low-level routines, making them
highly efficient and suitable for handling large datasets. The ability to perform
element-wise operations and array broadcasting allows for concise and expressive
code that operates on entire arrays at once, reducing the need for explicit loops and
improving performance.

Mathematical and Statistical Functions: NumPy provides an extensive library of mathematical and statistical functions, making it a powerful tool for numerical
computations. It includes standard arithmetic operations (addition, subtraction,
multiplication, division), trigonometric functions, exponential and logarithmic
functions, and more. Additionally, NumPy offers statistical functions for calculating
mean, median, standard deviation, variance, and other measures, making it valuable
for data analysis tasks.
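A brief sketch of these statistical functions (the sample values are illustrative only):

```python
import numpy as np

# A small sample of values (illustrative data)
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

print(np.mean(data))    # Output: 5.0
print(np.median(data))  # Output: 4.5
print(np.std(data))     # Output: 2.0 (population standard deviation)
```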

Broadcasting: Broadcasting is a unique feature of NumPy that allows arrays with different shapes to be used together in arithmetic operations. When operating on
arrays with different shapes, NumPy automatically "broadcasts" the smaller array to
match the shape of the larger array, enabling element-wise operations between
them. Broadcasting simplifies code and makes it more concise, as there is no need to
explicitly align the arrays' shapes.

Array Indexing and Slicing: NumPy provides flexible and powerful indexing and
slicing capabilities for accessing elements or subsets of an array. The indexing starts
from 0, similar to Python lists, and supports various slicing techniques, including
using slices, integer arrays, boolean arrays, and even fancy indexing. These features
make it easy to extract specific elements or subsets of data from large arrays,
enabling efficient data manipulations.
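Boolean-array indexing, mentioned above, can be sketched in a few lines: a comparison yields an array of True/False values, and indexing with that mask keeps only the elements where it is True.

```python
import numpy as np

data_array = np.array([10, 20, 30, 40, 50])

# The comparison produces a boolean mask of the same shape
mask = data_array > 25

# Indexing with the mask selects the elements where it is True
print(data_array[mask])  # Output: [30 40 50]
```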

Universal Functions (ufuncs): NumPy's universal functions, or ufuncs, are fast and vectorized functions that operate element-wise on arrays. These functions are
essential for performing element-wise mathematical operations and are significantly
faster than their Python counterparts. Ufuncs allow users to apply complex
mathematical operations efficiently to entire arrays without the need for explicit
loops, resulting in more concise and faster code.
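As a minimal ufunc sketch, `np.sin` evaluates the sine of every element in one call, with no explicit loop:

```python
import numpy as np

# Angles in radians
angles = np.array([0.0, np.pi / 2, np.pi])

# np.sin is a ufunc: it operates element-wise on the whole array
print(np.sin(angles))  # approximately [0.0, 1.0, 0.0]
```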

3.2 Creating NumPy Arrays

NumPy provides a powerful array object called "ndarray" that enables us to work
with multi-dimensional data efficiently. In this section, we will explore different
methods to create NumPy arrays and understand their flexibility and usefulness in
numerical computing.

Creating Arrays from Python Lists: One of the simplest ways to create a NumPy
array is by converting a Python list into an ndarray using the `numpy.array()`
function.

import numpy as np

# Creating a NumPy array from a Python list
data_list = [1, 2, 3, 4, 5]
numpy_array = np.array(data_list)

print(numpy_array)

In this example, we import NumPy as `np` for brevity. We then create a Python
list called `data_list` containing elements 1, 2, 3, 4, and 5. Using the `np.array()`
function, we convert the Python list into a NumPy array named `numpy_array`.

Creating Arrays Using NumPy Functions: NumPy provides several functions to create arrays with specific patterns or filled with constant values. One such function
is `numpy.zeros()`, which creates an array of zeros with a specified shape.

import numpy as np

# Creating an array of zeros with shape (3, 4)
zeros_array = np.zeros((3, 4))

print(zeros_array)

In this example, we import NumPy as `np` and use the `np.zeros()` function to
create an array of zeros with shape (3, 4).
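Alongside `np.zeros()`, NumPy provides similar constructors such as `np.ones()` and `np.full()`; a brief sketch:

```python
import numpy as np

# An array of ones with shape (2, 3)
ones_array = np.ones((2, 3))
print(ones_array)

# An array of the same shape filled with the constant value 7
full_array = np.full((2, 3), 7)
print(full_array)
```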

Creating Arrays with Sequences: NumPy provides functions to create arrays with
sequences of numbers. One such function is `numpy.arange()`, which creates an
array with a range of values. Let's consider an example:

import numpy as np

# Creating an array with values from 0 to 9
sequence_array = np.arange(10)

print(sequence_array)

In this example, we import NumPy as `np` and use the `np.arange()` function to
create an array with values from 0 to 9.
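`np.arange()` also accepts start, stop, and step arguments, and the related `np.linspace()` produces a fixed number of evenly spaced values; a short sketch:

```python
import numpy as np

# Values from 2 up to (but not including) 10, stepping by 2
print(np.arange(2, 10, 2))  # Output: [2 4 6 8]

# Five evenly spaced values from 0.0 to 1.0, endpoints included
print(np.linspace(0.0, 1.0, 5))
```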

Creating Arrays with Random Values: NumPy's `numpy.random` module allows us to create arrays filled with random values. For example, we can use
`numpy.random.rand()` to create an array with random values from a uniform
distribution between 0 and 1.

import numpy as np

# Creating an array with random values from a uniform distribution
random_array = np.random.rand(3, 4)

print(random_array)

In this example, we import NumPy as `np` and use `np.random.rand()` to create a 3x4 array with random values from a uniform distribution between 0 and 1.

3.3 Array Indexing and Slicing

Array indexing and slicing are powerful features of NumPy that allow us to access
and manipulate specific elements or subsets of elements in a NumPy array. In this
section, we will explore how to perform array indexing and slicing, providing
examples to demonstrate their utility in data manipulation and analysis.

Array Indexing: Array indexing in NumPy is similar to indexing in Python lists, where we access elements using their positions (indices).

import numpy as np

# Creating a NumPy array
data_array = np.array([10, 20, 30, 40, 50])

# Accessing the element at index 2
element_at_index_2 = data_array[2]

print(element_at_index_2)

In this example, we import NumPy as `np` and create a NumPy array called
`data_array` containing elements 10, 20, 30, 40, and 50. We then access the element
at index 2 using `data_array[2]`.

Array Slicing: Array slicing allows us to extract a subset of elements from a NumPy array based on a specified range of indices. The syntax for slicing is
`array[start:stop:step]`, where `start` is the starting index (inclusive), `stop` is the
stopping index (exclusive), and `step` is the interval between elements.

import numpy as np

# Creating a NumPy array
data_array = np.array([10, 20, 30, 40, 50])

# Slicing the array from index 1 to 4
sliced_array = data_array[1:4]

print(sliced_array)

In this example, we import NumPy as `np` and create a NumPy array called
`data_array` with elements 10, 20, 30, 40, and 50. We then use slicing to extract a
subset of elements from index 1 to 4 (exclusive) using `data_array[1:4]`.

Array Slicing with Step: We can also use the `step` parameter in slicing to skip
elements and create subarrays with a specific interval.

import numpy as np

# Creating a NumPy array with values from 0 to 9
data_array = np.arange(10)

# Slicing the array with a step of 2
sliced_array = data_array[::2]

print(sliced_array)

In this example, we import NumPy as `np` and use `np.arange()` to create a NumPy array with values from 0 to 9. We then use slicing with a step of 2
(`data_array[::2]`) to extract elements with an interval of 2.
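Negative indices and a negative step also work, just as they do with Python lists; a quick sketch:

```python
import numpy as np

data_array = np.array([10, 20, 30, 40, 50])

# Negative indices count from the end of the array
print(data_array[-1])    # Output: 50

# A step of -1 walks the array backwards, reversing it
print(data_array[::-1])  # Output: [50 40 30 20 10]
```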

Modifying Array Elements using Slicing: Slicing can also be used to modify
elements of a NumPy array.

import numpy as np

# Creating a NumPy array
data_array = np.array([10, 20, 30, 40, 50])

# Modifying elements using slicing
data_array[1:4] = [25, 35, 45]

print(data_array)

In this example, we import NumPy as `np` and create a NumPy array called
`data_array` with elements 10, 20, 30, 40, and 50. We use slicing (`data_array[1:4]`)
to access elements from index 1 to 4 (exclusive) and modify them with the values
[25, 35, 45].
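One caveat worth noting: basic slices of a NumPy array are views rather than copies, so modifying a slice also modifies the original array. A small sketch:

```python
import numpy as np

data_array = np.array([10, 20, 30, 40, 50])

# The slice is a view into data_array, not an independent copy
view = data_array[1:4]
view[0] = 99

print(data_array)  # Output: [10 99 30 40 50]
```

Call `.copy()` on a slice when an independent array is required.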

3.4 Array Operations (Element-wise and Broadcasting)

NumPy offers powerful capabilities for performing element-wise operations and broadcasting on arrays. Element-wise operations allow us to apply mathematical
operations to each element of an array independently, while broadcasting extends
the element-wise concept to arrays with different shapes, making operations
between them convenient and efficient. In this section, we will explore these
essential array operations with coding examples to illustrate their significance in
numerical computing.

Element-wise Operations: Element-wise operations involve applying mathematical functions or operators to each element of an array independently. NumPy allows us to perform element-wise operations with arithmetic operators (+, -, *, /, etc.) and various mathematical functions (sqrt, sin, cos, exp, etc.).

import numpy as np

# Creating two NumPy arrays
array1 = np.array([1, 2, 3, 4, 5])
array2 = np.array([10, 20, 30, 40, 50])

# Element-wise addition
result_addition = array1 + array2

print(result_addition)

In this example, we import NumPy as `np` and create two NumPy arrays, `array1`
and `array2`, with values [1, 2, 3, 4, 5] and [10, 20, 30, 40, 50] respectively. We
perform element-wise addition using the `+` operator (`array1 + array2`) and store
the result in `result_addition`.

Broadcasting: Broadcasting is a powerful feature that allows NumPy to perform element-wise operations on arrays with different shapes. It automatically aligns the
arrays' shapes to make element-wise operations possible, eliminating the need for
explicit loop operations.

import numpy as np

# Creating a NumPy array
array = np.array([1, 2, 3, 4, 5])

# Element-wise multiplication with a scalar
result_broadcasting = array * 10

print(result_broadcasting)

In this example, we import NumPy as `np` and create a NumPy array called
`array` with values [1, 2, 3, 4, 5]. We perform element-wise multiplication with a
scalar value (10) using the `*` operator (`array * 10`). NumPy automatically
broadcasts the scalar to match the shape of the array, and the result is stored in
`result_broadcasting`.

Element-wise Functions: NumPy allows us to apply various mathematical functions element-wise to arrays. These functions can be used to perform complex operations efficiently on arrays without the need for explicit loops.

import numpy as np

# Creating a NumPy array
array = np.array([1, 2, 3, 4, 5])

# Element-wise square root
result_sqrt = np.sqrt(array)

print(result_sqrt)

In this example, we import NumPy as `np` and create a NumPy array called `array` with values [1, 2, 3, 4, 5]. We use the `np.sqrt()` function to perform element-wise square root on the array and store the result in `result_sqrt`.

Combining Broadcasting with Element-wise Operations: Broadcasting and element-wise operations can be combined to perform operations between arrays with different shapes efficiently.

import numpy as np

# Creating two NumPy arrays
array1 = np.array([1, 2, 3])
array2 = np.array([[10], [20], [30]])

# Element-wise multiplication with broadcasting
result_broadcasting = array1 * array2

print(result_broadcasting)

In this example, we import NumPy as `np` and create two NumPy arrays, `array1` and `array2`, with values [1, 2, 3] and [[10], [20], [30]] respectively. We perform element-wise multiplication with broadcasting (`array1 * array2`). NumPy broadcasts the two shapes, (3,) and (3, 1), to a common shape of (3, 3) and then performs the element-wise multiplication, producing a 3x3 result.

Chapter 4: Data Analysis with Pandas


In the world of data manipulation and analysis, having a tool that seamlessly
handles the intricacies of data sets is essential. This is where Pandas, the Python
Data Analysis Library, steps in as a powerful ally. Whether you're a data scientist,
analyst, or enthusiast, Pandas equips you with the tools to effortlessly clean,
transform, and gain insights from data.

4.1 Getting Started with Pandas Series and DataFrames

Pandas, a cornerstone of data analysis in Python, provides two fundamental data structures: Series and DataFrames. These structures form the building blocks for managing and analyzing data efficiently.

Pandas Series: A Pandas Series is a one-dimensional array-like object that can hold various data types, including numbers, strings, and more. Each element in a
Series has a corresponding label, known as an index. This index facilitates easy data
retrieval and manipulation. Let's consider an example:

import pandas as pd

# Creating a Pandas Series
fruits = pd.Series(['apple', 'banana', 'cherry', 'date'])

print(fruits)

In this example, we import Pandas as `pd` and create a Series called `fruits` with
four elements. The output will display the Series along with its index.
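A Series can also be given explicit index labels, which then act as the lookup keys. A brief sketch with illustrative values:

```python
import pandas as pd

# A Series with custom string labels as its index (illustrative prices)
prices = pd.Series([1.5, 0.5, 3.0], index=['apple', 'banana', 'cherry'])

print(prices['banana'])  # Output: 0.5
```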

Pandas DataFrames: A Pandas DataFrame is a two-dimensional table-like structure consisting of rows and columns. It's a versatile data structure that can
handle heterogeneous data types, akin to a spreadsheet or SQL table. Let's explore
how to create a DataFrame:

import pandas as pd

# Creating a DataFrame from a dictionary
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 28, 22]}

df = pd.DataFrame(data)

print(df)


In this example, we import Pandas as `pd` and create a DataFrame `df` using a
dictionary `data`. Each key in the dictionary corresponds to a column name, and its
associated values form the column's data.

Accessing Data in Series and DataFrames: Both Series and DataFrames support
indexing and slicing for data retrieval. For Series, indexing is based on the provided
labels, while for DataFrames, it extends to both rows and columns. Let's see an
example:

import pandas as pd

# Creating a Series and DataFrame
fruits = pd.Series(['apple', 'banana', 'cherry', 'date'])
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 28, 22]}
df = pd.DataFrame(data)

# Accessing elements in a Series
# (newer pandas versions prefer the explicit fruits.iloc[1] for positional access)
print(fruits[1])  # Output: 'banana'

# Accessing columns in a DataFrame
print(df['Name'])

Here, we create a Series `fruits` and a DataFrame `df`. We showcase element
access for Series (`fruits[1]`) and column access for DataFrames (`df['Name']`).
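Slicing, mentioned above but not shown, works on both structures; a sketch that rebuilds the same `fruits` and `df` so it runs on its own:

```python
import pandas as pd

fruits = pd.Series(['apple', 'banana', 'cherry', 'date'])
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 28, 22]}
df = pd.DataFrame(data)

# Positional slices: the end position is excluded, as with Python lists
first_two_fruits = fruits.iloc[0:2]  # 'apple', 'banana'
first_two_rows = df.iloc[0:2]        # first two rows, all columns

print(first_two_fruits.tolist())
print(first_two_rows)
```
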

4.2 Data Indexing and Selection in Pandas

Efficiently accessing and selecting specific data within a Pandas DataFrame is a
critical skill for effective data analysis. In this section, we will explore various
techniques for indexing and selecting data using Pandas, enabling you to extract the
information you need from your datasets.

Indexing with Labels: Pandas provides the `loc` indexer to access data by labels,
both for rows and columns. Let's consider an example:


import pandas as pd

# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 28, 22]}
df = pd.DataFrame(data)

# Using loc to access data by labels
print(df.loc[1, 'Name'])  # Output: 'Bob'

Here, we create a DataFrame `df` and use the `loc` indexer to access the value in
the second row and the 'Name' column.
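`loc` also accepts whole rows, lists of labels, and label slices; note that, unlike positional slicing, a `loc` slice includes both endpoints. A sketch with the same DataFrame:

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 28, 22]}
df = pd.DataFrame(data)

# A whole row by label (returned as a Series)
row = df.loc[1]

# Label slices with loc include both endpoints: labels 0, 1 and 2
subset = df.loc[0:2, ['Name', 'Age']]

print(row['Name'])  # Bob
print(len(subset))  # 3
```
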

Indexing with Position: Pandas also provides the `iloc` indexer for accessing data
by integer position. This is particularly useful when dealing with numeric indexing.
Let's see an example:

import pandas as pd

# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 28, 22]}
df = pd.DataFrame(data)

# Using iloc to access data by position
print(df.iloc[2, 1])  # Output: 28

In this example, we use the `iloc` indexer to access the value in the third row and
the second column.

Selecting Columns: You can easily select specific columns from a DataFrame by
providing their names in a list. Let's consider an example:

import pandas as pd

# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 28, 22]}
df = pd.DataFrame(data)

# Selecting specific columns
selected_columns = df[['Name', 'Age']]

print(selected_columns)

Here, we create a DataFrame `df` and select only the 'Name' and 'Age' columns
using double square brackets.

Conditional Selection: You can also use boolean conditions to filter data within a
DataFrame. This is particularly useful for extracting rows that meet specific criteria.
Let's see an example:

import pandas as pd

# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 28, 22]}
df = pd.DataFrame(data)

# Conditional selection
young_people = df[df['Age'] < 30]

print(young_people)

In this example, we create a DataFrame `df` and use a boolean condition to
select only the rows where the 'Age' is less than 30.
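Conditions can be combined with `&` (and) and `|` (or); each condition needs its own parentheses because these operators bind more tightly than comparisons. A sketch with the same data:

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 28, 22]}
df = pd.DataFrame(data)

# Parentheses around each condition are required
selected = df[(df['Age'] > 23) & (df['Age'] < 30)]

print(selected['Name'].tolist())  # ['Alice', 'Charlie']
```
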

4.3 Data Cleaning and Handling Missing Values

In the realm of data analysis, real-world datasets often come with imperfections,
such as missing or inconsistent data. Pandas equips you with powerful tools to clean
and handle these issues, ensuring that your data is accurate and ready for analysis.

Detecting Missing Values: Pandas provides the `isna()` and `isnull()` methods to
detect missing values within a DataFrame. Let's consider an example:

import pandas as pd

# Creating a DataFrame with missing values
data = {'Name': ['Alice', 'Bob', None, 'David'],
        'Age': [25, None, 28, 22]}
df = pd.DataFrame(data)

# Detecting missing values
missing_values = df.isna()

print(missing_values)

Here, we create a DataFrame `df` with missing values and use the `isna()` method
to create a boolean DataFrame that indicates the presence of missing values.
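Because `True` counts as 1, chaining `.sum()` onto `isna()` gives a per-column count of missing entries, which is usually the first thing to check:

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', None, 'David'],
        'Age': [25, None, 28, 22]}
df = pd.DataFrame(data)

# Column-wise count of missing values
missing_counts = df.isna().sum()

print(missing_counts)  # Name: 1, Age: 1
```
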

Handling Missing Values: Pandas provides several methods for handling missing
values. The `dropna()` method allows you to remove rows or columns with missing
values. The `fillna()` method lets you replace missing values with specified values or
strategies. Let's explore an example:

import pandas as pd

# Creating a DataFrame with missing values
data = {'Name': ['Alice', 'Bob', None, 'David'],
        'Age': [25, None, 28, 22]}
df = pd.DataFrame(data)

# Dropping rows with missing values
cleaned_df = df.dropna()

print(cleaned_df)

In this example, we use the `dropna()` method to create a new DataFrame
`cleaned_df` by removing rows with missing values.

Filling Missing Values: You can use the `fillna()` method to replace missing values
with specified values or strategies. Let's see an example:


import pandas as pd

# Creating a DataFrame with missing values
data = {'Name': ['Alice', 'Bob', None, 'David'],
        'Age': [25, None, 28, 22]}
df = pd.DataFrame(data)

# Filling missing values with a specified value
filled_df = df.fillna('Unknown')

print(filled_df)

Here, we use the `fillna()` method to replace every missing value with the string
'Unknown'. Note that filling the numeric 'Age' column with a string changes that
column's dtype to object, so in practice it is common to fill each column with a
value of its own type.
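`fillna()` also accepts a dictionary mapping column names to fill values, which keeps each column's type intact; the fill values below are arbitrary choices for illustration:

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', None, 'David'],
        'Age': [25, None, 28, 22]}
df = pd.DataFrame(data)

# A different fill value per column
filled_df = df.fillna({'Name': 'Unknown', 'Age': 0})

print(filled_df)
```
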

Handling Missing Values with Strategies: You can also use strategies like mean,
median, or mode to fill missing values based on the distribution of the data. Let's
consider an example:

import pandas as pd

# Creating a DataFrame with missing values
data = {'Name': ['Alice', 'Bob', None, 'David'],
        'Age': [25, None, 28, 22]}
df = pd.DataFrame(data)

# Filling missing values with the mean of the 'Age' column
mean_age = df['Age'].mean()
df['Age'] = df['Age'].fillna(mean_age)

print(df)

In this example, we compute the mean of the 'Age' column using `.mean()` and
then fill the missing values with this mean using `.fillna()`.


4.4 Data Aggregation and Grouping

In the process of data analysis, it's often essential to aggregate and summarize
data to gain insights and draw meaningful conclusions. Pandas provides powerful
tools for data aggregation and grouping, allowing you to efficiently analyze and
manipulate data based on specific criteria.

Grouping Data: Pandas allows you to group data based on one or more columns
using the `groupby()` function. This function creates a grouped object that can be
used for aggregation. Let's consider an example:

import pandas as pd

# Creating a DataFrame
data = {'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
        'Value': [10, 20, 15, 25, 12, 18]}
df = pd.DataFrame(data)

# Grouping data by 'Category'
grouped = df.groupby('Category')

print(grouped)

Here, we create a DataFrame `df` and group the data based on the 'Category'
column using the `groupby()` function. The result is a grouped object that can be
used for further aggregation.
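Before aggregating, the grouped object itself can be inspected, for example by counting the rows in each group or extracting one group as a DataFrame:

```python
import pandas as pd

data = {'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
        'Value': [10, 20, 15, 25, 12, 18]}
df = pd.DataFrame(data)

grouped = df.groupby('Category')

# Rows per group
print(grouped.size())  # A: 3, B: 3

# All rows of one group, returned as a DataFrame
group_a = grouped.get_group('A')
print(group_a['Value'].tolist())  # [10, 15, 12]
```
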

Aggregating Data: Once you have a grouped object, you can apply various
aggregation functions to compute summary statistics for each group. Common
aggregation functions include `sum()`, `mean()`, `max()`, `min()`, and more. Let's
explore an example:

import pandas as pd

# Creating a DataFrame
data = {'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
        'Value': [10, 20, 15, 25, 12, 18]}
df = pd.DataFrame(data)

# Grouping data by 'Category' and computing the mean
grouped = df.groupby('Category')
mean_values = grouped['Value'].mean()

print(mean_values)

In this example, we group the data by 'Category' and compute the mean value of
the 'Value' column for each group using `.mean()`.

Aggregating with Multiple Functions: You can apply multiple aggregation
functions simultaneously using the `agg()` function. This allows you to compute
various summary statistics in one step. Let's consider an example:

import pandas as pd

# Creating a DataFrame
data = {'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
        'Value': [10, 20, 15, 25, 12, 18]}
df = pd.DataFrame(data)

# Grouping data by 'Category' and applying multiple aggregations
grouped = df.groupby('Category')
summary = grouped['Value'].agg(['sum', 'mean', 'max'])

print(summary)

Here, we group the data by 'Category' and apply multiple aggregation functions
(`sum`, `mean`, `max`) to the 'Value' column using `.agg()`.
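A related pattern, available in pandas 0.25 and later, is named aggregation, where each keyword names an output column and pairs a source column with a function:

```python
import pandas as pd

data = {'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
        'Value': [10, 20, 15, 25, 12, 18]}
df = pd.DataFrame(data)

# keyword = (source column, aggregation function)
summary = df.groupby('Category').agg(
    total=('Value', 'sum'),
    average=('Value', 'mean'),
)

print(summary)
```
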

Custom Aggregation: You can also define custom aggregation functions using the
`agg()` function. This allows you to perform more complex calculations based on
specific requirements. Let's explore an example:

import pandas as pd


# Creating a DataFrame
data = {'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
        'Value': [10, 20, 15, 25, 12, 18]}
df = pd.DataFrame(data)

# Custom aggregation function
def custom_agg(arr):
    return arr.sum() - arr.mean()

# Grouping data by 'Category' and applying custom aggregation
grouped = df.groupby('Category')
custom_summary = grouped['Value'].agg(custom_agg)

print(custom_summary)

In this example, we define a custom aggregation function `custom_agg()` that
calculates the sum minus the mean of an array. We then apply this custom
aggregation function to the 'Value' column within each group.

Chapter 5: Data Visualization with Matplotlib


Data visualization is a powerful tool that allows us to uncover patterns, trends,
and insights hidden within the data, making it easier to communicate complex
information and facilitate data-driven decision-making.

Matplotlib, one of the most popular and versatile visualization libraries in
Python, provides a wide range of tools for creating various types of plots, charts, and
graphs. Whether you're aiming to create simple line plots, intricate scatter plots,
informative bar charts, or detailed histograms, Matplotlib's extensive capabilities
have got you covered.

5.1 Introduction to Data Visualization

Data visualization is a powerful and essential tool in the field of data analysis and
interpretation. It involves the representation of data through visual elements such as
charts, graphs, and plots, with the primary goal of communicating complex
information in a more accessible and understandable format. Visualization goes
beyond mere aesthetics; it provides a means to uncover patterns, trends, and
insights that might otherwise remain hidden within raw data.

The importance of data visualization cannot be overstated, as it plays a crucial
role in conveying findings, supporting decision-making, and enhancing
understanding across various domains, including science, business, and academia. By
transforming data into visual representations, we can effectively present our
discoveries and narratives, making it easier for stakeholders, colleagues, and the
general audience to grasp the significance of the data.

Key Benefits of Data Visualization:

1. Clarity and Understanding: Visualizations simplify complex data by converting
it into intuitive visual forms. This clarity helps users understand the underlying
information quickly and make informed conclusions.
2. Pattern Recognition: Visualizations highlight patterns, trends, correlations,
and anomalies that might not be apparent in tabular or textual data. This
enables analysts to make data-driven decisions with greater accuracy.
3. Communication: Visual representations transcend language barriers and are
more engaging than lengthy textual explanations. They enable efficient and
effective communication of insights to a diverse audience.
4. Storytelling: Visualizations allow analysts to weave narratives around data,
creating a compelling and coherent story. This aids in presenting findings,
addressing questions, and guiding audiences through the data's narrative arc.
5. Exploration: Interactive visualizations enable users to explore data sets
dynamically, uncovering details and relationships on-demand. This promotes
a deeper understanding of the data and encourages discovery.


6. Hypothesis Testing: Visualizations assist in formulating and testing hypotheses
by visualizing data distributions and relationships, aiding in the validation or
rejection of assumptions.

Common Types of Data Visualizations:

1. Line Charts: Used to display trends over time or a sequence of data points,
line charts are effective for showing continuous data patterns.
2. Bar Charts: These charts are suitable for comparing discrete categories or
data points, making them ideal for showcasing differences or trends.
3. Scatter Plots: Scatter plots depict the relationship between two variables,
helping to identify correlations, clusters, and outliers.
4. Pie Charts: Useful for illustrating parts of a whole, pie charts provide a visual
representation of proportions and percentages.
5. Histograms: Histograms visualize the distribution of continuous data by
grouping it into bins, allowing the analysis of frequency patterns.
6. Heatmaps: Heatmaps represent data values using color intensity, making
them effective for visualizing large datasets and correlations.

5.2 Creating Basic Plots with Matplotlib

In this section, we delve into the practical realm of data visualization using
Matplotlib. We explore the creation of fundamental plot types, equipping you with
the skills to convey data insights effectively. Through concise examples and hands-on
experience, we'll uncover how to construct essential visualizations that lay the
foundation for more advanced techniques.

Line Plot: A line plot is a fundamental visualization type used to represent data
points with connected lines. It is suitable for illustrating trends over time or a
sequence of data points. Let's create a simple line plot:

import matplotlib.pyplot as plt



# Sample data
x = [1, 2, 3, 4, 5]
y = [10, 25, 15, 30, 20]

# Creating a line plot
plt.plot(x, y)

# Adding labels and title
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Simple Line Plot')

# Display the plot
plt.show()

In this example, we import Matplotlib as `plt` and create two lists, `x` and `y`,
representing data points. We use `plt.plot()` to create the line plot and `plt.xlabel()`,
`plt.ylabel()`, and `plt.title()` to add labels and a title. Finally, `plt.show()` displays the
plot.

Scatter Plot: A scatter plot is used to visualize the relationship between two
numerical variables. Each data point is represented as a dot, and patterns like
correlation or clustering become apparent. Let's create a scatter plot:

import matplotlib.pyplot as plt

# Sample data
x = [1, 2, 3, 4, 5]
y = [10, 25, 15, 30, 20]

# Creating a scatter plot
plt.scatter(x, y, color='red', marker='o')

# Adding labels and title
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Scatter Plot')


# Display the plot
plt.show()

In this example, we import Matplotlib as `plt`, create `x` and `y` lists, and use
`plt.scatter()` to generate the scatter plot. The parameters `color` and `marker`
customize the appearance of the dots. Labels and a title are added using
`plt.xlabel()`, `plt.ylabel()`, and `plt.title()`, followed by `plt.show()` to display the
plot.

Bar Chart: A bar chart is effective for comparing categorical data or discrete
values. It uses rectangular bars to represent data points, making it easy to compare
quantities across categories. Let's create a bar chart:

import matplotlib.pyplot as plt

# Sample data
categories = ['A', 'B', 'C', 'D', 'E']
values = [10, 25, 15, 30, 20]

# Creating a bar chart
plt.bar(categories, values, color='blue')

# Adding labels and title
plt.xlabel('Categories')
plt.ylabel('Values')
plt.title('Bar Chart')

# Display the plot
plt.show()

In this example, we import Matplotlib as `plt`, define `categories` and `values`
lists, and use `plt.bar()` to create the bar chart. The `color` parameter specifies the
color of the bars. Labels and a title are added, and the plot is displayed using
`plt.show()`.


Histogram: A histogram is used to visualize the distribution of a dataset by
grouping data into bins and representing their frequencies. It provides insights into
the data's underlying structure. Let's create a histogram:

import matplotlib.pyplot as plt

# Sample data
data = [10, 25, 15, 30, 20, 40, 50, 35, 10, 25]

# Creating a histogram
plt.hist(data, bins=5, color='green', edgecolor='black')

# Adding labels and title
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.title('Histogram')

# Display the plot
plt.show()

In this example, we import Matplotlib as `plt`, define a `data` list, and use
`plt.hist()` to create the histogram. The `bins` parameter specifies the number of
bins, and `color` and `edgecolor` customize the appearance. Labels and a title are
added, and the plot is displayed using `plt.show()`.
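The plot types above can also share one figure via `plt.subplots()`. The sketch below arranges a line plot and a bar chart side by side and saves the figure to a hypothetical file name; the non-interactive 'Agg' backend is selected only so the snippet runs without a display:

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen; use plt.show() interactively
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [10, 25, 15, 30, 20]

# One figure with a 1x2 grid of axes
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

ax1.plot(x, y)
ax1.set_title('Line')

ax2.bar(x, y)
ax2.set_title('Bar')

fig.tight_layout()
fig.savefig('demo_plots.png')  # hypothetical output file name
```
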

5.3 Customizing Plots: Labels, Titles, Colors, and Styles

Effective data visualization involves not only conveying information accurately
but also making visualizations engaging and informative. Matplotlib, a versatile
Python library, offers a wide range of customization options to enhance the
appearance and readability of plots. In this section, we will explore how to customize
various aspects of plots, such as labels, titles, colors, and styles, using coding
examples that highlight each customization's impact on the visualization.


Adding Labels and Titles: Clear and descriptive labels and titles provide context
and guide the audience's understanding of a plot. Let's see how to add labels and
titles to a scatter plot:

import matplotlib.pyplot as plt

# Sample data
x = [1, 2, 3, 4, 5]
y = [10, 25, 15, 30, 20]

# Creating a scatter plot
plt.scatter(x, y, color='blue', marker='o')

# Adding labels and a title
plt.xlabel('X-axis (Time)')
plt.ylabel('Y-axis (Value)')
plt.title('Scatter Plot: Value vs. Time')

# Display the plot
plt.show()

In this example, we utilize `plt.xlabel()` and `plt.ylabel()` to add labels to the x and
y axes, respectively. The `plt.title()` function adds a title to the plot, enhancing its
context and clarity.

Customizing Colors and Styles: Matplotlib allows you to choose colors and styles
that align with your visualization's purpose and aesthetic. Let's customize the style
and color of a line plot:

import matplotlib.pyplot as plt

# Sample data
x = [1, 2, 3, 4, 5]
y = [10, 25, 15, 30, 20]

# Creating a line plot with customized style and color
plt.plot(x, y, color='green', linestyle='--',
         marker='s', label='Data Points')


# Adding labels and a title
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Customized Line Plot')

# Adding a legend
plt.legend()

# Display the plot
plt.show()

Here, we use the `color`, `linestyle`, and `marker` parameters in `plt.plot()` to
customize the appearance of the line plot. The chosen style and color enhance the
plot's visual appeal.
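Instead of styling each call, Matplotlib also ships whole style sheets; `plt.style.use()` applies one to every subsequent plot. 'ggplot' is one of the bundled names, and the 'Agg' backend is set only so the sketch runs without a display:

```python
import matplotlib
matplotlib.use('Agg')  # off-screen rendering
import matplotlib.pyplot as plt

print(len(plt.style.available))  # number of bundled style sheets

# Apply a bundled style sheet to all later plots
plt.style.use('ggplot')

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [10, 25, 15])
ax.set_title('Styled with ggplot')
```
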

Color Maps and Colorbars: Color maps are crucial for visualizing data with color
intensity. They are particularly useful for heatmaps and contour plots. Let's use a
color map and colorbar with a heatmap:

import matplotlib.pyplot as plt
import numpy as np

# Sample data
data = np.random.rand(5, 5)

# Creating a heatmap with color map and colorbar
plt.imshow(data, cmap='viridis')

# Adding a colorbar
plt.colorbar()

# Display the plot
plt.show()

In this example, we use `plt.imshow()` with the `cmap` parameter to apply the
'viridis' color map to the heatmap. The `plt.colorbar()` function adds a colorbar to
indicate the color mapping.

Styling Text and Annotations: Annotations and text enhance plot clarity by
providing additional context. Let's add annotations and text to a bar chart:

import matplotlib.pyplot as plt

# Sample data
categories = ['A', 'B', 'C', 'D', 'E']
values = [10, 25, 15, 30, 20]

# Creating a bar chart with annotations and text
plt.bar(categories, values, color='purple', label='Data Bars')

# Adding annotations
for i, v in enumerate(values):
    plt.text(i, v + 1, str(v), color='black', ha='center')

# Adding labels and a title
plt.xlabel('Categories')
plt.ylabel('Values')
plt.title('Bar Chart with Annotations')

# Display the plot
plt.show()

Using `plt.text()`, we add annotations above each bar to display the
corresponding value. This annotation enhances the audience's understanding of the
data distribution.

5.4 Plotting Data from NumPy Arrays and Pandas DataFrames

Data visualization often involves working with data stored in NumPy arrays and
Pandas DataFrames, which are powerful data structures commonly used in Python
for data manipulation and analysis. Matplotlib, a versatile plotting library, seamlessly
integrates with these structures to create insightful visualizations.


Plotting from NumPy Arrays: NumPy arrays provide a foundation for numerical
computing, and Matplotlib can visualize this data effectively. Let's create a simple
line plot from a NumPy array:

import numpy as np
import matplotlib.pyplot as plt

# Generating data using NumPy
x = np.linspace(0, 10, 100)
y = np.sin(x)

# Creating a line plot from NumPy arrays
plt.plot(x, y, label='Sine Curve')

# Adding labels and a title
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Plotting from NumPy Array')

# Adding a legend
plt.legend()

# Display the plot
plt.show()

Here, we generate a NumPy array `x` with values evenly spaced between 0 and
10. The `np.sin()` function calculates the sine of each value in `x`, creating a
sinusoidal curve. We then use `plt.plot()` to create a line plot from the NumPy
arrays.

Plotting from Pandas DataFrames: Pandas DataFrames offer powerful data
manipulation capabilities, and Matplotlib complements this by facilitating data
visualization. Let's create a bar plot from a Pandas DataFrame:

import pandas as pd
import matplotlib.pyplot as plt


# Creating a sample Pandas DataFrame
data = {'Category': ['A', 'B', 'C', 'D', 'E'],
        'Value': [10, 25, 15, 30, 20]}

df = pd.DataFrame(data)

# Creating a bar plot from Pandas DataFrame
plt.bar(df['Category'], df['Value'], color='orange',
        label='Data Bars')

# Adding labels and a title
plt.xlabel('Categories')
plt.ylabel('Values')
plt.title('Plotting from Pandas DataFrame')

# Adding a legend
plt.legend()

# Display the plot
plt.show()

We construct a Pandas DataFrame `df` with categories and corresponding values.
Using `plt.bar()`, we create a bar plot from the DataFrame's columns. This example
demonstrates how Matplotlib seamlessly integrates with Pandas DataFrames for
visualization.

Combining Plotting with NumPy and Pandas: Matplotlib can visualize data
derived from both NumPy arrays and Pandas DataFrames within the same plot. Let's
illustrate this by overlaying a line plot and scatter plot:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Generating data using NumPy
x = np.linspace(0, 10, 100)
y = np.sin(x)


# Creating a Pandas DataFrame
data = {'X': x, 'Y': y}
df = pd.DataFrame(data)

# Creating a line plot and scatter plot in the same plot
plt.plot(df['X'], df['Y'], label='Sine Curve')
plt.scatter(df['X'][::10], df['Y'][::10], color='red',
            label='Sample Points')

# Adding labels and a title
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Combining Plotting with NumPy and Pandas')

# Adding a legend
plt.legend()

# Display the plot
plt.show()

In this example, we generate a Pandas DataFrame `df` with columns 'X' and 'Y'
containing the NumPy-generated values. The `plt.plot()` function creates a line plot,
and `plt.scatter()` overlays selected data points as red dots.

Chapter 6: Advanced Data Manipulation Techniques


This chapter introduces advanced techniques that empower you to wield data
with even greater precision and flexibility. Building upon the fundamental concepts
covered earlier, we'll explore strategies for reshaping and combining data, paving
the way for intricate analyses and comprehensive insights. From merging and
pivoting to advanced concatenation methods, this chapter equips you with the tools
to navigate complex data structures and orchestrate them harmoniously for more
sophisticated data handling.


6.1 Data Merging and Joining in Pandas

In practice, one frequently needs to combine datasets that originate from
different sources or hold related information. Pandas provides powerful tools for
merging and joining data, allowing data professionals to seamlessly integrate
disparate datasets and unlock deeper insights. In this section, we'll explore the
techniques of data merging and joining using Pandas.

Concatenating DataFrames: Concatenation is the process of stacking or
combining DataFrames along a specified axis. This technique proves valuable when
dealing with data partitioned into separate but related pieces. Consider this
example:

import pandas as pd

# Creating sample DataFrames
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2']})

df2 = pd.DataFrame({'A': ['A3', 'A4', 'A5'],
                    'B': ['B3', 'B4', 'B5']})

# Concatenating along rows (axis=0)
result = pd.concat([df1, df2])

print(result)

In this case, two DataFrames, `df1` and `df2`, are concatenated along the rows
using `pd.concat()`. The resulting DataFrame, `result`, contains all rows from both
input DataFrames.
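One detail worth noting: the concatenated frame keeps each input's row labels, so 0, 1, 2 appear twice above. Passing `ignore_index=True` renumbers the rows instead:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2']})
df2 = pd.DataFrame({'A': ['A3', 'A4', 'A5'],
                    'B': ['B3', 'B4', 'B5']})

# Build a fresh 0..5 index for the combined frame
result = pd.concat([df1, df2], ignore_index=True)

print(result.index.tolist())  # [0, 1, 2, 3, 4, 5]
```
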

Merging DataFrames: Merging involves combining DataFrames based on
common columns. Pandas offers various types of joins, such as inner, outer, left, and
right joins. Let's explore an inner join:


import pandas as pd

# Creating sample DataFrames
left = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                     'value_left': ['V0', 'V1', 'V2']})

right = pd.DataFrame({'key': ['K1', 'K2', 'K3'],
                      'value_right': ['V3', 'V4', 'V5']})

# Performing an inner merge
merged_df = pd.merge(left, right, on='key')

print(merged_df)

Here, the `pd.merge()` function performs an inner join on the 'key' column of the
`left` and `right` DataFrames, producing a merged DataFrame with only the matching
rows.
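The join type is controlled by the `how` parameter of `pd.merge()`; an outer join, for example, keeps the non-matching keys ('K0' and 'K3') and fills the missing side with NaN:

```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                     'value_left': ['V0', 'V1', 'V2']})
right = pd.DataFrame({'key': ['K1', 'K2', 'K3'],
                      'value_right': ['V3', 'V4', 'V5']})

# Outer join: union of the keys from both frames
outer = pd.merge(left, right, on='key', how='outer')

print(sorted(outer['key']))  # ['K0', 'K1', 'K2', 'K3']
```
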

Joining DataFrames on Index: In addition to merging on columns, Pandas
enables joining on the index. This is particularly useful when the indices themselves
convey meaningful information. Let's explore this with an example:

import pandas as pd

# Creating sample DataFrames
left = pd.DataFrame({'A': ['A0', 'A1', 'A2']},
                    index=['K0', 'K1', 'K2'])

right = pd.DataFrame({'B': ['B0', 'B1', 'B2']},
                     index=['K1', 'K2', 'K3'])

# Joining on index
joined_df = left.join(right)

print(joined_df)


The `left.join(right)` operation joins the DataFrames based on their indices.
Non-matching indices result in NaN values, providing a consolidated view of data
from both DataFrames.

6.2 Reshaping Data: Pivoting, Melting, and Stack/Unstack

Data rarely conforms to a single structure, and effective data manipulation often
requires reshaping to facilitate analysis. Pandas provides powerful tools for
reshaping data, enabling data professionals to transform data between wide and
long formats seamlessly. In this section, we'll explore key reshaping techniques,
including pivoting, melting, and using `stack` and `unstack` methods.

Pivoting DataFrames: Pivoting involves transforming data from a long format to
a wide format, making it easier to analyze. Consider the following example:

import pandas as pd

# Creating a sample DataFrame
data = {'Date': ['2021-01-01', '2021-01-01', '2021-01-02'],
        'Variable': ['A', 'B', 'A'],
        'Value': [10, 20, 15]}

df = pd.DataFrame(data)

# Pivoting the DataFrame
pivot_df = df.pivot(index='Date',
                    columns='Variable', values='Value')

print(pivot_df)

In this case, the `pivot()` method transforms the DataFrame `df` by using 'Date'
as the index, 'Variable' as the columns, and 'Value' as the values. This operation
creates a pivoted DataFrame, `pivot_df`, which provides a clearer view of the data.
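`pivot()` raises an error when an index/column pair occurs more than once; `pivot_table()` handles such duplicates by aggregating them. In the sketch below, the two rows sharing ('2021-01-01', 'A') are averaged; the extra data row is invented for the demonstration:

```python
import pandas as pd

# Two rows share the ('2021-01-01', 'A') pair, so pivot() would fail here
data = {'Date': ['2021-01-01', '2021-01-01', '2021-01-01', '2021-01-02'],
        'Variable': ['A', 'A', 'B', 'A'],
        'Value': [10, 30, 20, 15]}
df = pd.DataFrame(data)

# Duplicates are aggregated; mean is the chosen aggfunc here
table = df.pivot_table(index='Date', columns='Variable',
                       values='Value', aggfunc='mean')

print(table.loc['2021-01-01', 'A'])  # 20.0, the mean of 10 and 30
```
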


Melting DataFrames: Melting is the reverse of pivoting, converting wide-format
data back into a long format. This can be especially useful for data analysis and
visualization. Let's explore melting with an example:

import pandas as pd

# Creating a sample DataFrame
data = {'Date': ['2021-01-01', '2021-01-01', '2021-01-02'],
        'A': [10, 20, 15],
        'B': [25, 30, 35]}

df = pd.DataFrame(data)

# Melting the DataFrame
melted_df = df.melt(id_vars='Date', var_name='Variable',
                    value_name='Value')

print(melted_df)

The `melt()` function converts the wide-format DataFrame `df` into a long-format
DataFrame, `melted_df`, where 'Date' is the identifier variable, 'Variable' represents
the original column names, and 'Value' contains the corresponding values.

Stack and Unstack: The `stack()` and `unstack()` methods provide a dynamic way
to reshape data by moving levels of the DataFrame's column index to become the
row index or vice versa. Let's explore this concept:

import pandas as pd

# Creating a sample DataFrame
data = {'Date': ['2021-01-01', '2021-01-02'],
        'A': [10, 20],
        'B': [25, 35]}

df = pd.DataFrame(data)

# Setting 'Date' as the index
indexed_df = df.set_index('Date')

# Stacking and unstacking
stacked_df = indexed_df.stack()
unstacked_df = stacked_df.unstack()

print(unstacked_df)

In this example, `stack()` and `unstack()` are used to reshape the DataFrame.
Initially, 'Date' is set as the index using `set_index()`. Then, `stack()` converts
columns into rows, and `unstack()` reverses the process, restoring the original
DataFrame structure.
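The intermediate stacked form is worth inspecting on its own. As a quick sketch, rebuilding the example above and printing the stacked result shows a Series whose two-level index pairs each date with an original column name:

```python
import pandas as pd

# Rebuild the indexed DataFrame from the example above
data = {'Date': ['2021-01-01', '2021-01-02'],
        'A': [10, 20],
        'B': [25, 35]}
indexed_df = pd.DataFrame(data).set_index('Date')

# stack() returns a Series with a two-level (Date, column) MultiIndex
stacked = indexed_df.stack()
print(stacked)

# Individual values can be looked up by (date, column) pairs
print(stacked.loc[('2021-01-01', 'B')])  # 25
```

Because the stacked result is an ordinary Series, all the usual Series operations (filtering, aggregation, `.loc` lookups) apply to it directly.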

6.3 Combining DataFrames with Concatenation and Appending

As datasets grow and evolve, the need to combine multiple DataFrames into a
cohesive structure becomes paramount. Pandas offers powerful methods for
combining DataFrames, allowing data professionals to seamlessly merge data from
various sources. In this section, we'll explore the techniques of concatenation and
appending, demonstrating how to merge DataFrames both vertically and
horizontally.

Concatenating DataFrames Vertically: Concatenation involves stacking
DataFrames along a common axis, and is particularly useful when dealing with
similar data split across multiple sources. Consider this example:

import pandas as pd

# Creating sample DataFrames
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2']})

df2 = pd.DataFrame({'A': ['A3', 'A4', 'A5'],
                    'B': ['B3', 'B4', 'B5']})

# Concatenating DataFrames vertically
concatenated_df = pd.concat([df1, df2])


print(concatenated_df)

Here, the `pd.concat()` function is used to concatenate `df1` and `df2` vertically.
The resulting DataFrame, `concatenated_df`, combines the rows from both inputs;
note that the original index labels repeat unless you pass `ignore_index=True`.

Concatenating DataFrames Horizontally: Concatenation can also be performed
along columns, allowing for the aggregation of related information from different
DataFrames. Let's illustrate this with an example:

import pandas as pd

# Creating sample DataFrames
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2']})

df2 = pd.DataFrame({'C': ['C0', 'C1', 'C2'],
                    'D': ['D0', 'D1', 'D2']})

# Concatenating DataFrames horizontally
concatenated_df = pd.concat([df1, df2], axis=1)

print(concatenated_df)

In this case, `pd.concat()` with `axis=1` concatenates `df1` and `df2` horizontally,
merging columns from both DataFrames. The resulting DataFrame,
`concatenated_df`, presents the combined information side by side.

Appending DataFrames: Appending is a convenience variant of concatenation
that stacks one DataFrame on top of the other along the row axis. Note that
`DataFrame.append()` was deprecated in pandas 1.4 and removed in pandas 2.0, so
current code should use `pd.concat()` instead. Let's see an example:

import pandas as pd

# Creating sample DataFrames
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2']})

df2 = pd.DataFrame({'A': ['A3', 'A4', 'A5'],
                    'B': ['B3', 'B4', 'B5']})

# Appending df2 to df1 (df1.append(df2) in pandas < 2.0)
appended_df = pd.concat([df1, df2])

print(appended_df)

Stacking `df2` on top of `df1` produces `appended_df`. The two DataFrames share
the same columns, and the result consolidates their rows. (In pandas versions
before 2.0, the same result came from `df1.append(df2)`.)

Chapter 7: Working with Time Series Data


Time series data, characterized by its sequential nature and timestamped
observations, holds a pivotal role across a wide spectrum of industries and
disciplines. Whether you're delving into financial markets, studying climate patterns,
or analyzing user behaviors, the ability to effectively handle and derive insights from
time-ordered data is of paramount importance. Throughout this chapter, we'll
explore the intricacies of time series data manipulation, visualization, and analysis
using Python and its powerful libraries.

7.1 Handling Time and Date Data in Python

Time and date data are fundamental elements in many real-world datasets,
providing context and structure to observations. Python offers robust libraries for
handling and manipulating time-related information, enabling data professionals to
effectively manage temporal data. In this section, we'll cover everything from
creating and formatting dates to performing arithmetic operations and handling
time zones.


Working with the `datetime` Module: The `datetime` module in Python provides
classes and functions for working with dates and times. Let's start by creating and
formatting dates:

import datetime

# Creating a date object
today = datetime.date.today()
print(today) # Output: YYYY-MM-DD

# Formatting a date
formatted_date = today.strftime('%d-%m-%Y')
print(formatted_date) # Output: DD-MM-YYYY

In this example, we create a `date` object using `datetime.date.today()` and then
format it using `strftime()` to achieve the desired presentation.

Performing Date Arithmetic: Date arithmetic allows us to perform operations
like addition and subtraction on dates. Let's see how to calculate the difference
between two dates:

import datetime

# Creating date objects
date1 = datetime.date(2023, 7, 1)
date2 = datetime.date(2023, 7, 15)

# Calculating the difference between dates
date_difference = date2 - date1
print(date_difference.days)  # Output: 14

Here, we calculate the difference between `date2` and `date1`, which yields a
`timedelta` object. By accessing the `days` attribute, we obtain the difference in
days.
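Date arithmetic also works in the other direction: adding a `timedelta` to a date produces a new, shifted date. A small sketch:

```python
import datetime

start = datetime.date(2023, 7, 1)

# Adding a timedelta moves a date forwards; subtracting moves it back
due_date = start + datetime.timedelta(days=30)
print(due_date)  # 2023-07-31
```

The same pattern works with `datetime.datetime` objects, where the `timedelta` can also carry hours, minutes, and seconds.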


Working with `pandas` Timestamps: The `pandas` library extends time handling
capabilities with its `Timestamp` object, enhancing time series data manipulation.
Let's explore creating and indexing `Timestamps`:

import pandas as pd

# Creating a Timestamp
timestamp = pd.Timestamp('2023-07-01 09:00:00')
print(timestamp) # Output: 2023-07-01 09:00:00

# Indexing with Timestamps
data = {'values': [10, 20, 15]}
df = pd.DataFrame(data, index=[timestamp,
                               timestamp + pd.Timedelta(days=1),
                               timestamp + pd.Timedelta(days=2)])
print(df)

In this example, we create a `Timestamp` and then use it to build the index of a
`pandas` DataFrame. The `pd.Timedelta` class lets us represent and manipulate
time spans.

Handling Time Zones: Time zones are crucial when dealing with global data.
`pandas` simplifies time zone handling, making it easier to work with diverse
temporal datasets:

import pandas as pd

# Creating Timestamps with time zones
timestamp_utc = pd.Timestamp('2023-07-01 12:00:00', tz='UTC')
timestamp_est = timestamp_utc.tz_convert('US/Eastern')

print(timestamp_utc)  # Output: 2023-07-01 12:00:00+00:00 (UTC)
print(timestamp_est)  # Output: 2023-07-01 08:00:00-04:00 (Eastern Time)

Here, we create a `Timestamp` in UTC, then convert it to Eastern Time using the
`tz_convert()` method.
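A related method, `tz_localize()`, handles the opposite situation: a timestamp created without a zone is naive, and `tz_localize()` attaches a zone to it (whereas `tz_convert()` translates an already-aware timestamp). A brief sketch:

```python
import pandas as pd

# A timestamp created without a zone is "naive"
naive = pd.Timestamp('2023-07-01 12:00:00')
print(naive.tz)  # None

# Attach a zone first, then convert to another one
aware = naive.tz_localize('UTC')
print(aware.tz_convert('US/Eastern'))
```

Calling `tz_convert()` on a naive timestamp raises an error, which is a useful guard against silently mixing naive and aware values.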


7.2 Time Series Indexing and Slicing with Pandas

Time series data, characterized by its sequential nature, requires specialized
indexing and slicing techniques for effective analysis. Pandas, a versatile data
manipulation library, offers powerful tools for working with time-based data. In this
section, we'll explore how to index and slice time series data using Pandas, enabling
you to extract and manipulate temporal observations with precision and ease.

Creating a Time Series DataFrame: To begin, let's create a time series DataFrame
using Pandas. We'll generate a sample dataset with timestamped data points:

import pandas as pd
import numpy as np

# Creating a time range
time_range = pd.date_range(start='2023-01-01', periods=10, freq='D')

# Creating a DataFrame with random data
data = {'values': np.random.randint(1, 100, size=10)}
time_series_df = pd.DataFrame(data, index=time_range)

print(time_series_df)

Here, we create a time range using `pd.date_range()` and use it as an index for a
DataFrame containing random data. This establishes a time series dataset for
exploration.

Indexing by Date and Time: Pandas allows indexing using specific dates or date
ranges. Let's demonstrate this by indexing data for a particular date:

# Indexing by specific date
specific_date = '2023-01-05'
print(time_series_df.loc[specific_date])

Using `.loc[]`, we can access data for a specific date, extracting the corresponding
row from the DataFrame.

Slicing Time Series Data: Slicing empowers us to extract specific time periods
from a time series. Let's slice the data for a range of dates:

# Slicing a date range
date_range_slice = time_series_df['2023-01-03':'2023-01-07']
print(date_range_slice)

By providing a date range as the index, we use slicing to extract data between
the specified dates, creating a new DataFrame.
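Pandas also supports partial-string indexing on a `DatetimeIndex`: passing just a year or a year-month to `.loc[]` selects every row that falls within that period. A sketch using the same kind of sample frame as above:

```python
import pandas as pd
import numpy as np

# Ten daily observations starting January 1st, 2023
time_range = pd.date_range(start='2023-01-01', periods=10, freq='D')
time_series_df = pd.DataFrame(
    {'values': np.random.randint(1, 100, size=10)}, index=time_range)

# All rows from January 2023 (here, the whole frame)
january = time_series_df.loc['2023-01']
print(january.shape)
```

This is often more convenient than writing out a full start/end slice when the period boundary is a calendar unit.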

Resampling Time Series Data: Resampling is useful for changing the frequency of
time series data. Let's demonstrate resampling by aggregating data to a weekly
frequency:

# Resampling to a weekly frequency
weekly_resampled = time_series_df.resample('W').sum()
print(weekly_resampled)

The `resample()` function aggregates the data to a weekly frequency, summing
the values within each week.

Shifting Time Series Data: Shifting allows us to move data points forwards or
backwards in time. Let's shift our data by one time step:

# Shifting data by one time step
shifted_data = time_series_df.shift(1)
print(shifted_data)

Using `shift()`, we displace every observation by one time step; the first row
becomes `NaN` because no earlier value exists to move into it.
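A common use of shifting is computing period-over-period change: subtracting the shifted series from the original (equivalent to calling `diff()`) gives the change at each step. A small sketch:

```python
import pandas as pd

# Five daily observations
time_range = pd.date_range(start='2023-01-01', periods=5, freq='D')
ts = pd.Series([10, 12, 9, 15, 20], index=time_range)

# Day-over-day change; the first entry is NaN (no prior value)
change = ts - ts.shift(1)
print(change)
```

For ratios instead of differences, the analogous helper is `pct_change()`, which divides by the shifted series.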

7.3 Resampling and Frequency Conversion

Time series data often comes with varying frequencies, which can make analysis
and comparison challenging. Resampling, a crucial technique in time series analysis,
allows us to change the frequency of our data, enabling better insight extraction and
trend identification. In this section, we'll delve into resampling and frequency
conversion using the powerful Pandas library.

Upsampling and Downsampling: Upsampling involves increasing the frequency
of time series data, while downsampling involves reducing the frequency. Let's
explore both concepts using a sample time series:

import pandas as pd
import numpy as np

# Creating a time range
time_range = pd.date_range(start='2023-01-01', periods=10, freq='D')

# Creating a DataFrame with random data
data = {'values': np.random.randint(1, 100, size=10)}
time_series_df = pd.DataFrame(data, index=time_range)

# Upsampling to hourly frequency
hourly_upsampled = time_series_df.resample('H').ffill()
print(hourly_upsampled)

# Downsampling to weekly frequency
weekly_downsampled = time_series_df.resample('W').mean()
print(weekly_downsampled)

In this example, we upsample our daily data to an hourly frequency using
`resample()` and forward-fill missing values (`ffill()`) to maintain consistency.
Additionally, we downsample our data to a weekly frequency and calculate the
mean for each week.

Applying Aggregation Functions: Resampling enables us to apply various
aggregation functions to summarize data within the new frequency. Let's explore
this by resampling to a monthly frequency and using the `sum()` and `max()`
functions:

Compiled & Edited by Muhammad Nadeem Khokhar (mnkhokhar@gmail.com) 69 | P a g e


Python by Example (Book 2: Data Manipulation and Analysis)

# Resampling to monthly frequency
monthly_resampled_sum = time_series_df.resample('M').sum()
monthly_resampled_max = time_series_df.resample('M').max()

print(monthly_resampled_sum)
print(monthly_resampled_max)

Here, we demonstrate resampling to a monthly frequency and showcase two
different aggregation functions, `sum()` and `max()`, that provide insight into the
cumulative and peak values for each month.

Handling Missing Data: Resampling can lead to missing data points, especially
when upsampling. Handling missing data is crucial for accurate analysis. Let's
address this using a combination of resampling and interpolation:

# Upsampling with linear interpolation
upsampled_interpolated = time_series_df.resample('6H').interpolate()

print(upsampled_interpolated)

In this example, we upsample our data to a 6-hour frequency and employ linear
interpolation (`interpolate()`) to estimate missing values, enhancing the accuracy of
our upsampled dataset.

Using Custom Resampling Methods: Pandas allows custom aggregation
functions for resampling. Let's explore resampling using a custom aggregation
method that calculates the range between maximum and minimum values:

# Custom resampling function
def custom_resampler(arr):
    return arr.max() - arr.min()

# Applying the custom resampler
custom_resampled = time_series_df.resample('W').apply(custom_resampler)

print(custom_resampled)


Here, we define a custom resampling function that computes the range between
maximum and minimum values. We then apply this function to downsample our
data to a weekly frequency, gaining insights into the variability within each week.

Chapter 8: Data Analysis Case Study


Practical application is where the true power of acquired skills and knowledge
comes to life. This chapter will take you through a comprehensive case study,
illuminating the process of extracting valuable insights from real-world datasets
using Python and its data manipulation tools.

8.1 Analyzing Real-World Datasets with Python

The ability to extract meaningful insights from real-world datasets is a
fundamental skill. This section will guide you through the process of analyzing
real-world datasets using Python, demonstrating how to transform raw data into
actionable knowledge.

Exploratory Data Analysis (EDA): Exploratory Data Analysis is the first step in
analyzing any dataset. Let's dive into EDA using Python and the Pandas library:

import pandas as pd

# Load a dataset
url = ('https://raw.githubusercontent.com/datasciencedojo/'
'datasets/master/titanic.csv')
data = pd.read_csv(url)

# Display basic statistics
print(data.describe())

# Check for missing values
print(data.isnull().sum())


Here, we load the Titanic dataset from an online source and perform basic
exploratory analysis. We display statistical summaries and identify missing values
using the Pandas library.

Data Visualization: Visualizing data is crucial for gaining insights. Let's use
Matplotlib and Seaborn to create visualizations:

import matplotlib.pyplot as plt
import seaborn as sns

# Create a histogram
plt.figure(figsize=(8, 5))
sns.histplot(data['Age'].dropna(), bins=20, kde=True)
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

# Create a bar plot
plt.figure(figsize=(10, 6))
sns.barplot(x='Sex', y='Fare', data=data)
plt.title('Average Fare by Gender')
plt.show()

In this example, we use Matplotlib and Seaborn to create a histogram of the age
distribution and a bar plot comparing the average fare by gender, enhancing our
understanding of the data's characteristics.

Data Transformation: Data transformation is essential for preparing data for
analysis. Let's encode categorical variables and create new features:

# Encode categorical variables
data['Sex'] = data['Sex'].map({'male': 0, 'female': 1})

# Create a new feature
data['FamilySize'] = data['SibSp'] + data['Parch'] + 1


Here, we encode the 'Sex' variable into numerical values and create a new
feature, 'FamilySize', by combining 'SibSp' and 'Parch'. These transformations
enhance the dataset's suitability for analysis.

Correlation Analysis: Understanding correlations between variables is crucial.
Let's compute and visualize correlations:

# Compute correlation matrix (numeric columns only)
correlation_matrix = data.corr(numeric_only=True)

# Visualize correlations using a heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

In this example, we compute the correlation matrix and use a heatmap to
visualize correlations among variables, gaining insights into relationships within the
dataset.

Data Filtering: Filtering data allows us to focus on specific subsets. Let's filter
passengers who survived:

# Filter survivors
survivors = data[data['Survived'] == 1]

# Display statistics for survivors
print(survivors.describe())

Here, we filter the dataset to isolate survivors and display statistical summaries
specifically for this subset, aiding our understanding of survivor demographics.
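Filtering pairs naturally with grouping: instead of isolating one subset, `groupby()` computes a statistic for every subset at once. The sketch below uses a small hypothetical frame, but the same call works on the Titanic `data` loaded above (e.g. survival rate per passenger class):

```python
import pandas as pd

# Toy stand-in for the Titanic columns 'Pclass' and 'Survived'
toy = pd.DataFrame({'Pclass': [1, 1, 2, 3, 3, 3],
                    'Survived': [1, 1, 1, 0, 0, 1]})

# Mean of a 0/1 column per group is the survival rate for that group
rate_by_class = toy.groupby('Pclass')['Survived'].mean()
print(rate_by_class)
```

One filtered-and-described subset answers a single question; a grouped mean answers the same question for every class in one pass.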

8.2 Extracting Insights and Patterns

Extracting meaningful insights and identifying patterns from datasets is the
pinnacle of the analytical journey. This section delves into the techniques and
methodologies that empower data analysts to uncover hidden information,
recognize trends, and derive actionable conclusions.

Feature Engineering: Feature engineering involves creating new features that
enhance the predictive power of a model. Let's engineer a new feature based on
passenger titles in the Titanic dataset:

# Extract titles from the 'Name' column
data['Title'] = data['Name'].str.extract(
    r' ([A-Za-z]+)\.', expand=False)

# Group titles and analyze survival rates
title_survival = data.groupby('Title')['Survived']\
    .mean().sort_values(ascending=False)
print(title_survival)

This code snippet demonstrates feature engineering by extracting titles from
passenger names and analyzing the survival rate for each title, offering insights into
the impact of social status on survival.

Anomaly Detection: Anomalies are data points that deviate significantly from the
norm. Let's use Z-score to detect anomalies in the 'Fare' column:

from scipy.stats import zscore

# Calculate Z-scores for Fare
data['Fare_ZScore'] = zscore(data['Fare'])

# Identify and analyze anomalies
anomalies = data[data['Fare_ZScore'].abs() > 3]
print(anomalies[['Name', 'Fare', 'Fare_ZScore']])

By calculating Z-scores and flagging data points whose absolute Z-score exceeds
a threshold of 3, this example demonstrates the detection of anomalies in the
'Fare' column, aiding in identifying unusual fare values.


Pattern Recognition: Pattern recognition involves identifying recurring patterns
within data. Let's use clustering to identify patterns in the 'Age' and 'Fare' columns:

from sklearn.cluster import KMeans

# Select relevant columns, dropping rows with missing values
features = data[['Age', 'Fare']].dropna()

# Perform K-means clustering
kmeans = KMeans(n_clusters=3, random_state=0, n_init=10).fit(features)

# Assign labels back to the rows that were clustered
data.loc[features.index, 'Cluster'] = kmeans.labels_

By employing K-means clustering, this example showcases pattern recognition
within the 'Age' and 'Fare' columns, grouping passengers based on age and fare
similarities.
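One caveat worth noting: K-means is distance-based, so features on very different scales (fares in the hundreds, ages in the tens) can let one feature dominate the clustering. Standardizing the features first is a common safeguard; the sketch below uses scikit-learn's `StandardScaler` with a small hypothetical feature matrix standing in for the Age and Fare columns:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical (age, fare) rows standing in for the Titanic columns
features = np.array([[22.0, 7.25], [38.0, 71.28],
                     [26.0, 7.93], [35.0, 53.10]])

# Rescale each column to zero mean and unit variance before clustering
scaled = StandardScaler().fit_transform(features)
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(scaled)
print(kmeans.labels_)
```

After scaling, both features contribute comparably to the Euclidean distances K-means minimizes, which usually yields more meaningful clusters.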

Insights from Patterns: Deriving insights from identified patterns is the
culmination of the analysis. Let's explore survival rates based on the age and fare
clusters:

# Analyze survival rates by cluster
cluster_survival = data.groupby('Cluster')['Survived']\
    .mean().sort_values(ascending=False)
print(cluster_survival)

This code snippet analyzes the survival rates of passengers within each cluster,
offering insights into the relationship between age, fare clusters, and survival
outcomes.

8.3 Presenting Findings with Visualizations

The art of effective data communication lies in presenting complex insights and
patterns in a clear and concise manner. Visualizations serve as powerful tools to
convey information, enabling data analysts to communicate findings, support
conclusions, and engage audiences. This section explores various visualization
techniques using Matplotlib and other libraries to create informative and visually
appealing plots, charts, and graphs.

Line Plot: Line plots are suitable for showing trends and variations over time.
Let's visualize the change in stock prices using a line plot:

import matplotlib.pyplot as plt

# Data: Date and Stock Prices
dates = ['2022-01-01', '2022-01-02', '2022-01-03', ...]
prices = [100, 105, 110, ...]

# Create a line plot
plt.plot(dates, prices, marker='o', linestyle='-', color='b')
plt.title('Stock Price Trend')
plt.xlabel('Date')
plt.ylabel('Price')
plt.xticks(rotation=45)
plt.grid(True)
plt.show()

This code snippet demonstrates the creation of a line plot to visualize the trend
in stock prices over time, enhancing the audience's understanding of price
fluctuations.

Bar Chart: Bar charts are effective for comparing values across categories. Let's
create a bar chart to display sales data for different products:

# Data: Products and Sales
products = ['Product A', 'Product B', 'Product C', ...]
sales = [500, 700, 300, ...]

# Create a bar chart
plt.bar(products, sales, color='g')
plt.title('Product Sales')
plt.xlabel('Product')
plt.ylabel('Sales')
plt.xticks(rotation=30)
plt.tight_layout()
plt.show()

This example illustrates the use of a bar chart to compare sales figures for
different products, providing a clear visualization of sales performance.

Histogram: Histograms help analyze the distribution of data. Let's visualize the
distribution of exam scores using a histogram:

# Data: Exam Scores
scores = [85, 92, 78, 60, 70, 88, ...]

# Create a histogram
plt.hist(scores, bins=10, color='orange', edgecolor='black')
plt.title('Exam Score Distribution')
plt.xlabel('Score Range')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

This code snippet demonstrates the creation of a histogram to depict the
distribution of exam scores, enabling insights into score concentration and
variability.

Pie Chart: Pie charts display the proportion of different categories in a dataset.
Let's visualize the market share of mobile operating systems:

# Data: Operating Systems and Market Share
os_names = ['Android', 'iOS', 'Others']
market_share = [75, 22, 3]

# Create a pie chart
plt.pie(market_share, labels=os_names, autopct='%1.1f%%',
colors=['blue', 'green', 'red'])
plt.title('Mobile OS Market Share')
plt.show()


This example showcases the creation of a pie chart to represent the market share
of different mobile operating systems, providing a visual depiction of their respective
proportions.

Scatter Plot: Scatter plots reveal relationships between two variables. Let's
visualize the correlation between study hours and exam scores:

# Data: Study Hours and Exam Scores
study_hours = [2, 3, 4, 5, 6, ...]
exam_scores = [60, 70, 75, 85, 90, ...]

# Create a scatter plot
plt.scatter(study_hours, exam_scores, color='purple', marker='o')
plt.title('Study Hours vs. Exam Scores')
plt.xlabel('Study Hours')
plt.ylabel('Exam Score')
plt.grid(True)
plt.show()

This code snippet demonstrates the creation of a scatter plot to visualize the
relationship between study hours and exam scores, facilitating an understanding of
their correlation.

Chapter 9: Large Datasets and Performance Optimization


As datasets continue to grow in size and complexity, mastering techniques for
efficient data manipulation and optimizing code performance becomes essential. We
will explore strategies to manage large volumes of data effectively, employ advanced
data manipulation techniques, and implement optimization strategies that enhance
the speed and efficiency of your Python applications.

9.1 Strategies for Handling Large Datasets

As the era of big data continues to unfold, the ability to effectively manage and
manipulate large datasets has become a crucial skill for data professionals. In this
section, we will explore strategies and techniques for handling large datasets in
Python, ensuring that your data analysis remains efficient, scalable, and
manageable.

Memory-efficient Data Structures: When dealing with large datasets, memory
consumption is a critical concern. Utilizing memory-efficient data structures like
NumPy arrays and Pandas DataFrames can significantly enhance your ability to
process substantial amounts of data without exhausting system resources.
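The difference is easy to measure. The sketch below compares the memory footprint of a plain Python list of integers with an equivalent NumPy array; the exact byte counts vary by platform, but the array is consistently far smaller because it stores raw 64-bit integers rather than pointers to boxed objects:

```python
import sys
import numpy as np

n = 100_000
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# The list stores pointers plus one boxed int object per element;
# the array stores a contiguous block of raw int64 values
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)
array_bytes = np_array.nbytes
print(list_bytes, array_bytes)
```

A similar check for DataFrames is `df.memory_usage(deep=True)`, which reports the per-column footprint.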

Data Streaming: Data streaming is a powerful technique that processes data
piece by piece, avoiding the need to load the entire dataset into memory. The
`pandas.read_csv` function supports streaming through the `chunksize` parameter,
enabling iterative processing of large CSV files.

import pandas as pd

# Reading CSV in chunks
chunk_size = 1000
for chunk in pd.read_csv('large_dataset.csv', chunksize=chunk_size):
    # Process each chunk
    process_chunk(chunk)

The code demonstrates how to read a large CSV file in chunks using Pandas'
`read_csv` function with the `chunksize` parameter. This enables the iterative
processing of each chunk of data, alleviating memory constraints.

Dask: Dask is a parallel computing library that seamlessly integrates with familiar
APIs like NumPy and Pandas. It enables you to work with larger-than-memory
datasets by breaking them into smaller computational units called "tasks" that can
be executed in parallel.

import dask.dataframe as dd

# Load and process a large CSV using Dask
df = dd.read_csv('large_dataset.csv')
result = df.groupby('category')['value'].sum().compute()

This code showcases Dask's ability to handle larger-than-memory datasets. It
loads a large CSV file using Dask's DataFrame, performs a grouped aggregation
lazily, and then computes the result efficiently.

Database Management: Databases provide efficient ways to manage and query
large datasets. Utilize database management systems (DBMS) such as SQLite,
MySQL, or PostgreSQL to store and manipulate data, benefiting from their query
optimization and indexing capabilities.

import sqlite3
import pandas as pd

# Create a SQLite database
conn = sqlite3.connect('large_data.db')

# Load an existing DataFrame (df) into the database
df.to_sql('table_name', conn, if_exists='replace', index=False)

# Query the database
result = pd.read_sql_query('SELECT * FROM table_name '
                           'WHERE condition', conn)

This code illustrates how to utilize SQLite to create a database, load data into it,
and perform SQL queries. Databases offer efficient storage and retrieval mechanisms
for handling large datasets.

Parallel Processing: Leveraging parallel processing techniques can expedite
computations on large datasets. Python's `concurrent.futures` module provides a
simple interface for parallel execution: `ThreadPoolExecutor` suits I/O-bound work,
while `ProcessPoolExecutor` distributes CPU-bound tasks across multiple cores
(plain threads cannot, because of the GIL).

import concurrent.futures

# Process data in parallel using ThreadPoolExecutor
with concurrent.futures.ThreadPoolExecutor() as executor:
    results = list(executor.map(process_data, large_data))

This code demonstrates how to use Python's `concurrent.futures` module to
process data in parallel using a `ThreadPoolExecutor`. It distributes tasks across
multiple worker threads, which speeds up I/O-bound processing of large datasets.

9.2 Efficient Data Processing Techniques

Efficiency is paramount when working with large datasets, especially in scenarios
where processing time directly impacts productivity and decision-making. In this
section, we delve into various efficient data processing techniques that allow you to
optimize your data manipulation workflows, ensuring that you can extract valuable
insights from extensive datasets in a timely manner.

Vectorized Operations with NumPy: NumPy's vectorized operations enable you
to perform computations on entire arrays without the need for explicit loops. This
approach leverages optimized C and Fortran libraries under the hood, significantly
boosting processing speed.

import numpy as np

# Performing vectorized operations
data = np.array([1, 2, 3, 4, 5])
result = data * 2

NumPy's vectorized operations eliminate the need for explicit loops, enhancing
computation speed. This code showcases how to multiply each element of an array
by 2 in a vectorized manner.

Efficient Aggregation with Pandas: Pandas provides powerful aggregation
functions that efficiently summarize data. By grouping data based on specific
criteria and applying aggregation functions, you can swiftly obtain insights from
large datasets.


import pandas as pd

# Efficient aggregation with Pandas
grouped_data = df.groupby('category')['value'].sum()

This code snippet demonstrates how to use Pandas' `groupby` and aggregation
functions to efficiently calculate the sum of values for each category. Aggregating
data in this way minimizes computation time.

Streaming Data Processing: For continuous data streams or very large files,
streaming data processing avoids loading the entire dataset into memory. Libraries
like `streamz` provide tools to work with streaming data efficiently.

from streamz import Stream

# Build a pipeline: each emitted item is transformed, then printed
source = Stream()
source.map(process_data).sink(print)

# Push items into the stream one at a time
for item in incoming_items:
    source.emit(item)

Here, the code sets up a data stream using the `streamz` library. The stream
processes data using the `map` function and outputs the results through the `sink`.
Streaming data processing ensures efficient handling of continuous or large
datasets.

Parallel Processing with Dask: Dask enables parallel and distributed computing
with a familiar API. By breaking tasks into smaller units, Dask efficiently utilizes
multicore processors or distributed clusters for faster data processing.

import dask.dataframe as dd

# Parallel processing with Dask
dask_df = dd.read_csv('large_dataset.csv')
result = dask_df.groupby('category')['value'].sum().compute()


This code showcases Dask's ability to perform parallel processing on a
DataFrame, leveraging its parallel execution capabilities to enhance data processing
speed.

Caching and Memoization: Caching and memoization involve storing
intermediate results to avoid redundant computations. The standard library's
`functools` module provides `lru_cache` for in-memory memoization, while caching
libraries like `joblib` can efficiently store and retrieve computed results on disk.

from functools import lru_cache

@lru_cache(maxsize=None)
def expensive_function(arg):
    # Placeholder for an expensive computation
    result = arg ** 2
    return result

The code demonstrates how to use the `functools` library to apply memoization,
caching the results of expensive computations. This approach avoids recalculating
results, enhancing processing efficiency.

9.3 Performance Optimization with NumPy and Pandas

Efficient data manipulation is crucial for working with large datasets. NumPy and
Pandas offer various techniques to optimize the performance of your data
processing tasks. In this section, we'll explore key strategies to enhance the speed
and efficiency of your code, enabling you to handle sizable datasets with ease.

Vectorized Operations with NumPy: NumPy's array-based computations are
inherently faster than traditional Python loops. By leveraging vectorized operations,
you can perform computations on entire arrays, eliminating the need for explicit
loops.

import numpy as np

# Vectorized operations with NumPy
data = np.array([1, 2, 3, 4, 5])
result = data * 2  # doubles every element: [2, 4, 6, 8, 10]

This code demonstrates the power of NumPy's vectorized operations. By
multiplying each element of an array by 2, we avoid the overhead of iterating
through elements individually.

Pandas' Built-in Optimizations: Pandas provides various optimizations under the
hood, such as efficient memory storage and parallel processing. Utilizing these
optimizations, you can handle large datasets without sacrificing performance.

import pandas as pd

# Pandas' memory-efficient data types
df = pd.read_csv('large_dataset.csv')
optimized_df = df.astype({'column_name': 'category'})

Here, Pandas' `astype` method is used to convert a column to a memory-efficient
data type, reducing memory usage and boosting performance.
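The saving can be verified with `memory_usage(deep=True)`; the low-cardinality `city` column below is an illustrative assumption:

```python
import pandas as pd

# A repetitive string column benefits most from the 'category' dtype
df = pd.DataFrame({'city': ['Lahore', 'Karachi', 'Lahore'] * 1000})
before = df['city'].memory_usage(deep=True)
after = df['city'].astype('category').memory_usage(deep=True)
print(before > after)  # the category version uses far fewer bytes
```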

Using the Apply Function Wisely: While Pandas' `apply` function is versatile, it
can be slow on large datasets. Utilize it for complex operations, but opt for
vectorized operations when possible to maximize performance.

# Using the apply function
def complex_function(row):
    # Placeholder computation; assumes a 'value' column exists
    result = row['value'] * 2
    return result

df['new_column'] = df.apply(complex_function, axis=1)

This code demonstrates using the `apply` function to perform a complex
operation on each row of a DataFrame. While useful, this approach might be slower
compared to vectorized operations.
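When the per-row logic is a simple condition, a vectorized alternative such as `numpy.where` can often replace `apply`; a brief sketch with an illustrative `value` column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': [50, 150, 250]})

# Vectorized conditional instead of df.apply(..., axis=1)
df['band'] = np.where(df['value'] > 100, 'high', 'low')
print(df['band'].tolist())  # → ['low', 'high', 'high']
```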

NumPy's Broadcasting: NumPy's broadcasting allows you to perform operations
on arrays of different shapes, efficiently expanding smaller arrays to match larger
ones.

import numpy as np

# Broadcasting with NumPy
array = np.array([[1, 2, 3], [4, 5, 6]])
result = array + np.array([10, 20, 30])
# result: [[11, 22, 33], [14, 25, 36]]

Here, NumPy's broadcasting enables element-wise addition between a 2D array
and a 1D array without explicit looping, optimizing the computation.

Filtering Data with NumPy and Pandas: Efficiently filtering data based on
conditions is crucial. NumPy's boolean indexing and Pandas' query function offer
optimized ways to filter data.

import numpy as np
import pandas as pd

# Filtering data with NumPy and Pandas
array = np.array([1, 2, 3, 4, 5])
filtered_array = array[array > 2]  # selects [3, 4, 5]

df = pd.read_csv('large_dataset.csv')
filtered_df = df.query('column_name > 100')

This code showcases how NumPy's boolean indexing and Pandas' query function
efficiently filter data based on conditions, improving performance.
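Multiple conditions can also be combined in a single boolean mask; a short sketch with illustrative values:

```python
import numpy as np

data = np.array([5, 12, 7, 20, 3])

# Combine conditions with & (and) / | (or); parentheses are required
mask = (data > 4) & (data < 15)
print(data[mask].tolist())  # → [5, 12, 7]
```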

Chapter 10: Data Manipulation Best Practices


As we near the culmination of our journey through the world of Python data
manipulation, it's time to delve into the realm of best practices. In this chapter, we
will explore a set of guidelines, techniques, and principles that will help you write
clean, efficient, and maintainable data manipulation code. Just as a skilled craftsman
carefully hones their tools and techniques, a proficient data practitioner must also
adopt a set of practices that ensure their code is not only functional but also robust
and scalable. From structuring your code for clarity to optimizing performance and
ensuring reliability, this chapter is designed to equip you with the skills needed to
elevate your data manipulation endeavors to new heights.

10.1 Writing Clean and Efficient Data Manipulation Code

Writing code that is not only functional but also clean and efficient is of
paramount importance. Clean code is more readable, easier to maintain, and less
prone to errors. Efficient code ensures that your data processing tasks are executed
swiftly, enabling you to analyze large datasets without unnecessary delays. In this
section, we will explore essential practices and techniques for crafting clean and
efficient data manipulation code in Python.

Meaningful Variable Names: Choosing descriptive variable names is crucial for
code readability. Aim for names that convey the purpose of the variable or data
structure.

# Poor variable naming
a = df['col'] + 5

# Improved variable naming
total_sales = sales_data['revenue'] + 5

The improved variable name "total_sales" provides clear context, enhancing the
code's readability and making its purpose evident.

Avoiding Magic Numbers: Avoid using magic numbers (unexplained constants) in
your code. Assign them to named variables with clear explanations.

# Using magic number
if len(data) > 1000:
    process_data(data)

# Improved with named variable
max_data_length = 1000
if len(data) > max_data_length:
    process_data(data)

By assigning the magic number to a named variable, such as "max_data_length,"
you enhance code readability and make its intent clearer.

Consistent Indentation and Formatting: Maintain consistent code indentation
and formatting throughout your script. Use spaces or tabs consistently to improve
readability.

# Inconsistent indentation
if condition:
  do_something()
        do_something_else()

# Improved with consistent indentation
if condition:
    do_something()
    do_something_else()

Consistent indentation enhances code structure and readability, making it easier
to understand and maintain.

Modularization: Break down complex data manipulation tasks into smaller,
modular functions. This promotes code reusability and allows you to focus on one
task at a time.

# Complex data manipulation
for index, row in df.iterrows():
    # Many lines of code
    ...

# Improved with modularization
def process_row(row):
    # Code to process a row
    ...

for index, row in df.iterrows():
    process_row(row)

Modularizing code improves readability, allows for easier debugging, and makes
code maintenance more manageable.

Efficient Looping: When working with Pandas, prefer vectorized operations over
explicit loops whenever possible. Vectorized operations are often faster and more
concise.

# Loop-based calculation
result = []
for value in df['column']:
    result.append(value * 2)

# Improved with vectorized operation
result = df['column'] * 2

Using vectorized operations enhances code performance and readability, as well
as reduces the chances of bugs in loop logic.

Documentation: Provide clear and concise comments to explain the purpose and
functionality of your code. Documenting complex sections or functions is particularly
important.

# Unclear code
def process_data(data):
    # ...
    if flag == 1:
        # Process data differently
        ...

# Improved with comments
def process_data(data):
    # ...
    if flag == 1:
        # Process data for special case
        ...

Documentation helps you and others understand the code's intent and
functionality, making it easier to maintain and collaborate on.

10.2 Using Pythonic Idioms and Best Practices

Pythonic idioms and best practices are the cornerstone of writing clean,
readable, and efficient Python code. These practices are rooted in the philosophy of
the Python programming language, emphasizing simplicity, readability, and the
utilization of built-in language features. In this section, we will see some essential
Pythonic idioms and best practices that contribute to the development of high-
quality data manipulation code.

List Comprehensions: List comprehensions provide a concise and Pythonic way
to create lists based on existing iterables. They replace traditional for loops when
constructing lists.

# Traditional for loop
squared_numbers = []
for num in numbers:
    squared_numbers.append(num ** 2)

# Using list comprehension
squared_numbers = [num ** 2 for num in numbers]

List comprehensions offer a more elegant and compact syntax for creating lists,
enhancing code readability and reducing the number of lines.

Context Managers with "with": Context managers, often used with the "with"
statement, facilitate resource management and exception handling. They ensure
that resources are properly acquired and released.

# Without context manager
file = open('data.txt', 'r')
content = file.read()
file.close()

# Using context manager
with open('data.txt', 'r') as file:
    content = file.read()

Context managers simplify resource management and ensure that resources are
properly cleaned up, even in the presence of exceptions.
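You can also define your own context manager with `contextlib`; the timing helper below is an illustrative sketch, not part of the original text:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    # Setup: record the start time
    start = time.perf_counter()
    try:
        yield
    finally:
        # Teardown: always runs, even if the body raises an exception
        elapsed = time.perf_counter() - start
        print(f"{label} took {elapsed:.4f}s")

with timed("summation"):
    total = sum(range(100_000))

print(total)  # → 4999950000
```

The `finally` clause plays the same role as `file.close()` above: the cleanup is guaranteed regardless of how the block exits.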

Generator Expressions: Generator expressions produce values lazily, which can
be more memory-efficient compared to creating entire lists. They are especially
useful for large datasets.

# List comprehension
squared_numbers = [num ** 2 for num in numbers]

# Generator expression
squared_generator = (num ** 2 for num in numbers)

Generator expressions generate values on-the-fly, avoiding memory overhead
and improving performance when dealing with large data.
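The memory difference is easy to observe with `sys.getsizeof` (exact byte counts vary across Python versions, so only the comparison is shown):

```python
import sys

numbers = range(1_000_000)
squared_list = [n ** 2 for n in numbers]   # materializes every value
squared_gen = (n ** 2 for n in numbers)    # stores only iterator state

print(sys.getsizeof(squared_list) > sys.getsizeof(squared_gen))  # → True
```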

Enumerate: The "enumerate" function simplifies iterating over an iterable while
keeping track of the index. This is particularly useful when needing both the value
and its index.

# Without enumerate
for i in range(len(names)):
    print(f"Name at index {i}: {names[i]}")

# Using enumerate
for i, name in enumerate(names):
    print(f"Name at index {i}: {name}")

Enumerate makes code more readable by eliminating the need to manually
manage loop counters.

PEP 8: Adhering to the PEP 8 style guide promotes code consistency and
readability. Consistent naming conventions, proper indentation, and clear formatting
enhance code quality.

# Inconsistent naming
MaxValue = max(numbers)
Total_sum = sum(numbers)

# Using PEP 8 naming conventions
max_value = max(numbers)
total_sum = sum(numbers)

Following PEP 8 guidelines ensures that your code is easily readable and
understandable by the Python community.

DRY Principle: The "Don't Repeat Yourself" (DRY) principle emphasizes code
reusability by avoiding duplicate code. Create functions and modules for repeated
logic.

# Repeated logic
result1 = (data1 - data1.mean()) / data1.std()
result2 = (data2 - data2.mean()) / data2.std()

# Improved with a function
def normalize(data):
    return (data - data.mean()) / data.std()

result1 = normalize(data1)
result2 = normalize(data2)

Adhering to the DRY principle reduces redundancy, enhances maintainability,
and simplifies code management.


10.3 Tips for Error Handling and Debugging

Error handling and debugging are integral skills for any programmer. When
working with data manipulation and analysis, it's crucial to effectively manage errors
and troubleshoot issues that may arise in your code. In this section, we will explore
various strategies and techniques for error handling and debugging in Python.

Exception Handling: Exception handling allows you to gracefully handle runtime
errors and prevent your program from crashing. The "try", "except", and "finally"
blocks are used to catch and manage exceptions.

try:
    result = 10 / 0
except ZeroDivisionError:
    print("Error: Division by zero")

Exception handling ensures that your program continues running even when
encountering errors, making it more robust.
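A fuller sketch showing the "try", "except", "else", and "finally" clauses working together (the `safe_divide` helper is illustrative):

```python
def safe_divide(a, b):
    try:
        result = a / b
    except ZeroDivisionError:
        # Runs only when the division fails
        print("Error: Division by zero")
        result = None
    else:
        # Runs only when no exception was raised
        print("Division succeeded")
    finally:
        # Always runs, on success or failure
        print("Done")
    return result

print(safe_divide(10, 2))  # → 5.0 (after "Division succeeded" and "Done")
print(safe_divide(10, 0))  # → None (after the error message and "Done")
```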

Logging: Logging is an essential tool for understanding the behavior of your code.
The "logging" module provides various levels of logging, helping you track the flow
and state of your program.

import logging

logging.basicConfig(level=logging.DEBUG)
logging.debug("Debugging message")

Logging allows you to collect valuable information during runtime, aiding in
identifying issues and understanding program flow.

Assertions: Assertions are used to check if a condition is met, providing an
effective way to catch logical errors in your code during development and testing.

def calculate_tax(income):
    assert income > 0, "Income must be positive"
    # Calculate tax logic

calculate_tax(-1000)  # raises AssertionError: Income must be positive

Assertions act as self-checks during development, highlighting potential issues
early in the development process.

Using IDEs and Debuggers: Integrated Development Environments (IDEs) like
PyCharm and Visual Studio Code offer powerful debugging features, including
breakpoints, variable inspection, and step-by-step execution. IDEs enhance your
debugging process by allowing you to visualize and understand the behavior of your
code during execution.

Print Statements: Print statements are a simple yet effective way to inspect
variable values and trace the execution flow of your code.

def calculate_interest(principal, rate, years):
    print("Calculating interest...")
    interest = principal * rate * years  # simple-interest placeholder
    print("Interest calculated:", interest)

calculate_interest(1000, 0.05, 3)

Print statements provide quick insights into variable values and the execution
sequence, helping you locate issues.

Error Messages and Stack Traces: When an error occurs, Python generates an
error message and a stack trace, indicating where the error occurred in your code.
Understanding error messages and stack traces helps pinpoint the root cause of
errors and facilitates effective troubleshooting.
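The `traceback` module lets you capture the same stack trace programmatically, which is handy for logging; a minimal sketch:

```python
import traceback

try:
    result = 10 / 0
except ZeroDivisionError:
    # Keep the formatted stack trace as a string instead of crashing
    trace_text = traceback.format_exc()
    print("ZeroDivisionError" in trace_text)  # → True
```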

Unit Testing: Writing unit tests using frameworks like "unittest" and "pytest" can
help catch errors early in development and ensure the correctness of your code.


import unittest

def divide(a, b):
    return a / b

class TestDivision(unittest.TestCase):
    def test_division(self):
        self.assertEqual(divide(10, 2), 5)

if __name__ == '__main__':
    unittest.main()

Unit tests provide a systematic way to validate the functionality of your code and
identify regressions.

Case Study: Book Library Analysis


You are the owner of a personal book library dataset (Download:
https://drive.google.com/open?id=16pTX5gosDxyqKxTEaDmlyUnaF34BJ6jT&authuser=mnkhokhar%40gmail.com&usp=drive_fs),
comprising three distinct CSV files: "Books," "Ratings," and "Users." Each file
contains vital information to facilitate an in-depth analysis of your collection.
The "Books" file encompasses fields such as ISBN, Book Title, Book Author,
Year of Publication, Publisher, and Image URLs in varying sizes for cover
images. The "Ratings" file includes User ID, ISBN, and Book Rating, providing
insights into how users perceive and rate the books in your collection. The
"Users" file captures User ID, Location, and Age, offering valuable
demographic details about the individuals interacting with your book library.

Your task involves conducting a comprehensive exploration of these
datasets, uncovering valuable insights and patterns that reveal intriguing
relationships between books, user ratings, and user demographics. Required
steps for this engaging endeavor are outlined below.

Exploring the Dataset: Commence by utilizing the Pandas library to load
the CSV files containing the comprehensive information about your personal
book library. Engage in an initial exploration of the dataset to grasp its size,
inspect data types, and gain insights into its preliminary values. This
foundational step sets the stage for your forthcoming analysis, providing a
clear understanding of the dataset's structure and content.

Data Cleansing and Preprocessing: Ensure the dataset's integrity by
addressing any instances of missing or erroneous data. Employ effective
techniques such as data imputation or removal to rectify gaps. Furthermore,
standardize diverse data formats, such as dates, to ensure homogeneity and
accuracy across the dataset. These preprocessing endeavors are pivotal in
rendering the dataset suitable for meaningful analysis.

Data Manipulation and Insight Generation: Leverage the power of
Python to delve into the dataset's depths, embarking on an array of data
manipulation and analysis tasks. Calculate the average rating attributed to
each author, and delve into the distribution of genres and publication years.
Employ Python's computational capabilities to extract valuable insights that
shed light on the attributes and trends within your book collection.

Unveiling Time-Driven Patterns: Initiate a comprehensive time series
analysis that unveils the temporal evolution of your book library. Employ
resampling techniques to create visually informative plots that visualize the
growth of your collection across different years. Through these plots,
observe how your library has expanded and evolved over the passage of
time.

Crafting Informative Visualizations: Harness the capabilities of Matplotlib
to design an assortment of visualizations that enrich your analysis. Construct
bar plots, where average ratings are juxtaposed with authors, scatter plots
that contrast book length against ratings, and histograms that depict the
distribution of ratings. Utilize these visual representations to gain a clearer
comprehension of the data.

Unearthing Complex Relationships: Venture into the realm of advanced
data manipulation by merging and joining datasets, enabling the exploration
of multifaceted relationships extending beyond individual attributes. Employ
pivot tables and melt operations to reshape the data, uncovering intricate
insights that may otherwise remain concealed.

Enhancing Visual Appeal: Elevate the aesthetic quality of your
visualizations by incorporating labels, titles, colors, and stylistic elements.
This enhancement ensures that your plots are not only informative but also
visually captivating, facilitating a more engaging presentation of your
findings.

Deriving Insights and Discerning Patterns: Delve into the data to discern
patterns and insights that underscore the popularity of specific genres,
authors, and other noteworthy attributes. Through meticulous analysis, gain
a deeper understanding of your book collection and its underlying dynamics.

Conveying Discoveries through Visuals: Craft a presentation-worthy
visualization that encapsulates your most significant findings. Employ this
visualization as a compelling tool to communicate your insights effectively,
allowing others to glean a comprehensive understanding of the key
takeaways from your analysis.

Adhering to Best Practices and Error Handling: Navigate the realm of
coding with precision by adhering to established best practices. Ensure that
your code is clear, efficient, and well-structured. Implement robust
error-handling techniques to anticipate and manage potential issues that may
arise during the analysis process.

Optimizing Performance: Delve into strategies for optimizing data
processing efficiency, particularly when dealing with larger datasets. Utilize
the capabilities of NumPy and Pandas to expedite data manipulation tasks
while maintaining optimal performance levels.

Pythonic Excellence and Rigorous Testing: Infuse your code with
Pythonic idioms, enhancing its readability and conciseness. Implement unit
tests to rigorously validate critical functions, thereby ensuring the accuracy
and reliability of your analysis outcomes.

Code

The following code is organized into modules that correspond to the steps
outlined in the problem statement. Each module involves loading, cleaning,
manipulating, and analyzing the dataset while utilizing Pandas, NumPy, and
Matplotlib libraries. Proper comments provide clarity and guidance throughout the
code, ensuring a comprehensive and effective analysis of the personal book library
dataset.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Step 1: Exploring the Dataset
# Load the CSV files into Pandas DataFrames
books_df = pd.read_csv('Books.csv')
ratings_df = pd.read_csv('Ratings.csv')
users_df = pd.read_csv('Users.csv')

# Explore the dataset's size, data types, and initial values
print("Books Dataset Info:")
print(books_df.info())
print("\nRatings Dataset Info:")
print(ratings_df.info())
print("\nUsers Dataset Info:")
print(users_df.info())

# Step 2: Data Cleansing and Preprocessing
# Handle missing or erroneous data
books_df.dropna(inplace=True)
ratings_df.dropna(inplace=True)
users_df.dropna(inplace=True)

# Standardize data formats
books_df['Year of Publication'] = pd.to_datetime(
    books_df['Year of Publication'], errors='coerce')

# Step 3: Data Manipulation and Insight Generation
# Ratings live in the Ratings file, so merge on ISBN before averaging
books_ratings_df = pd.merge(books_df, ratings_df, on='ISBN')
average_rating_by_author = (
    books_ratings_df.groupby('Book Author')['Book Rating'].mean())
genre_distribution = books_df['Genre'].value_counts()  # assumes a 'Genre' column

yearly_growth = books_df.set_index('Year of Publication')['ISBN']
yearly_count = yearly_growth.resample('Y').count()

# Step 4: Unveiling Time-Driven Patterns
plt.plot(yearly_count)
plt.title('Yearly Growth of Book Library')
plt.xlabel('Year')
plt.ylabel('Number of Books')
plt.show()


# Step 5: Crafting Informative Visualizations
plt.bar(average_rating_by_author.index, average_rating_by_author.values)
plt.title('Average Rating by Author')
plt.xlabel('Author')
plt.ylabel('Average Rating')
plt.xticks(rotation=90)
plt.show()

# Step 6: Unearthing Complex Relationships
merged_df = pd.merge(books_df, ratings_df, on='ISBN')
pivot_table = merged_df.pivot_table(index='Book Author', columns='Genre',
                                    values='Book Rating', aggfunc='mean')

# Step 7: Enhancing Visual Appeal
# Assumes 'Book Length' and 'Book Rating' columns are available after merging
plt.scatter(merged_df['Book Length'], merged_df['Book Rating'],
            c='blue', marker='o')
plt.title('Book Length vs. Ratings')
plt.xlabel('Book Length')
plt.ylabel('Book Rating')
plt.show()

# Step 8: Deriving Insights and Discerning Patterns
popular_genres = genre_distribution[:5]
print("Top 5 Popular Genres:", popular_genres)

# Step 9: Conveying Discoveries through Visuals
plt.pie(popular_genres, labels=popular_genres.index, autopct='%1.1f%%')
plt.title('Top 5 Popular Genres')
plt.show()

Step by Step Description

The following description provides a detailed walkthrough of the solution code
for the given problem statement. Each step is thoroughly explained, guiding the
reader through the process of loading, cleaning, analyzing, and visualizing the
personal book library dataset. By following these steps, readers can gain a
comprehensive understanding of how to effectively explore and extract valuable
insights from the dataset using Python and various data manipulation and
visualization techniques.

Exploring the Dataset: In this step, the necessary CSV files (Books.csv,
Ratings.csv, and Users.csv) are loaded into Pandas DataFrames. The `.read_csv()`
function is used to read the CSV files, and the `.info()` method provides information
about the datasets, including their sizes, data types, and non-null counts.

Data Cleansing and Preprocessing: In this step, missing values are handled by
using the `.dropna()` method, which removes rows with any missing values. The
`'Year of Publication'` column is standardized by converting it to a datetime format
using `pd.to_datetime()`, with `errors='coerce'` handling any errors by converting
them to NaN values.

Data Manipulation and Insight Generation: In this step, various insights are
generated from the dataset. The average rating for each book author is computed
using `.groupby()` and `.mean()` methods. The distribution of book genres is
calculated using `.value_counts()`. The yearly growth of the book library is obtained
by setting the `'Year of Publication'` column as the index and using `.resample()` to
count the number of books published each year.

Unveiling Time-Driven Patterns: This step involves creating a line plot using
Matplotlib to visualize the yearly growth of the book library. The `plt.plot()` function
is used to plot the resampled yearly book counts, and labels and a title are added
using `plt.xlabel()`, `plt.ylabel()`, and `plt.title()` functions. The resulting plot is
displayed using `plt.show()`.

Crafting Informative Visualizations: This step involves creating a bar plot using
Matplotlib to visualize the average rating by author. The `plt.bar()` function is used
to create the plot, and labels, a title, and rotation for x-axis labels are added using
`plt.xlabel()`, `plt.ylabel()`, `plt.title()`, and `plt.xticks()` functions. The resulting plot is
displayed using `plt.show()`.

Unearthing Complex Relationships: In this step, two DataFrames are merged
using the `.merge()` method based on the `'ISBN'` column. Then, a pivot table is
created using the `.pivot_table()` method to explore the relationship between book
authors, genres, and average book ratings.

Enhancing Visual Appeal: This step involves creating a scatter plot using
Matplotlib to visualize the relationship between book length and ratings. The
`plt.scatter()` function is used to create the plot, and labels and a title are added
using `plt.xlabel()`, `plt.ylabel()`, and `plt.title()` functions. The resulting plot is
displayed using `plt.show()`.

Deriving Insights and Discerning Patterns: In this step, the five most popular
book genres are extracted from the `genre_distribution` using slicing. The resulting
data is printed to the console.

Conveying Discoveries through Visuals: This step involves creating a pie chart
using Matplotlib to visualize the distribution of the top 5 popular book genres. The
`plt.pie()` function is used to create the chart, with slice labels supplied through the
`labels` argument; a title is added using `plt.title()` and the chart is displayed with
`plt.show()`.
