
Financial Econometrics with Python

Introduction to Python
Python Libraries

Kevyn Stefanelli
2023-2024

3. Python libraries
Python libraries are pre-written collections of code modules that provide a wide range of
functionalities to extend the capabilities of the Python programming language.

Libraries are created to solve specific problems or tasks, allowing developers to avoid
reinventing the wheel by utilizing existing well-tested code.

They offer ready-to-use functions, classes, and methods that streamline the development
process and enable the creation of complex applications with relative ease.

When working with libraries, you have to import the required modules into your code, gaining
access to their functions and classes.

3.1 The NumPy library


NumPy stands for "Numerical Python." It is a fundamental library for numerical and
mathematical operations in Python. NumPy provides support for large, multi-dimensional arrays
and matrices, along with an extensive set of mathematical functions to operate on these arrays
efficiently.

You can find more at:

https://numpy.org/doc/stable/user/absolute_beginners.html

3.1.1 Importing NumPy


When working with libraries like NumPy, it's a common practice to import the library and give it
a shorter alias, which can make code more concise and readable.

For example, we can assign the alias "np" to the NumPy library.

This allows you to access NumPy's functions and classes using the shorter "np" prefix instead of
the full "numpy" prefix:

# import numpy and assign it an alias
import numpy as np

Once we have imported numpy, we can use the functions contained within it by using the syntax:

• libraryName.function, e.g., numpy.array or


• alias.function, if you specified an alias for the library, e.g., np.array

For instance, "np.array" defines a NumPy vector:

arr = np.array([2, 3, 4, 5])


print(arr)

[2 3 4 5]

What is the difference between a list and a NumPy array?


A list in Python is a flexible sequence containing diverse data types, while a NumPy array
(ndarray) is a specialized data structure optimized for numerical operations with homogeneous
data types, offering better performance and efficiency.
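
A minimal sketch of this difference, assuming only built-in lists and np.array (the variable names are illustrative):

```python
import numpy as np

# A list is a generic container: multiplying it by 2 repeats the list
lst = [2, 3, 4, 5]
doubled_list = lst * 2

# A NumPy array is numerical: multiplying by 2 acts element by element
arr = np.array([2, 3, 4, 5])
doubled_arr = arr * 2

# An array holds a single data type: mixing integers and floats
# converts every element to float
mixed = np.array([1, 2.5, 3])

print(doubled_list)
print(doubled_arr)
print(mixed.dtype)
```

The same operator (*) therefore behaves differently on the two objects, which is one reason arrays are preferred for numerical work.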

3.1.2 Define a matrix


There are several ways to define a matrix (2-dimensional array) in Python:

• creating a multidimensional array using np.array()


• using the function np.matrix()
# using np.array
array = np.array([[1, 2, 3], [4, 5, 6]])
# the two dimensions are separated by a comma
print(array)

[[1 2 3]
[4 5 6]]

# use the array defined to fill a matrix


matrix = np.matrix(array)
print(matrix)

[[1 2 3]
[4 5 6]]

Note: In NumPy, when you create a multidimensional array, data are stored sequentially by
rows.
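
The row-by-row storage can be checked with a small sketch (the array a is illustrative and not part of the examples above):

```python
import numpy as np

# Fill a 2x3 array with the values 0..5: the first row is filled first
a = np.arange(6).reshape(2, 3)

# Flattening reads the elements back in the same row-by-row order
flat = a.ravel()

print(a)
print(flat)
print(a.flags['C_CONTIGUOUS'])  # flag reporting row-major (C-style) storage
```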

3.1.3 Math functions available in NumPy:


np.sum([1, 2, 3])
np.sqrt(4)
np.log(1)
np.max([1, 2])
np.min([1, 2])
np.mean([1, 2])
np.median([1, 2, 3])
np.nan # create a Not a Number (NaN) object
np.nansum([1, 2, np.nan])
np.exp(1)
np.cumprod([1, 2, 3, 4])
np.cumsum([1, 2, 3, 4])
np.abs(-10)

10

3.1.4 Operation on multidimensional array


# define a matrix
mat = np.array([[1, 2], [10, 2]])

# vertical mean (sums all rows for each column and takes the mean);
# axis=0 means axis='rows'
np.mean(mat, axis=0)

# horizontal mean (sums all columns for each row and takes the mean);
# axis=1 means axis='columns'
np.mean(mat, axis=1)

# if axis is not specified, np.mean computes the mean of all the matrix elements
np.mean(mat)

# The syntax is the same for many other operations


np.max(mat, axis=0)
np.min(mat, axis=1)
np.abs(np.array([-1, -2, -3]))

array([1, 2, 3])
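
To make the role of the axis argument concrete, here is a small sketch that stores the three means of the matrix used in this section (the names col_means, row_means and overall are illustrative):

```python
import numpy as np

mat = np.array([[1, 2], [10, 2]])

col_means = np.mean(mat, axis=0)   # one value per column: 5.5 and 2.0
row_means = np.mean(mat, axis=1)   # one value per row: 1.5 and 6.0
overall = np.mean(mat)             # a single number: 3.75

print(col_means)
print(row_means)
print(overall)
```

With axis=0 the rows are collapsed, so the result has one entry per column; with axis=1 the columns are collapsed, so the result has one entry per row.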

You can find all the mathematical functions contained in NumPy here:

https://numpy.org/doc/stable/reference/routines.math.html

3.1.5 Defining and working with particular arrays (vectors and matrices)

# create a range of numbers (1, 2, 3, ...)
np.arange(1, 10)
# equally spaced values (parameters are: start, stop, number of values)
np.linspace(1, 10, 5)
# identity matrix
np.eye(5) # define a 5x5 identity matrix
# vector or matrix of ones (1 param = vector, 2 param = matrix)
np.ones((5, 5))
# matrix of zeros (1 param = vector, 2 param = matrix)
np.zeros((5, 5))

# Operations
# trace of a matrix
np.trace(np.ones((5, 5)))
# retrieve the diagonal of a matrix
np.diag([[1, 2], [1, 2]])

# matrix multiplication
np.dot(np.array([[1, 2], [1, 2]]), np.array([[3, 3], [3, 3]]))

# appends an element to an array (similar to .append() for lists)


np.append([1, 2, 3], 100)

# and also:
# np.gradient() # gradient
# np.kron() # Kronecker product
# np.outer() # outer product

array([ 1, 2, 3, 100])
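
Matrix multiplication should not be confused with element-wise multiplication. A short sketch, using the same two matrices as in the np.dot example (the names A and B are illustrative):

```python
import numpy as np

A = np.array([[1, 2], [1, 2]])
B = np.array([[3, 3], [3, 3]])

matrix_product = A @ B   # matrix multiplication, same result as np.dot(A, B)
elementwise = A * B      # multiplies matching entries only

print(matrix_product)
print(elementwise)
```

The @ operator is a shorthand for the matrix product of two arrays, while * always works entry by entry.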

3.1.6 Other useful functions:


# return the shape of an array
np.array([[1, 2, 3], [1, 2, 3]]).shape

(2, 3)

# transpose the matrix


np.array([[1, 2, 3], [1, 2, 3]]).transpose()
# it can also be written as np.transpose(np.array(...))

array([[1, 1],
[2, 2],
[3, 3]])

# sorts elements in ascending order


np.sort([1, 2, 10, 1, 100])

# to sort in descending order add [::-1] at the end


# np.sort([1, 2, 10, 1, 100])[::-1]

array([ 1, 1, 2, 10, 100])

# ceiling and flooring


np.ceil(1.2) # rounds up to the nearest integer
np.floor(1.9) # rounds down to the nearest integer
1.0

# get the quantile of the distribution


np.quantile(a=np.array([1, 2, 3, 4, 5]), q=0.1)

1.4

## Boolean
# checks if all elements are True
np.all([True, True, False])
# checks if at least 1 element in the iterable is True
np.any([True, True, False])

True

3.1.7 Learning by doing: matrices


import numpy as np

# Define w and E
w = np.array([1, -1, 7, 2, 0])
E = np.array([[1, 0, 4, 5],
[2, 1, -1, 0],
[-1, -2, 3, -1],
[2, 0, -3, 0]])

# Print the element of w in position 5


print(w[4])  # Python uses 0-based indexing, so index 4 is the 5th element

# Save the second element of w in a variable called "d"


d = w[1]
print(d)

# Print the element of E in row 2 and column 1


print(E[1, 0]) # Again, Python uses 0-based indexing

# Print the element of E in row 1 and column 2


print(E[0, 1])

# Print the entire second column of E


print(E[:, 1])

# Print the entire first row of E


print(E[0, :])

# Select only the first two rows of E


print(E[0:2, :])

# Select from the second to the fourth elements of the third row of E
print(E[2, 1:4])
# Select only the elements in the first two rows and in the third and fourth columns of E
print(E[0:2, [2, 3]])

# Define G as a new matrix equal to E without the third column


G = np.delete(E, 2, axis=1)
print(G)

# Define H by removing the rows 1 and 4 from E


H = np.delete(E, [0, 3], axis=0)
print(H)

0
-1
2
0
[ 0 1 -2 0]
[1 0 4 5]
[[ 1 0 4 5]
[ 2 1 -1 0]]
[-2 3 -1]
[[ 4 5]
[-1 0]]
[[ 1 0 5]
[ 2 1 0]
[-1 -2 -1]
[ 2 0 0]]
[[ 2 1 -1 0]
[-1 -2 3 -1]]

3.1.8 Random Numbers


We use the np.random module, a submodule of NumPy, to generate random numbers from
various statistical distributions.

So, we have to import this module from NumPy as follows:

from numpy import random

Now, we can use the functions contained in the random module:

# generate a random integer number between 0 and 99 (the upper bound 100 is excluded)


x = random.randint(100)
print(x)

87

To generate multiple random numbers, you can use a loop:


for k in "banana":
    print(random.randint(100))

# which is not an efficient way

2
5
91
62
46
61

Alternatively, you can specify the number of random numbers you want using the size
parameter:

# vector of length 6 containing integer numbers between 0 and 99


x = random.randint(100, size=(6))
print(x)

[ 4 78 75 60 99 42]

# a 3x5 matrix containing integer numbers between 0 and 99


M = random.randint(100, size=(3, 5))
print(M)

[[67 59 19 25 91]
[43 79 70 87 66]
[91 29 6 59 28]]

To generate floating-point numbers, you can use the random.rand() function:

# generate 10 random numbers between 0 and 1


y = random.rand(10)
print(y)

[0.09041575 0.60218472 0.58680052 0.41114351 0.90086462 0.55540265


0.17285796 0.48200608 0.36962709 0.52841594]

# generate a 3x5 random matrix containing numbers between 0 and 1


M1 = random.rand(3,5)
print(M1)

[[0.38898298 0.70417632 0.76043501 0.0464563 0.53854603]


[0.00468154 0.31321421 0.84863038 0.04080821 0.74600119]
[0.27231848 0.33946231 0.2468026 0.76348799 0.17400441]]

3.1.9 Generating random numbers from probability distributions in NumPy:

Uniform Distribution:
To generate random numbers from a uniform distribution you can use random.uniform():

# generate a 3x2 matrix containing numbers between 26 and 52


MUnif = random.uniform(26, 52, size=(3, 2))
print(MUnif)

[[43.26410355 28.58972112]
[37.67737838 26.67180781]
[38.24220209 47.23658844]]

Normal Distribution:
To generate random numbers from a normal distribution, you can use random.normal().

# generate random numbers from a Standard Normal distribution


Z1 = random.normal(size=(2, 3))
print(Z1)

[[-0.62592445 0.73833493 -0.06126694]


[-1.16075312 -0.501207 1.21091304]]

# generate random numbers from a Normal distribution


# (Mean 100, StD 4, i.e., Variance 16): the second argument is the standard deviation
Z2 = random.normal(100, 4, size=(2, 3))
print(Z2)

[[ 97.07701323 107.35684864 93.09197125]


[104.01988233 99.40439815 101.22412243]]
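
Note that the second argument of random.normal() is the standard deviation, not the variance. A minimal check on a large sample, assuming we want a variance of 4 (the names variance and sample are illustrative):

```python
import numpy as np

np.random.seed(123)

variance = 4
# normal() expects the standard deviation, so pass the square root of the variance
sample = np.random.normal(100, np.sqrt(variance), size=100_000)

print(round(sample.mean(), 2))  # close to the mean, 100
print(round(sample.std(), 2))   # close to the standard deviation, 2
```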

3.1.10 Setting a seed when using random numbers


Setting the random seed is important when working with random number generators, especially
for tasks that involve randomization, simulations, or experiments.

You can set a random seed using the command:

numpy.random.seed()

and specifying an integer value (e.g., 123) as its argument.

Here's why it's important:

• Reproducibility: setting the random seed ensures that you get the same sequence
of random numbers every time you run your code. It allows you and others to
recreate the same results.
• Comparability: you might want to compare the results of different algorithms or
models using randomization. Setting the random seed guarantees that you are
comparing the same set of random numbers, making your comparisons more
meaningful.

• Experiments and Simulations: In scientific experiments, simulations, or statistical


analysis, you might need to generate random data or perform randomized
experiments. Setting the random seed ensures that your results are consistent
across different runs, allowing you to draw accurate conclusions.

• In our case: it helps us obtain the same series of random numbers. Everyone in the
class will have the same sequence.
VERY IMPORTANT: Remember to specify the seed before each random generation.

# Set the random seed


np.random.seed(42)

# Generate random numbers


random_numbers = np.random.rand(5)
print(random_numbers)

[0.37454012 0.95071431 0.73199394 0.59865848 0.15601864]
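
The effect of the seed can be verified directly. In this small sketch (the names first, second and third are illustrative), resetting the seed reproduces the same numbers, while drawing again without resetting it continues the sequence:

```python
import numpy as np

# Same seed, same sequence
np.random.seed(42)
first = np.random.rand(3)

np.random.seed(42)
second = np.random.rand(3)

print(np.array_equal(first, second))

# Without resetting the seed, the next draw continues the sequence
third = np.random.rand(3)
print(np.array_equal(first, third))
```

NumPy also offers np.random.default_rng(seed), which creates a separate generator object with its own seed; these notes keep using np.random.seed() for simplicity.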

3.1.11 Numpy Exercises


3.1 Calculate $\exp(x + 4)$ and evaluate it for $x \in \{-4, 0, 4\}$.

3.2 The following equation represents the probability density function (pdf) of a Normal
distribution with mean μ and standard deviation σ:

$$ f(x, \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \cdot e^{-\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2} $$

Write a function called 'fNormal' that calculates this pdf and evaluate it for x = 8, μ = -3, and σ = 1.5.

Note: in NumPy, π ≈ 3.14159 is available as np.pi
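
One possible sketch of such a function is given below. It is only one way to write it; the intermediate variable z is an illustrative choice, and the last line is a sanity check against the known value of the standard Normal pdf at 0, which is 1/√(2π).

```python
import numpy as np

def fNormal(x, mu, sigma):
    """Normal pdf with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma                      # standardized value
    return np.exp(-0.5 * z**2) / (sigma * np.sqrt(2 * np.pi))

# Evaluate the pdf at x = 8 with mu = -3 and sigma = 1.5
print(fNormal(8, -3, 1.5))

# Sanity check: standard Normal pdf at 0
print(fNormal(0, 0, 1))
```

Because the function uses only NumPy operations, x can also be a NumPy array, in which case the pdf is evaluated at every element.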

3.3 Given m = 2 and s = 1, create x as a vector of the integers in the interval [1, 100) and
evaluate the fNormal function defined in the previous step, with μ = m and σ = s, for each point of the vector x.

3.4 Focus on the built-in np.std() function.

By default, the np.std() built-in function computes the population standard deviation (SD), the biased
estimate of the SD.

What is the difference from the sample standard deviation?

• Population Variance:
$$ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2 $$

• Sample Variance:
$$ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 $$

• Population Standard Deviation:
$$ \sigma = \sqrt{\sigma^2} $$

• Sample Standard Deviation:
$$ s = \sqrt{s^2} $$

Compute the unbiased SD (Sample Standard Deviation) of the vector ω:

$$ \omega = (1, -1, 5, 6, 1, -6, 8, 9, 1, 3) $$

Then, compare the output with the Population SD (σ) provided by Python.

Alternative A:

import numpy as np

# Define the vector w


w = np.array([1, -1, 5, 6, 1, -6, 8, 9, 1, 3])

# Compute the arithmetic mean of w


m = np.mean(w)

# Define n as the length of w


n = len(w)

# Compute the sample standard deviation


unbiased_sd = np.sqrt(np.sum((w - m)**2) / (n-1))

# Compare the results with the standard deviation calculated using np.std()
print("Sample SD (s):", round(unbiased_sd,3))
print("Population SD:", round(np.std(w),3))

Sample SD (s): 4.498


Population SD: 4.267

Alternative B: Adjust the Results of the Built-in Function

The two formulas differ only in the denominator.

Therefore, we can transition from one formula to the other by making the following adjustment:

$$ \sigma = \frac{\sqrt{N-1}}{\sqrt{N}} \cdot s $$

and then:

$$ s = \frac{\sqrt{N}}{\sqrt{N-1}} \cdot \sigma $$

import numpy as np

# Define the vector w


w = np.array([1, -1, 5, 6, 1, -6, 8, 9, 1, 3])

# Define N as the length of w


N = len(w)

# Compute the sample standard deviation


sd_pop2 = np.std(w) * (np.sqrt(N) / np.sqrt(N-1))

# Compare the results with the standard deviation calculated using np.std()
print("Sample SD (s):", round(sd_pop2,3))
print("Population SD (sigma):", round(np.std(w),3))

Sample SD (s): 4.498


Population SD (sigma): 4.267
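
A third route, shown here as a sketch, uses the ddof ("delta degrees of freedom") argument of np.std(): the divisor is N - ddof, so ddof=1 gives the sample SD directly (the names sample_sd and population_sd are illustrative).

```python
import numpy as np

w = np.array([1, -1, 5, 6, 1, -6, 8, 9, 1, 3])

sample_sd = np.std(w, ddof=1)   # divides by N - 1
population_sd = np.std(w)       # default ddof=0 divides by N

print("Sample SD (s):", round(sample_sd, 3))
print("Population SD (sigma):", round(population_sd, 3))
```

The two printed values match those obtained with Alternatives A and B.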

3.5 Matrices

Given two matrices A and B:

$$ A = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 1 & 2 \\ 1 & -1 & 1 \end{pmatrix}, \qquad B = \begin{pmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 9 \end{pmatrix} $$

compute:

1. $A + A$
2. $A \cdot B$
3. $\det(B)$
4. $A^{-1}$
5. $A'$

3.6 Boolean

Given matrix A:

$$ A = \begin{pmatrix} 1 & 0 & 4 & 5 \\ -1 & -3 & 4 & 3 \\ 2 & 1 & -6 & 2 \\ 0 & 0 & 2 & 4 \end{pmatrix} $$

1. Is the element in position [1,1] greater than that in position [4,1]?


2. Is the element in position [3,2] different from that in position [2,2]?
3. Is the element in position [1,2] equal to that in position [4,2] and that in position [4,3]?
4. Is the element in position [3,1] lower than that in position [3,3] or than that in position
[4,2]?
5. Is the sum of the first row of A lower than that of the second column of A?
6. Is the minimum of the last column of A greater than or equal to the maximum of the last
row of A?

3.1.12 Take Home Exercise


Vectors
1. Write a function that takes as input two vectors x and y and returns their difference (x −
y).
2. Write a function that takes a vector as input and returns the product of its elements (hint:
np.prod(v) computes the product of the elements of v). (*)
3. Write a function that takes a number as input and returns its cube root (remember:
$\sqrt[3]{x} = x^{1/3}$).
4. Write a function that takes a number x as input and returns y defined as follows:
y = x + 2x − 3 log(x) − 2
5. Write a function that given the (biased) population standard deviation and the sample
size n as input returns the (unbiased) sample standard deviation (as we have seen in
class).
6. Define two vectors u and g both of length 10, where:
• u is the result of a draw from a Uniform distribution in the interval [0, 10),
• g is the result of a draw from a Normal distribution with mean 5 and standard deviation
equal to 0.5.
Then, use the function defined in the first exercise to compute the
difference between u and g. Set the random number generator seed equal to "123456".

Matrices

Given A, B and c:

$$ A = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 1 & 2 \\ 1 & -1 & 1 \end{pmatrix}, \qquad B = \begin{pmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 9 \end{pmatrix}, \qquad c = 3, $$

compute:

1. $A \cdot B$
2. $\det(A \cdot B)$
3. $A \cdot A^{-1}$
4. $A \cdot B \cdot c$
5. $A \cdot B'$
6. $\frac{A - B}{c}$
7. $\mathrm{diag}(B) \cdot c$
8. $A[2,3] \cdot B[3,3]$

3.2 The Pandas Library
Pandas is a powerful Python library designed for data manipulation and analysis. It provides data
structures like DataFrame and Series that allow you to easily work with structured and tabular
data.

With its comprehensive set of functions, Pandas enables data cleaning, transformation,
merging, and exploration, making it an essential tool for data professionals and analysts.

You can find any further information about the library here:

https://pandas.pydata.org/docs/user_guide/index.html

We import pandas as we did before with NumPy:

import pandas as pd

Let's define a dictionary; we will use it to construct our first DataFrame.

Dict = {
'Car': ["BMW", "Volvo", "Ford"],
'Cost x1000': [70, 47, 33]
}
print(Dict)

{'Car': ['BMW', 'Volvo', 'Ford'], 'Cost x1000': [70, 47, 33]}

Now, we transform this Dictionary into a DataFrame as follows:

Data = pd.DataFrame(Dict)
print(Data)

Car Cost x1000


0 BMW 70
1 Volvo 47
2 Ford 33

In Pandas, vectors (or columns) are called "Series".

# define a list containing numbers


a = [1, 7, 2]
# update the class of a, transforming it into a Pandas Series
myvar = pd.Series(a)
print(myvar)

0 1
1 7
2 2
dtype: int64
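
By default a Series is indexed 0, 1, 2, and so on, as in the output above. As a small sketch, a Series can also carry custom labels; here the car data from the earlier dictionary is reused, and the name costs is illustrative:

```python
import pandas as pd

# A Series with custom labels instead of the default 0, 1, 2 index
costs = pd.Series([70, 47, 33], index=["BMW", "Volvo", "Ford"])

print(costs["Volvo"])   # access by label
print(costs.iloc[0])    # access by position
```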
3.2.1 Working with Pandas objects
Deleting Rows and Columns from the Dataset
# Remove the first row and modify the original DataFrame
Data.drop([0], inplace=True)
print(Data)

Car Cost x1000


1 Volvo 47
2 Ford 33

# Remove the column named "Car" (this returns a new DataFrame; Data itself is unchanged)


Data.drop("Car",axis=1)

Cost x1000
1 47
2 33
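
Without inplace=True, drop() does not modify the original DataFrame. The following sketch rebuilds the small car dataset to show the difference (the name no_car is illustrative):

```python
import pandas as pd

Data = pd.DataFrame({"Car": ["BMW", "Volvo", "Ford"],
                     "Cost x1000": [70, 47, 33]})

# drop() returns a new DataFrame; Data itself still has both columns
no_car = Data.drop("Car", axis=1)
print(list(no_car.columns))
print(list(Data.columns))

# To keep the change, reassign the result (or pass inplace=True)
Data = Data.drop("Car", axis=1)
print(list(Data.columns))
```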

# define a new dataset starting from a dictionary


data = {
"Age": [42,56,40,33,67],
"Masters": [True, False, False, True, False],
"City": ["New York", "Los Angeles", "Chicago", "Miami", "Houston"]
}

# transform data into a Pandas DataFrame


df = pd.DataFrame(data)

Sort a DataFrame according to a variable


sorted_df = df.sort_values(by="Age")
print(sorted_df)

Age Masters City


3 33 True Miami
2 40 False Chicago
0 42 True New York
1 56 False Los Angeles
4 67 False Houston
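
Sorting in descending order works in the same way. A short sketch on the same data, using the ascending argument of sort_values() (the name sorted_desc is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [42, 56, 40, 33, 67],
    "Masters": [True, False, False, True, False],
    "City": ["New York", "Los Angeles", "Chicago", "Miami", "Houston"],
})

# Oldest first; reset_index(drop=True) renumbers the rows from 0
sorted_desc = df.sort_values(by="Age", ascending=False).reset_index(drop=True)
print(sorted_desc)
```

Note that sort_values() returns a new DataFrame and keeps the original row labels unless the index is reset.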

Learning by doing
We can create a dataframe starting from a list of vectors.

We select a class of 20 people coming from different countries of the world. For each one of
them we collect names, ages, heights, nationalities, gender, and the final grades in the math
exam.

import pandas as pd
# Define vectors
names = ["Andrew", "Anna", "Alice", "Antony",
"Barbara", "Brian", "Boris", "Barney",
"Claudia", "Cliff", "Cecilia", "Clara",
"David", "Dora", "Denise", "Donatello",
"Emma", "Elise", "Esteban", "Elon"]

ages = [20, 22, 27, 25, 18, 22, 26, 21, 19, 24,
27, 23, 22, 19, 23, 28, 22, 24, 25, 19]

heights = [180, 170, 155, 175, 150, 197, 178, 182, 183, 170,
175, 178, 170, 160, 175, 194, 180, 165, 172, 183]

nationalities = ["France", "Scotland", "Italy", "Poland",


"France", "India", "UK", "Poland",
"Italy", "Scotland", "UK", "France",
"Mexico", "USA", "France", "Germany",
"USA", "France", "Spain", "Poland"]

gender = [0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0]

grades = [16, 18, 19, 18, 15, 14, 15, 18, 17, 20,
20, 19, 15, 16, 18, 14, 20, 15, 19, 17]

# Create a dictionary
class_dic = {
"names": names,
"ages": ages,
"heights": heights,
"nationalities": nationalities,
"gender": gender,
"grades": grades
}

class_df = pd.DataFrame(class_dic)

# Display the DataFrame


class_df

names ages heights nationalities gender grades


0 Andrew 20 180 France 0 16
1 Anna 22 170 Scotland 1 18
2 Alice 27 155 Italy 1 19
3 Antony 25 175 Poland 0 18
4 Barbara 18 150 France 1 15
5 Brian 22 197 India 0 14
6 Boris 26 178 UK 0 15
7 Barney 21 182 Poland 0 18
8 Claudia 19 183 Italy 1 17
9 Cliff 24 170 Scotland 0 20
10 Cecilia 27 175 UK 1 20
11 Clara 23 178 France 1 19
12 David 22 170 Mexico 0 15
13 Dora 19 160 USA 1 16
14 Denise 23 175 France 1 18
15 Donatello 28 194 Germany 0 14
16 Emma 22 180 USA 1 20
17 Elise 24 165 France 1 15
18 Esteban 25 172 Spain 0 19
19 Elon 19 183 Poland 0 17

import pandas as pd

# Print the names of the units


names_column = class_df['names']
print(names_column)

# Return the type of a variable (e.g., nationalities)


variable_type = class_df['nationalities'].dtype
print("Variable Type:", variable_type)

# Print the first five rows of a DataFrame


print("First Five Rows:\n", class_df.head())

# Print the number of rows and columns


num_rows, num_columns = class_df.shape
print("Number of Rows:", num_rows)
print("Number of Columns:", num_columns)

# Print the names of the DataFrame columns


column_names = class_df.columns.tolist()
print("Column Names:", column_names)

# Returns the number of rows in the DataFrame


num_rows = len(class_df)
print("Number of Rows:", num_rows)

# Returns the number of columns in the DataFrame


num_columns = len(class_df.columns)
print("Number of Columns:", num_columns)

# Adding a new variable 'HStudy' to the DataFrame


HStudy = [34, 36, 39, 37, 31, 30, 32, 37, 35, 37, 39, 38, 31, 34, 37,
31, 40, 32, 39, 35]
class_df['HStudy'] = HStudy

0 Andrew
1 Anna
2 Alice
3 Antony
4 Barbara
5 Brian
6 Boris
7 Barney
8 Claudia
9 Cliff
10 Cecilia
11 Clara
12 David
13 Dora
14 Denise
15 Donatello
16 Emma
17 Elise
18 Esteban
19 Elon
Name: names, dtype: object
Variable Type: object
First Five Rows:
names ages heights nationalities gender grades
0 Andrew 20 180 France 0 16
1 Anna 22 170 Scotland 1 18
2 Alice 27 155 Italy 1 19
3 Antony 25 175 Poland 0 18
4 Barbara 18 150 France 1 15
Number of Rows: 20
Number of Columns: 6
Column Names: ['names', 'ages', 'heights', 'nationalities', 'gender',
'grades']
Number of Rows: 20
Number of Columns: 6

Note: the new variable must have the same number of elements as the other variables in the
dataset.
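
A minimal sketch of what happens otherwise, on a small illustrative DataFrame (small_df and error_message are hypothetical names): assigning a list of the wrong length raises a ValueError and the column is not added.

```python
import pandas as pd

small_df = pd.DataFrame({"names": ["Andrew", "Anna", "Alice"]})

try:
    # Two values for three rows: pandas refuses the assignment
    small_df["HStudy"] = [34, 36]
except ValueError as err:
    error_message = str(err)
    print("ValueError:", error_message)

print(list(small_df.columns))
```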

We can also add new variables as transformation of pre-existing variables in the dataset.

For example, we can add a new variable called "MinStudy" which expresses the variable HStudy
(currently measured in hours) in minutes. In formula:
$$ MinStudy = HStudy \cdot 60 $$

# Adding a new variable 'MinStudy' to the DataFrame by multiplying 'HStudy' by 60
class_df['MinStudy'] = class_df['HStudy'] * 60
class_df

names ages heights nationalities gender grades HStudy MinStudy
0 Andrew 20 180 France 0 16 34 2040
1 Anna 22 170 Scotland 1 18 36 2160
2 Alice 27 155 Italy 1 19 39 2340
3 Antony 25 175 Poland 0 18 37 2220
4 Barbara 18 150 France 1 15 31 1860
5 Brian 22 197 India 0 14 30 1800
6 Boris 26 178 UK 0 15 32 1920
7 Barney 21 182 Poland 0 18 37 2220
8 Claudia 19 183 Italy 1 17 35 2100
9 Cliff 24 170 Scotland 0 20 37 2220
10 Cecilia 27 175 UK 1 20 39 2340
11 Clara 23 178 France 1 19 38 2280
12 David 22 170 Mexico 0 15 31 1860
13 Dora 19 160 USA 1 16 34 2040
14 Denise 23 175 France 1 18 37 2220
15 Donatello 28 194 Germany 0 14 31 1860
16 Emma 22 180 USA 1 20 40 2400
17 Elise 24 165 France 1 15 32 1920
18 Esteban 25 172 Spain 0 19 39 2340
19 Elon 19 183 Poland 0 17 35 2100

# 1. How many students are there in the class?


num_students = len(class_df)
print("Number of students:", num_students)

# 2. The arithmetic mean of the variable "ages".


mean_age = class_df['ages'].mean()
print("Mean age:", mean_age)

# 3. The median grades.


median_grade = class_df['grades'].median()
print("Median grade:", median_grade)

# 4. The highest and the lowest grades.


highest_grade = class_df['grades'].max()
lowest_grade = class_df['grades'].min()
print("Highest grade:", highest_grade)
print("Lowest grade:", lowest_grade)

# 5. The absolute frequencies for the variable "nationalities".


nationality_frequencies = class_df['nationalities'].value_counts()
print("Nationality frequencies:")
print(nationality_frequencies)

# 6. Define two subsamples separating females and males.


female_subsample = class_df[class_df['gender'] == 1]
male_subsample = class_df[class_df['gender'] == 0]

# Calculate the mean grades for each group


mean_grade_female = female_subsample['grades'].mean()
mean_grade_male = male_subsample['grades'].mean()

print("The average grade in the Female subsample is:", mean_grade_female)
print("The average grade in the Male subsample is:",mean_grade_male)

Number of students: 20
Mean age: 22.8
Median grade: 17.5
Highest grade: 20
Lowest grade: 14
Nationality frequencies:
France 5
Poland 3
Scotland 2
Italy 2
UK 2
USA 2
India 1
Mexico 1
Germany 1
Spain 1
Name: nationalities, dtype: int64
The average grade in the Female subsample is: 17.7
The average grade in the Male subsample is: 16.6
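
Instead of building one subsample per group, the same kind of group means can be obtained in one step with groupby(). A sketch on a small illustrative subset (the first four students of the class; the names mini_df and mean_by_gender are hypothetical):

```python
import pandas as pd

# A small illustrative subset of the class data
mini_df = pd.DataFrame({
    "gender": [0, 1, 1, 0],
    "grades": [16, 18, 19, 18],
})

# Mean grade for each value of 'gender', computed in one step
mean_by_gender = mini_df.groupby("gender")["grades"].mean()
print(mean_by_gender)
```

The result is a Series indexed by the group label, so mean_by_gender[1] is the mean grade of the group coded as 1.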

Note: Sometimes we need matrices and not dataframes (e.g., matrix calculus).

To transform a dataframe (or a subset of it) into a matrix:


import numpy as np

# 7. Extract the last three columns and convert to a NumPy array


grades_matrix = class_df.iloc[:, -3:].to_numpy()

# Convert the NumPy array to a matrix


Grades = np.matrix(grades_matrix)

# Display the matrix


print("Grades matrix:", Grades)

Grades matrix: [[ 16 34 2040]


[ 18 36 2160]
[ 19 39 2340]
[ 18 37 2220]
[ 15 31 1860]
[ 14 30 1800]
[ 15 32 1920]
[ 18 37 2220]
[ 17 35 2100]
[ 20 37 2220]
[ 20 39 2340]
[ 19 38 2280]
[ 15 31 1860]
[ 16 34 2040]
[ 18 37 2220]
[ 14 31 1860]
[ 20 40 2400]
[ 15 32 1920]
[ 19 39 2340]
[ 17 35 2100]]

3.2.2 Import/Export DataFrames


Pandas provides methods to import and export data from various file formats, including CSV and
Excel.

Import a .csv/.xlsx file


# .csv file
df = pd.read_csv('data.csv') # where 'data.csv' is the file name
# .xlsx (Excel) file
df = pd.read_excel('data.xlsx') # where 'data.xlsx' is the file name

Export to a .csv/.xlsx file


# to .csv file
df.to_csv("filename.csv") # where "filename" is the output file name

# to .xlsx file
df.to_excel("filename.xlsx") # where "filename" is the output file name
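
By default, to_csv() also writes the row labels (the index) as an extra first column. The sketch below writes without the index and reads the data back; it uses an in-memory text buffer instead of a file on disk only to stay self-contained (the small DataFrame and the names buffer and df_back are illustrative).

```python
import io
import pandas as pd

df = pd.DataFrame({"Age": [42, 56], "City": ["New York", "Chicago"]})

# Write to an in-memory text buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)   # index=False omits the row labels

# Go back to the start of the buffer and read the data again
buffer.seek(0)
df_back = pd.read_csv(buffer)
print(df_back)
```

With a real file, the buffer is simply replaced by the file name, e.g. df.to_csv("filename.csv", index=False).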

3.2.3 Other Pandas useful functions:


# get info on the DataFrame Series (variables)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 5 non-null int64
1 Masters 5 non-null bool
2 City 5 non-null object
dtypes: bool(1), int64(1), object(1)
memory usage: 213.0+ bytes

# Count the non-missing (non-NaN) values in each column of the DataFrame


df.count()

Age 5
Masters 5
City 5
dtype: int64

## Descriptive Statistics
# define a new dictionary
data = {"Age" : [10, 14, 11,9],
"Height" : [100, 140, 120, 80],
"Weight" : [30, 42, 32, 28]
}
# transform it into a Pandas DataFrame
df = pd.DataFrame(data, columns = ["Age","Height","Weight"])

# sum of the vars


df.sum()
# cumulative sum of the vars
df.cumsum()
# min and max values
df.min()
df.max()

Age 14
Height 140
Weight 42
dtype: int64

# a function to have a summary of the dataframe statistics:


df.describe()
Age Height Weight
count 4.000000 4.000000 4.000000
mean 11.000000 110.000000 33.000000
std 2.160247 25.819889 6.218253
min 9.000000 80.000000 28.000000
25% 9.750000 95.000000 29.500000
50% 10.500000 110.000000 31.000000
75% 11.750000 125.000000 34.500000
max 14.000000 140.000000 42.000000

# average value
df.mean()
# median
df.median()
# correlation among variables
df.corr()

Age Height Weight


Age 1.000000 0.956183 0.992583
Height 0.956183 1.000000 0.913500
Weight 0.992583 0.913500 1.000000

3.2.4 Apply a function to a dataset using the function apply


We can apply a lambda function to all columns of the dataset.

For example, we can divide every column by a factor of 1.2.

# define a lambda
f = lambda x: x / 1.2

# Apply the lambda function element-wise to each column


df1 = df.apply(f)
df1

Age Height Weight


0 8.333333 83.333333 25.000000
1 11.666667 116.666667 35.000000
2 9.166667 100.000000 26.666667
3 7.500000 66.666667 23.333333

3.2.5 Take Home Exercises


Write the code to answer the following questions.

1. Which is the most represented country in the class?
2. Do the mean and the median of the variable "grades" coincide?
3. What is the proportion of females in the class?
4. Is the median age of the class over 24?
5. Where does Emma come from?
6. Is David older than Brian?
7. Are the Polish students taller than the Scottish students, on average?
8. What is the name of the tallest person in the class?

3.3 The Matplotlib library


Matplotlib is a Python library for data visualization.

Data visualizations are a powerful means to communicate complex information in a clear and
intuitive manner, aiding in the exploration and presentation of data-driven findings.

With Matplotlib, you can produce engaging visual representations that enhance understanding
and facilitate decision-making across various fields, from scientific research to business
analytics.

Tip: never underestimate the power of a graph. Utilize graphs wisely to unlock the potential of
your data and convey its significance to others.

We start by importing the library matplotlib as follows:

import matplotlib

Most of the plots are produced using the pyplot module of Matplotlib. So we import also pyplot:

import matplotlib.pyplot as plt

3.3.1 Scatterplot
A scatter plot is a simple yet effective visualization tool used to display the relationship between
two variables.

Let's see an example

# import numpy
import numpy as np

# Fix the random generator seed '123'


np.random.seed(123)

# Generate random data


x = np.random.rand(10)
y = np.random.rand(10)

# Create a scatter plot


plt.scatter(x, y)

# Add labels and title


plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Scatter Plot Example")

# Show the plot


plt.show()

You can specify the type of marker (points/lines) to use in the plot by using the marker
parameter in the scatter() function.

Here's how you can do it:

# Create a scatter plot with markers


plt.scatter(x, y, marker='o') # 'o' represents the circle marker

# Add labels and title


plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Scatter Plot with Markers")

# Show the plot


plt.show()
Or, you can choose both line and points as follows:

# define two new vectors


ax = np.array([10, 14, 17, 20, 22])
ay = np.array([40, 16, 22, 37, 11])
# and change the marker
plt.plot(ax,ay,marker="*")

[<matplotlib.lines.Line2D at 0x12751d6a0>]
In double quotation marks (" "), we can add additional parameters such as color and line style.

Here's a list of parameters you can specify, such as color, linestyle, and marker:

• Color: You can specify colors using strings like 'red' (r), 'blue' (b), 'green' (g), or in
HTML/CSS color codes like '#FF5733'.
• Linestyle: You can choose the linestyle of the plot, such as solid ('-'), dashed ('--'), dotted
(':'), and more.
• Marker: You can select a marker for each data point, like 'o' for circle, 's' for square, '^'
for triangle, etc.
• Marker Size: Adjust the size of markers with the markersize parameter.

Feel free to experiment with these parameters to create visually appealing and informative
plots.

Let's see an example:

# Generate data
x = np.linspace(0, 10, 20)
y = np.sin(x)

# Create a plot with customized appearance


plt.plot(x, y, color='green', linestyle='--', marker='^',
markersize=8)

# Add labels and title


plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Customized Plot Example")
# Show the plot
plt.show()
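
The same appearance can also be requested with a compact format string passed as the third argument of plt.plot(). In this sketch, the "Agg" backend line is an assumption that is only needed when the code runs as a script without a display; in a notebook it can be omitted and plt.show() used as usual.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 20)
y = np.sin(x)

# 'g--^' is shorthand for color='green', linestyle='--', marker='^'
lines = plt.plot(x, y, "g--^", markersize=8)

# plt.plot returns a list of line objects; inspect the first one
line = lines[0]
print(line.get_linestyle(), line.get_marker())
```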

There is also the function scatter which works similarly:

# scatter plot: alternative command

x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])

plt.scatter(x, y, color="hotpink")
plt.show()
To compare two different data sets, you can simply overlay two plots:

# Plot 1
x1 = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y1 = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
plot1 = plt.scatter(x1, y1, label='Dataset 1') # Add label for legend

# Plot 2
x2 = np.array([2,2,8,1,15,8,12,9,7,3,11,4,7,14,12])
y2 = np.array([100,105,84,105,90,99,90,95,94,100,79,112,91,80,85])
plot2 = plt.scatter(x2, y2, label='Dataset 2') # Add label for legend

# Add legend using the scatter plot objects


plt.legend(handles=[plot1, plot2])

# Show the plot


plt.show()
3.3.2 Bar plot
A bar plot, also known as a bar chart, is a widely used visualization tool for displaying categorical
data.

It represents data using rectangular bars, where the length or height of each bar corresponds to
the value (frequency) of a particular category or group.

Bar plots provide a clear visual representation of how categories differ from one another,
making them effective for conveying comparisons, trends, and distributions.

When to Use Bar Plots:

• Comparison between Categories: Use bar plots to compare the values of different
categories, such as comparing sales across different products or customer ratings
across different services.

• Frequency Distribution: Bar plots are effective for showing the frequency
distribution of categorical data, such as the distribution of student grades or the
distribution of responses to survey questions.

• Group Comparisons: When you want to compare values within subgroups of a


larger category, a grouped bar plot can visually represent the variations.

• Trends over Time: Stacked bar plots can be used to show how the composition of
categories changes over time, providing insights into evolving trends.

• Nominal or Ordinal Data: Bar plots are suitable for both nominal data (categories
with no inherent order) and ordinal data (categories with a defined order).

In Python, we will use the function 'plt.bar':

# define two arrays
x = np.array(["Ita", "Fra", "Ger", "UK"])
y = np.array([3, 8, 1, 10])

# represent the barplot specifying colors
plt.bar(x, y, color=["green", "red", "pink", "blue"])
plt.show()

Similarly to the scatter plot, there are various ways to customize the bar plot. Refer to the
available guides at:

https://matplotlib.org/stable/tutorials/index.html
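
The grouped bar plot mentioned among the use cases above can be sketched with 'plt.bar' by shifting the bar positions. This is a minimal sketch, not the only way to do it: the second series and the year labels are invented for illustration and are not part of the original data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: two values per country (numbers made up for illustration)
countries = np.array(["Ita", "Fra", "Ger", "UK"])
values_2022 = np.array([3, 8, 1, 10])
values_2023 = np.array([4, 6, 2, 9])

# Numeric positions of the groups on the x-axis and width of each bar
pos = np.arange(len(countries))
width = 0.4

# Draw the two series side by side, shifting each by half a bar width
bars_a = plt.bar(pos - width / 2, values_2022, width=width,
                 color="green", label="2022")
bars_b = plt.bar(pos + width / 2, values_2023, width=width,
                 color="blue", label="2023")

# Replace the numeric tick positions with the country names
plt.xticks(pos, countries)
plt.ylabel("Value")
plt.title("Grouped Bar Plot Example")
plt.legend()
plt.show()
```

Each call to 'plt.bar' draws one series; the legend picks up the 'label' given to each call.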

3.3.3 Histogram
A histogram is a graphical representation that provides insights into the distribution and
frequency of continuous data. Unlike bar plots, which are suitable for categorical data,
histograms are used to display the distribution of numeric data over continuous intervals or bins.
Each bin represents a range of values, and the height of the bar over a bin corresponds to the
frequency or count of data points falling within that range. Histograms are particularly useful
for identifying patterns, central tendencies, and outliers in your data.
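
The binning described above can be checked numerically before drawing anything. The sketch below uses 'np.histogram', which returns the counts and the bin edges without plotting; the small sample is made up for illustration.

```python
import numpy as np

# Small made-up sample of numeric data (for illustration only)
data = np.array([1.2, 1.9, 2.1, 2.4, 2.8, 3.3, 3.5, 3.8, 4.7, 5.9])

# Split the range [1, 6] into 5 bins of equal width
# and count how many observations fall in each bin
counts, edges = np.histogram(data, bins=5, range=(1, 6))

print(edges)   # bin boundaries: [1. 2. 3. 4. 5. 6.]
print(counts)  # observations per bin: [2 3 3 1 1]
```

The heights of the bars in a histogram of this sample would be exactly these counts.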

In Python, we will use the function "plt.hist()":

# set the random generation seed
np.random.seed(123)

# Define a vector of 100 normally distributed elements
# with a mean of 10 and a standard deviation of 2
# (the second argument of np.random.normal is the standard deviation, not the variance)
x = np.random.normal(10, 2, 100)

# Draw a histogram with the specified data and color
plt.hist(x, color="red")

# Display the histogram
plt.show()

Let me improve it a bit:

# set the random generation seed
np.random.seed(123)

# Generate data
x = np.random.normal(10, 2, 100)

# Create a histogram with customization
plt.hist(x, bins=15, color="skyblue", edgecolor="black", alpha=0.7)

# Add labels and title
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.title("Histogram of Normally Distributed Data")

# Add grid lines
plt.grid(axis="y", alpha=0.5)

# Display the histogram
plt.show()

See the power of plot customization?

However, it does not look much like a Normal distribution, does it?

Try increasing the sample size:

# set the random generation seed
np.random.seed(123)

# Generate data (1000 obs)
x = np.random.normal(10, 2, 1000)

# Create a histogram with customization
plt.hist(x, bins=15, color="skyblue", edgecolor="black", alpha=0.7)

# Add labels and title
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.title("Histogram of Normally Distributed Data (1000 data)")

# Add grid lines
plt.grid(axis="y", alpha=0.5)

# Display the histogram
plt.show()

And with one million observations:

# set the random generation seed
np.random.seed(123)

# Generate data (1,000,000 obs)
x = np.random.normal(10, 2, 1000000)

# Create a histogram with customization
plt.hist(x, bins=15, color="darkred", edgecolor="black", alpha=0.7)

# Add labels and title
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.title("Histogram of Normally Distributed Data (1Mln data)")

# Add grid lines
plt.grid(axis="y", alpha=0.5)

# Display the histogram
plt.show()
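
One way to judge how close the histogram is to a Normal distribution is to draw the theoretical density on top of it. The following is a minimal sketch under a few assumptions of ours: the sample size (100,000) and the number of bins (50) are arbitrary choices, and the density is written out by hand from the Normal formula rather than taken from a statistics library.

```python
import numpy as np
import matplotlib.pyplot as plt

# set the random generation seed
np.random.seed(123)

mu, sigma = 10, 2  # mean and standard deviation
x = np.random.normal(mu, sigma, 100000)

# density=True rescales the bar heights so that the total bar area equals 1,
# which makes the histogram comparable with a probability density
heights, edges, _ = plt.hist(x, bins=50, density=True, color="skyblue",
                             edgecolor="black", alpha=0.7)

# Theoretical Normal density evaluated on a grid of values
grid = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 201)
pdf = np.exp(-0.5 * ((grid - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Overlay the density curve on the histogram
plt.plot(grid, pdf, color="darkred", label="Theoretical density")
plt.xlabel("Value")
plt.ylabel("Density")
plt.title("Histogram vs. Normal Density")
plt.legend()
plt.show()
```

With 'density=True' the y-axis no longer shows counts, so the label is changed to "Density".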
3.3.4 Pie Plot
A pie plot, also known as a pie chart, is a circular chart used to display the distribution of a
categorical data set.

The chart is divided into sectors, where each sector represents a different category or group. The
size of each sector is proportional to the relative frequency or proportion of the corresponding
category within the data. Pie plots are particularly useful for visualizing parts of a whole and
comparing the contributions of different categories to the total.

Key Points about Pie Plots:

• Proportional Representation: The size of each sector corresponds to the proportion or
percentage of the category it represents. The entire pie represents 100% of the data.

• Limited Number of Categories: Pie plots are best suited for representing a small number of
categories. Too many categories can make the chart difficult to interpret.

• Visualizing Composition: Pie plots are effective for showing how a whole can be divided into
different parts, such as budget allocation, market share, or distribution of grades.

• Labels and Legends: Labels are often added to each sector to indicate the category it represents
and the corresponding percentage or value. A legend is used to provide more detailed
information about each category.

• Limitations: Pie plots can sometimes be misleading when it comes to comparing angles and
accurately interpreting small differences between categories.

When to Use Pie Plots:

Use pie plots when you want to:

• Display the distribution of categorical data in a visually appealing way.
• Highlight the composition of a whole and the relative sizes of its parts.
• Easily communicate the proportions or percentages of different categories.

In Python, we will use the command 'plt.pie()':

# define an array containing proportions
prop = np.array([35, 25, 25, 15])

# plot the pie
plt.pie(prop)
plt.show()
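
The values passed to 'plt.pie()' do not need to sum to 100: by default, each sector is sized by its share of the total. The sketch below illustrates this with raw counts (made up for illustration) that correspond to the same proportions as above.

```python
import numpy as np
import matplotlib.pyplot as plt

# Raw counts that do not sum to 100 (made-up values for illustration)
counts = np.array([7, 5, 5, 3])

# Share of each category in the total, in percent
shares = counts * 100 / counts.sum()
print(shares)  # [35. 25. 25. 15.]

# The pie drawn from the raw counts has the same sector sizes
# as the one drawn from the percentages
plt.pie(counts, autopct='%1.1f%%')
plt.show()
```

The percentages printed on the sectors by 'autopct' match the values in 'shares'.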

To customize the pie plot we can add:

• labels: Specifies the labels for each category.
• colors: Defines custom colors for each sector.
• explode: Determines how much to separate a specific sector from the rest.
• autopct = '%1.1f%%': Adds percentage labels to each sector.
• shadow = True: Adds shadow to the plot for depth.
• startangle = 0: Sets the starting angle for the first sector.

# Define an array containing proportions
prop = np.array([35, 25, 25, 15])

# Define labels for each category
labels = ['Alba', 'Bruno', 'Carlos', 'Dawson']

# Define custom colors for each sector
colors = ['skyblue', 'lightgreen', 'lightcoral', 'lightsalmon']

# Define the explode parameter to emphasize a specific sector
# (here the first one, 'Alba', is pulled out by 0.1)
explode = (0.1, 0, 0, 0)

# Create the pie plot with customizations
plt.pie(prop, labels=labels, colors=colors, explode=explode,
        autopct='%1.1f%%', shadow=True, startangle=0)

# Add a title
plt.title("Distribution of Categories")

# Display the pie plot
plt.show()

3.3.5 Take Home Exercises

Using the dataset class_df, answer the following questions:

1. Represent the distribution of the variable 'heights' through a histogram.
2. Draw a pie plot of the variable 'nationalities'.
3. Make a scatter plot which shows the relation between 'heights' and 'ages'.

Feel free to provide basic plots or to customize them.
