
ZAALI Mohamed

INDIA-1
ENSAM RABAT

NumPy, Pandas, Matplotlib, and Seaborn are essential tools in the data science toolkit. They provide the necessary functionality for data manipulation, analysis, and visualization, enabling data scientists to extract insights, make informed decisions, and communicate results effectively. Mastering these libraries is crucial for success in data science projects.

This document covers:
• NumPy : basic operations with NumPy
• Pandas : the most frequently used methods
• Matplotlib : the most frequently used methods
• Seaborn : the most frequently used methods
numpy-tutorial

March 28, 2024

NumPy - Numerical Python


Advantages of Numpy Arrays:
1. Support for element-wise mathematical operations
2. Faster operations than Python lists
[ ]: import numpy as np

List vs Numpy - Time Taken


[ ]: from time import process_time

Time taken by a list


[ ]: python_list = [i for i in range(10000)]

start_time = process_time()

python_list = [i+5 for i in python_list]

end_time = process_time()

print(end_time - start_time)

0.002204186999999802

[ ]: np_array = np.array([i for i in range(10000)])

start_time = process_time()

np_array += 5

end_time = process_time()

print(end_time - start_time)

0.00042786099999991833
Numpy Arrays

[ ]: # list
list1 = [1,2,3,4,5]
print(list1)
type(list1)

[1, 2, 3, 4, 5]

[ ]: list

[ ]: np_array = np.array([1,2,3,4,5])
print(np_array)
type(np_array)

[1 2 3 4 5]

[ ]: numpy.ndarray

[ ]: # creating a 1 dim array


a = np.array([1,2,3,4])
print(a)

[1 2 3 4]

[ ]: # shape : gives you the number of rows and columns


a.shape

[ ]: (4,)

[ ]: # creating a 2-dimensional array


b = np.array([(1,2,3,4),(5,6,7,8)])
print(b)

[[1 2 3 4]
[5 6 7 8]]

[ ]: b.shape

[ ]: (2, 4)

[ ]: # the dtype parameter is used to specify the data type of the array


c = np.array([(1,2,3,4),(5,6,7,8)],dtype=float)
print(c)

[[1. 2. 3. 4.]
[5. 6. 7. 8.]]
Initial Placeholders in numpy arrays
An initial placeholder is an array created with predefined initial values (zeros, ones, a constant, etc.)

[ ]: # create a numpy array of Zeros
x = np.zeros((4,5))
print(x)

[[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]]

[ ]: # create a numpy array of ones


y = np.ones((3,3))
print(y)

[[1. 1. 1.]
[1. 1. 1.]
[1. 1. 1.]]

[ ]: # array of a particular value


z = np.full((5,4),5)
print(z)

[[5 5 5 5]
[5 5 5 5]
[5 5 5 5]
[5 5 5 5]
[5 5 5 5]]

[ ]: # create an identity matrix


a = np.eye(5)
print(a)

[[1. 0. 0. 0. 0.]
[0. 1. 0. 0. 0.]
[0. 0. 1. 0. 0.]
[0. 0. 0. 1. 0.]
[0. 0. 0. 0. 1.]]

[ ]: # create a numpy array with random values


b = np.random.random((3,4))
print(b)

[[0.68489392 0.25352763 0.5164079 0.5846446 ]


[0.19119214 0.70512175 0.70097858 0.8374312 ]
[0.23455987 0.2631089 0.25602684 0.00590504]]

[ ]: # random integer values array within a specific range


c = np.random.randint(10,100,(3,5))
print(c)

[[94 97 81 94 41]
[86 14 96 50 77]
[61 87 91 55 93]]

[ ]: # array of evenly spaced values --> specifying the number of values required
d = np.linspace(10,30,5)
print(d)

[10. 15. 20. 25. 30.]

[ ]: # array of evenly spaced values --> specifying the step


e = np.arange(10,30,5)
print(e)

[10 15 20 25]

[ ]: # convert a list to a numpy array


list2 = [10,20,20,20,50]

np_array = np.asarray(list2)
print(np_array)
type(np_array)

[10 20 20 20 50]

[ ]: numpy.ndarray

Analysing a numpy array


[ ]: c = np.random.randint(10,90,(5,5))
print(c)

[[23 10 82 52 67]
[28 24 63 47 58]
[85 12 25 33 52]
[57 86 84 71 16]
[34 41 14 78 66]]

[ ]: # array dimension
print(c.shape)

(5, 5)

[ ]: # number of dimensions
print(c.ndim)

2
[ ]: # number of elements in an array
print(c.size)

25

[ ]: # checking the data type of the values in the array


print(c.dtype)

int64
Mathematical operations on a NumPy array
[ ]: list1 = [1,2,3,4,5]
list2 = [6,7,8,9,10]

print(list1 + list2) # concatenates (joins) the two lists

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

[ ]: a = np.random.randint(0,10,(3,3))
b = np.random.randint(10,20,(3,3))

[ ]: print(a)
print(b)

[[2 3 4]
[1 0 1]
[9 7 3]]
[[10 13 10]
[16 19 10]
[10 19 17]]

[ ]: print(a+b)
print(a-b)
print(a*b)
print(a/b)

[[12 16 14]
[17 19 11]
[19 26 20]]
[[ -8 -10 -6]
[-15 -19 -9]
[ -1 -12 -14]]
[[ 20 39 40]
[ 16 0 10]
[ 90 133 51]]
[[0.2 0.23076923 0.4 ]
[0.0625 0. 0.1 ]
[0.9 0.36842105 0.17647059]]

[ ]: a = np.random.randint(0,10,(3,3))
b = np.random.randint(10,20,(3,3))

[ ]: print(a)
print(b)

[[1 1 3]
[6 4 3]
[1 5 7]]
[[13 10 18]
[14 11 13]
[11 16 16]]

[ ]: print(np.add(a,b))
print(np.subtract(a,b))
print(np.multiply(a,b))
print(np.divide(a,b))

[[14 11 21]
[20 15 16]
[12 21 23]]
[[-12 -9 -15]
[ -8 -7 -10]
[-10 -11 -9]]
[[ 13 10 54]
[ 84 44 39]
[ 11 80 112]]
[[0.07692308 0.1 0.16666667]
[0.42857143 0.36363636 0.23076923]
[0.09090909 0.3125 0.4375 ]]
Array Manipulation
[ ]: array = np.random.randint(0,10,(2,3))
print(array)
print(array.shape)

[[9 2 4]
[0 2 0]]
(2, 3)

[ ]: # transpose
trans = np.transpose(array)
print(trans)
print(trans.shape)

[[9 0]
[2 2]
[4 0]]
(3, 2)

[ ]: array = np.random.randint(0,10,(2,3))
print(array)
print(array.shape)

[[6 5 9]
[8 2 3]]
(2, 3)

[ ]: # transpose other method


trans2 = array.T
print(trans2)
print(trans2.shape)

[[6 8]
[5 2]
[9 3]]
(3, 2)

[ ]: # reshaping an array
a = np.random.randint(0,10,(2,3))
print(a)
print(a.shape)

[[8 2 2]
[4 6 4]]
(2, 3)

[ ]: b = a.reshape(3,2)
print(b)
print(b.shape)

[[8 2]
[2 4]
[6 4]]
(3, 2)

[ ]: c = a.reshape(6)
print(c)
print(c.shape)

[8 2 2 4 6 4]
(6,)

[ ]: d = a.reshape((1,2,3))
print(d)
print(d.shape)

[[[8 2 2]
[4 6 4]]]
(1, 2, 3)

pandas-tutorial

March 28, 2024

Pandas Library:
Useful for Data Processing & Analysis
Pandas DataFrame:
A Pandas DataFrame is a two-dimensional tabular data structure with labeled axes (rows and columns).
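As a minimal illustration of these labeled axes (the column names and values below are made up for this sketch), a DataFrame can also be built directly from a Python dictionary:

[ ]: import pandas as pd

# each dictionary key becomes a labeled column; the index gives the row labels
students = {'name': ['Amal', 'Sara', 'Yassine'],
            'age': [21, 22, 20],
            'grade': [15.5, 17.0, 12.5]}
df = pd.DataFrame(students, index=['s1', 's2', 's3'])
print(df)
print(df.loc['s2', 'grade']) # access a value by its row label and column label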

[ ]: # importing the pandas library


import pandas as pd
import numpy as np

Creating a Pandas DataFrame


[ ]: # importing the fetch california housing data
from sklearn.datasets import fetch_california_housing

[ ]: california_dataset = fetch_california_housing() # this loads the fetch_california_housing dataset into the california_dataset variable

[ ]: type(california_dataset)
# Bunch is a dictionary-like object; it contains the data and its metadata

[ ]: sklearn.utils._bunch.Bunch

[ ]: print(california_dataset)

{'data': array([[ 8.3252 , 41. , 6.98412698, …,


2.55555556,
37.88 , -122.23 ],
[ 8.3014 , 21. , 6.23813708, …, 2.10984183,
37.86 , -122.22 ],
[ 7.2574 , 52. , 8.28813559, …, 2.80225989,
37.85 , -122.24 ],
…,
[ 1.7 , 17. , 5.20554273, …, 2.3256351 ,
39.43 , -121.22 ],
[ 1.8672 , 18. , 5.32951289, …, 2.12320917,
39.43 , -121.32 ],
[ 2.3886 , 16. , 5.25471698, …, 2.61698113,

39.37 , -121.24 ]]), 'target': array([4.526, 3.585, 3.521,
…, 0.923, 0.847, 0.894]), 'frame': None, 'target_names': ['MedHouseVal'],
'feature_names': ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population',
'AveOccup', 'Latitude', 'Longitude'], 'DESCR': '..
_california_housing_dataset:\n\nCalifornia Housing
dataset\n--------------------------\n\n**Data Set Characteristics:**\n\n
:Number of Instances: 20640\n\n :Number of Attributes: 8 numeric, predictive
attributes and the target\n\n :Attribute Information:\n - MedInc
median income in block group\n - HouseAge median house age in block
group\n - AveRooms average number of rooms per household\n -
AveBedrms average number of bedrooms per household\n - Population
block group population\n - AveOccup average number of household
members\n - Latitude block group latitude\n - Longitude
block group longitude\n\n :Missing Attribute Values: None\n\nThis dataset was
obtained from the StatLib
repository.\nhttps://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html\n\nThe
target variable is the median house value for California districts,\nexpressed
in hundreds of thousands of dollars ($100,000).\n\nThis dataset was derived from
the 1990 U.S. census, using one row per census\nblock group. A block group is
the smallest geographical unit for which the U.S.\nCensus Bureau publishes
sample data (a block group typically has a population\nof 600 to 3,000
people).\n\nA household is a group of people residing within a home. Since the
average\nnumber of rooms and bedrooms in this dataset are provided per
household, these\ncolumns may take surprisingly large values for block groups
with few households\nand many empty houses, such as vacation resorts.\n\nIt can
be downloaded/loaded using
the\n:func:`sklearn.datasets.fetch_california_housing` function.\n\n.. topic::
References\n\n - Pace, R. Kelley and Ronald Barry, Sparse Spatial
Autoregressions,\n Statistics and Probability Letters, 33 (1997)
291-297\n'}
This raw format is not suitable for analysis, so this is where pandas comes into play: it helps us load the data into a more structured table.
[ ]: # pandas DataFrame
california_df = pd.DataFrame(california_dataset.data, columns = california_dataset.feature_names)

# we are creating a pandas DataFrame; we need to give it the data we want
# and the name of each column we want

[ ]: california_df.head() # this function prints the first five rows of the data frame

[ ]: MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude \


0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85

Longitude
0 -122.23
1 -122.22
2 -122.24
3 -122.25
4 -122.25

[ ]: california_df.shape #to check the shape of a data frame

[ ]: (20640, 8)

[ ]: type(california_df)

[ ]: pandas.core.frame.DataFrame

Importing the data from a CSV file to a pandas DataFrame


[ ]: # csv file to pandas df
diabetes_df = pd.read_csv('/content/diabetes.csv')
# the read_csv function reads the csv file and stores all its values in a data frame

[ ]: type(diabetes_df)

[ ]: pandas.core.frame.DataFrame

[ ]: diabetes_df.head()

[ ]: Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \


0 6 148 72 35 0 33.6
1 1 85 66 29 0 26.6
2 8 183 64 0 0 23.3
3 1 89 66 23 94 28.1
4 0 137 40 35 168 43.1

DiabetesPedigreeFunction Age Outcome


0 0.627 50 1
1 0.351 31 0
2 0.672 32 1
3 0.167 21 0
4 2.288 33 1

[ ]: diabetes_df.shape

[ ]: (768, 9)

Loading the data from an Excel file to a Pandas DataFrame:

pd.read_excel('file path')

[ ]: Financial_sample_df = pd.read_excel('/content/Financial Sample.xlsx')

[ ]: Financial_sample_df.head()

[ ]: Segment Country Product Discount Band Units Sold \


0 Government Canada Carretera None 1618.5
1 Government Germany Carretera None 1321.0
2 Midmarket France Carretera None 2178.0
3 Midmarket Germany Carretera None 888.0
4 Midmarket Mexico Carretera None 2470.0

Manufacturing Price Sale Price Gross Sales Discounts Sales COGS \


0 3 20 32370.0 0.0 32370.0 16185.0
1 3 20 26420.0 0.0 26420.0 13210.0
2 3 15 32670.0 0.0 32670.0 21780.0
3 3 15 13320.0 0.0 13320.0 8880.0
4 3 15 37050.0 0.0 37050.0 24700.0

Profit Date Month Number Month Name Year


0 16185.0 2014-01-01 1 January 2014
1 13210.0 2014-01-01 1 January 2014
2 10890.0 2014-06-01 6 June 2014
3 4440.0 2014-06-01 6 June 2014
4 12350.0 2014-06-01 6 June 2014

[ ]: Financial_sample_df.shape

[ ]: (700, 16)

Exporting a DataFrame to a csv file


[ ]: california_df.to_csv('california.csv')

Exporting the Pandas DataFrame to an Excel file:

df.to_excel('filename')

[ ]: california_df.to_excel('california.xlsx')

[ ]: # creating a DataFrame with random values


random_df = pd.DataFrame(np.random.rand(20,10))

[ ]: random_df.head()

[ ]: 0 1 2 3 4 5 6 \
0 0.334997 0.461948 0.798143 0.160828 0.469857 0.132035 0.973342
1 0.817427 0.134303 0.191498 0.020126 0.157262 0.308749 0.746255
2 0.786123 0.290734 0.773516 0.260323 0.970542 0.940605 0.751676
3 0.801180 0.993138 0.562503 0.524121 0.192244 0.506380 0.472183
4 0.859077 0.762377 0.853730 0.414529 0.000119 0.329558 0.166290

7 8 9
0 0.219995 0.408478 0.070123
1 0.649148 0.900201 0.726858
2 0.981982 0.536330 0.388127
3 0.234543 0.348499 0.024407
4 0.397130 0.356937 0.405396

[ ]: random_df.shape

[ ]: (20, 10)

Inspecting a DataFrame
[ ]: #finding the number of rows & columns
california_df.shape

[ ]: (20640, 8)

[ ]: # first 5 rows in a DataFrame


california_df.head()

[ ]: MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude \


0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85

Longitude
0 -122.23
1 -122.22
2 -122.24
3 -122.25
4 -122.25

[ ]: # head shows 5 rows by default, but you can pass the number of rows you want as a parameter

california_df.head(30)

[ ]: MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude \
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85
5 4.0368 52.0 4.761658 1.103627 413.0 2.139896 37.85
6 3.6591 52.0 4.931907 0.951362 1094.0 2.128405 37.84
7 3.1200 52.0 4.797527 1.061824 1157.0 1.788253 37.84
8 2.0804 42.0 4.294118 1.117647 1206.0 2.026891 37.84
9 3.6912 52.0 4.970588 0.990196 1551.0 2.172269 37.84
10 3.2031 52.0 5.477612 1.079602 910.0 2.263682 37.85
11 3.2705 52.0 4.772480 1.024523 1504.0 2.049046 37.85
12 3.0750 52.0 5.322650 1.012821 1098.0 2.346154 37.85
13 2.6736 52.0 4.000000 1.097701 345.0 1.982759 37.84
14 1.9167 52.0 4.262903 1.009677 1212.0 1.954839 37.85
15 2.1250 50.0 4.242424 1.071970 697.0 2.640152 37.85
16 2.7750 52.0 5.939577 1.048338 793.0 2.395770 37.85
17 2.1202 52.0 4.052805 0.966997 648.0 2.138614 37.85
18 1.9911 50.0 5.343675 1.085919 990.0 2.362768 37.84
19 2.6033 52.0 5.465455 1.083636 690.0 2.509091 37.84
20 1.3578 40.0 4.524096 1.108434 409.0 2.463855 37.85
21 1.7135 42.0 4.478142 1.002732 929.0 2.538251 37.85
22 1.7250 52.0 5.096234 1.131799 1015.0 2.123431 37.84
23 2.1806 52.0 5.193846 1.036923 853.0 2.624615 37.84
24 2.6000 52.0 5.270142 1.035545 1006.0 2.383886 37.84
25 2.4038 41.0 4.495798 1.033613 317.0 2.663866 37.85
26 2.4597 49.0 4.728033 1.020921 607.0 2.539749 37.85
27 1.8080 52.0 4.780856 1.060453 1102.0 2.775819 37.85
28 1.6424 50.0 4.401691 1.040169 1131.0 2.391121 37.84
29 1.6875 52.0 4.703226 1.032258 395.0 2.548387 37.84

Longitude
0 -122.23
1 -122.22
2 -122.24
3 -122.25
4 -122.25
5 -122.25
6 -122.25
7 -122.25
8 -122.26
9 -122.25
10 -122.26
11 -122.26
12 -122.26
13 -122.26
14 -122.26
15 -122.26
16 -122.27
17 -122.27
18 -122.26
19 -122.27
20 -122.27
21 -122.27
22 -122.27
23 -122.27
24 -122.27
25 -122.28
26 -122.28
27 -122.28
28 -122.28
29 -122.28

[ ]: # last 5 rows of the DataFrame


california_df.tail()

[ ]: MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude \


20635 1.5603 25.0 5.045455 1.133333 845.0 2.560606 39.48
20636 2.5568 18.0 6.114035 1.315789 356.0 3.122807 39.49
20637 1.7000 17.0 5.205543 1.120092 1007.0 2.325635 39.43
20638 1.8672 18.0 5.329513 1.171920 741.0 2.123209 39.43
20639 2.3886 16.0 5.254717 1.162264 1387.0 2.616981 39.37

Longitude
20635 -121.09
20636 -121.21
20637 -121.22
20638 -121.32
20639 -121.24

[ ]: # tail shows the last 5 rows by default, but you can pass the number of rows you want as a parameter

california_df.tail(10)

[ ]: MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude \


20630 3.5673 11.0 5.932584 1.134831 1257.0 2.824719 39.29
20631 3.5179 15.0 6.145833 1.141204 1200.0 2.777778 39.33
20632 3.1250 15.0 6.023377 1.080519 1047.0 2.719481 39.26
20633 2.5495 27.0 5.445026 1.078534 1082.0 2.832461 39.19
20634 3.7125 28.0 6.779070 1.148256 1041.0 3.026163 39.27
20635 1.5603 25.0 5.045455 1.133333 845.0 2.560606 39.48
20636 2.5568 18.0 6.114035 1.315789 356.0 3.122807 39.49
20637 1.7000 17.0 5.205543 1.120092 1007.0 2.325635 39.43
20638 1.8672 18.0 5.329513 1.171920 741.0 2.123209 39.43
20639 2.3886 16.0 5.254717 1.162264 1387.0 2.616981 39.37

Longitude
20630 -121.32
20631 -121.40
20632 -121.45
20633 -121.53
20634 -121.56
20635 -121.09
20636 -121.21
20637 -121.22
20638 -121.32
20639 -121.24

[ ]: # information about the DataFrame


california_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 MedInc 20640 non-null float64
1 HouseAge 20640 non-null float64
2 AveRooms 20640 non-null float64
3 AveBedrms 20640 non-null float64
4 Population 20640 non-null float64
5 AveOccup 20640 non-null float64
6 Latitude 20640 non-null float64
7 Longitude 20640 non-null float64
dtypes: float64(8)
memory usage: 1.3 MB

[ ]: # finding the missing values

# In pandas, the isnull() function is used to detect missing or null values in a DataFrame or Series.
# It returns a boolean mask where True indicates missing values and False indicates non-missing values.

california_df.isnull()

[ ]: MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude \


0 False False False False False False False
1 False False False False False False False
2 False False False False False False False
3 False False False False False False False
4 False False False False False False False
… … … … … … … …
20635 False False False False False False False
20636 False False False False False False False
20637 False False False False False False False
20638 False False False False False False False
20639 False False False False False False False

Longitude
0 False
1 False
2 False
3 False
4 False
… …
20635 False
20636 False
20637 False
20638 False
20639 False

[20640 rows x 8 columns]

[ ]: # finding the number of missing values


# you can count the number of missing values by adding the sum function
california_df.isnull().sum()

[ ]: MedInc 0
HouseAge 0
AveRooms 0
AveBedrms 0
Population 0
AveOccup 0
Latitude 0
Longitude 0
dtype: int64

[ ]: # diabetes dataframe
diabetes_df.head()

[ ]: Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \


0 6 148 72 35 0 33.6
1 1 85 66 29 0 26.6
2 8 183 64 0 0 23.3
3 1 89 66 23 94 28.1
4 0 137 40 35 168 43.1

DiabetesPedigreeFunction Age Outcome

0 0.627 50 1
1 0.351 31 0
2 0.672 32 1
3 0.167 21 0
4 2.288 33 1

[ ]: # in the Outcome column, "1" represents a diabetic person and "0" a non-diabetic person

# counting the values based on the labels

diabetes_df.value_counts('Outcome') # counts the occurrences of each value in the column

[ ]: Outcome
0 500
1 268
dtype: int64

[ ]: # group the rows by label and compute the mean of each group


diabetes_df.groupby('Outcome').mean()

[ ]: Pregnancies Glucose BloodPressure SkinThickness Insulin \


Outcome
0 3.298000 109.980000 68.184000 19.664000 68.792000
1 4.865672 141.257463 70.824627 22.164179 100.335821

BMI DiabetesPedigreeFunction Age


Outcome
0 30.304200 0.429734 31.190000
1 35.142537 0.550500 37.067164

Statistical Measures
[ ]: # count or number of values
california_df.count() # the count function gives us the number of values in each column

[ ]: MedInc 20640
HouseAge 20640
AveRooms 20640
AveBedrms 20640
Population 20640
AveOccup 20640
Latitude 20640
Longitude 20640
dtype: int64

[ ]: # mean value - column wise


california_df.mean() #the mean value of each column

[ ]: MedInc 3.870671
HouseAge 28.639486
AveRooms 5.429000
AveBedrms 1.096675
Population 1425.476744
AveOccup 3.070655
Latitude 35.631861
Longitude -119.569704
dtype: float64

[ ]: # standard deviation - column wise


california_df.std()

[ ]: MedInc 1.899822
HouseAge 12.585558
AveRooms 2.474173
AveBedrms 0.473911
Population 1132.462122
AveOccup 10.386050
Latitude 2.135952
Longitude 2.003532
dtype: float64

[ ]: # minimum value
california_df.min()

[ ]: MedInc 0.499900
HouseAge 1.000000
AveRooms 0.846154
AveBedrms 0.333333
Population 3.000000
AveOccup 0.692308
Latitude 32.540000
Longitude -124.350000
dtype: float64

[ ]: # maximum value
california_df.max()

[ ]: MedInc 15.000100
HouseAge 52.000000
AveRooms 141.909091
AveBedrms 34.066667
Population 35682.000000
AveOccup 1243.333333
Latitude 41.950000
Longitude -114.310000
dtype: float64

[ ]: # all the statistical measures about the dataframe


california_df.describe()

[ ]: MedInc HouseAge AveRooms AveBedrms Population \


count 20640.000000 20640.000000 20640.000000 20640.000000 20640.000000
mean 3.870671 28.639486 5.429000 1.096675 1425.476744
std 1.899822 12.585558 2.474173 0.473911 1132.462122
min 0.499900 1.000000 0.846154 0.333333 3.000000
25% 2.563400 18.000000 4.440716 1.006079 787.000000
50% 3.534800 29.000000 5.229129 1.048780 1166.000000
75% 4.743250 37.000000 6.052381 1.099526 1725.000000
max 15.000100 52.000000 141.909091 34.066667 35682.000000

AveOccup Latitude Longitude


count 20640.000000 20640.000000 20640.000000
mean 3.070655 35.631861 -119.569704
std 10.386050 2.135952 2.003532
min 0.692308 32.540000 -124.350000
25% 2.429741 33.930000 -121.800000
50% 2.818116 34.260000 -118.490000
75% 3.282261 37.710000 -118.010000
max 1243.333333 41.950000 -114.310000

The 25% row means that 25% of the values are less than that value; the same applies to the 50% and 75% rows.
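As a small worked example of these quantiles (the numbers below are made up), pandas can compute them directly:

[ ]: import pandas as pd

values = pd.Series([10, 20, 30, 40, 50, 60, 70, 80])
# 25% of the values fall below the first result, 50% below the second, 75% below the third
print(values.quantile([0.25, 0.5, 0.75]))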
Manipulating a DataFrame
[ ]: # adding a column to a dataframe
california_df['Price'] = california_dataset.target

[ ]: california_df.head()

[ ]: MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude \


0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85

Longitude Price
0 -122.23 4.526
1 -122.22 3.585
2 -122.24 3.521
3 -122.25 3.413
4 -122.25 3.422

[ ]: # removing a row
california_df.drop(index=0, axis=0) # removes the row with index 0, but only temporarily

# to remove it permanently we need to store the result back into california_df:
# california_df = california_df.drop(index=0, axis=0)

[ ]: MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude \


1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85
5 4.0368 52.0 4.761658 1.103627 413.0 2.139896 37.85
… … … … … … … …
20635 1.5603 25.0 5.045455 1.133333 845.0 2.560606 39.48
20636 2.5568 18.0 6.114035 1.315789 356.0 3.122807 39.49
20637 1.7000 17.0 5.205543 1.120092 1007.0 2.325635 39.43
20638 1.8672 18.0 5.329513 1.171920 741.0 2.123209 39.43
20639 2.3886 16.0 5.254717 1.162264 1387.0 2.616981 39.37

Longitude Price
1 -122.22 3.585
2 -122.24 3.521
3 -122.25 3.413
4 -122.25 3.422
5 -122.25 2.697
… … …
20635 -121.09 0.781
20636 -121.21 0.771
20637 -121.22 0.923
20638 -121.32 0.847
20639 -121.24 0.894

[20639 rows x 9 columns]

[ ]: # drop a column
california_df.drop(columns='MedInc', axis=1)

[ ]: HouseAge AveRooms AveBedrms Population AveOccup Latitude \


0 41.0 6.984127 1.023810 322.0 2.555556 37.88
1 21.0 6.238137 0.971880 2401.0 2.109842 37.86
2 52.0 8.288136 1.073446 496.0 2.802260 37.85
3 52.0 5.817352 1.073059 558.0 2.547945 37.85
4 52.0 6.281853 1.081081 565.0 2.181467 37.85
… … … … … … …
20635 25.0 5.045455 1.133333 845.0 2.560606 39.48
20636 18.0 6.114035 1.315789 356.0 3.122807 39.49
20637 17.0 5.205543 1.120092 1007.0 2.325635 39.43
20638 18.0 5.329513 1.171920 741.0 2.123209 39.43
20639 16.0 5.254717 1.162264 1387.0 2.616981 39.37

Longitude Price
0 -122.23 4.526
1 -122.22 3.585
2 -122.24 3.521
3 -122.25 3.413
4 -122.25 3.422
… … …
20635 -121.09 0.781
20636 -121.21 0.771
20637 -121.22 0.923
20638 -121.32 0.847
20639 -121.24 0.894

[20640 rows x 8 columns]

[ ]: # locating a row using the index value


california_df.iloc[2] # gives us all the values in the row with index 2

[ ]: MedInc 7.257400
HouseAge 52.000000
AveRooms 8.288136
AveBedrms 1.073446
Population 496.000000
AveOccup 2.802260
Latitude 37.850000
Longitude -122.240000
Price 3.521000
Name: 2, dtype: float64

[ ]: # locating a particular column


# iloc uses integer positions; with loc we can use the column names
print(california_df.iloc[:,0]) # first column
print(california_df.iloc[:,1]) # second column
print(california_df.loc[:,'Latitude']) # Latitude column
print(california_df.iloc[:,-1]) # last column

0 8.3252
1 8.3014
2 7.2574
3 5.6431
4 3.8462

20635 1.5603
20636 2.5568
20637 1.7000
20638 1.8672
20639 2.3886
Name: MedInc, Length: 20640, dtype: float64
0 41.0
1 21.0
2 52.0
3 52.0
4 52.0

20635 25.0
20636 18.0
20637 17.0
20638 18.0
20639 16.0
Name: HouseAge, Length: 20640, dtype: float64
0 37.88
1 37.86
2 37.85
3 37.85
4 37.85

20635 39.48
20636 39.49
20637 39.43
20638 39.43
20639 39.37
Name: Latitude, Length: 20640, dtype: float64
0 4.526
1 3.585
2 3.521
3 3.413
4 3.422

20635 0.781
20636 0.771
20637 0.923
20638 0.847
20639 0.894
Name: Price, Length: 20640, dtype: float64
Correlation is a statistical measure used to determine the strength and direction of the relationship between two variables (see the short sketch just below):
1. Positive Correlation: as one variable increases, the other variable also tends to increase.
2. Negative Correlation: as one variable increases, the other variable tends to decrease, and vice versa.
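A minimal sketch of the two cases, using made-up numbers:

[ ]: import pandas as pd

df = pd.DataFrame({'hours_studied': [1, 2, 3, 4, 5],
                   'exam_score': [50, 55, 65, 70, 80],  # rises with hours_studied --> positive correlation
                   'hours_of_tv': [10, 8, 6, 4, 2]})    # falls as hours_studied rises --> negative correlation
print(df.corr())
# the hours_studied/exam_score entry is close to +1, the hours_studied/hours_of_tv entry is -1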

[ ]: california_df.corr()
#- all the columns will be compared to other columns
#- negative value means it is negatively correlated
#- positive value means it is positively correlated

[ ]: MedInc HouseAge AveRooms AveBedrms Population AveOccup \


MedInc 1.000000 -0.119034 0.326895 -0.062040 0.004834 0.018766
HouseAge -0.119034 1.000000 -0.153277 -0.077747 -0.296244 0.013191
AveRooms 0.326895 -0.153277 1.000000 0.847621 -0.072213 -0.004852
AveBedrms -0.062040 -0.077747 0.847621 1.000000 -0.066197 -0.006181
Population 0.004834 -0.296244 -0.072213 -0.066197 1.000000 0.069863
AveOccup 0.018766 0.013191 -0.004852 -0.006181 0.069863 1.000000
Latitude -0.079809 0.011173 0.106389 0.069721 -0.108785 0.002366
Longitude -0.015176 -0.108197 -0.027540 0.013344 0.099773 0.002476
Price 0.688075 0.105623 0.151948 -0.046701 -0.024650 -0.023737

Latitude Longitude Price


MedInc -0.079809 -0.015176 0.688075
HouseAge 0.011173 -0.108197 0.105623
AveRooms 0.106389 -0.027540 0.151948
AveBedrms 0.069721 0.013344 -0.046701
Population -0.108785 0.099773 -0.024650
AveOccup 0.002366 0.002476 -0.023737
Latitude 1.000000 -0.924664 -0.144160
Longitude -0.924664 1.000000 -0.045967
Price -0.144160 -0.045967 1.000000

matplotlib-tutorial

March 28, 2024

Matplotlib:
• Useful for making Plots
[ ]: # importing matplotlib library
import matplotlib.pyplot as plt

[ ]: # import numpy to get data for our plots


import numpy as np

[ ]: x = np.linspace(0,10,100)
y = np.sin(x)
z = np.cos(x)

Plotting the data


[ ]: # sin wave
plt.plot(x,y)
plt.show() #This line displays the plot on the screen.

[ ]: # cos wave
plt.plot(x,z)
plt.show()

[ ]: # adding title, x-axis & y-axis labels


plt.plot(x,y)
plt.xlabel('angle')
plt.ylabel('sine value')
plt.title('sine wave')
plt.show()

[ ]: # parabola
x = np.linspace(-10,10,20)
y = x**2
plt.plot(x,y)
plt.show()

[ ]: plt.plot(x, y, 'r+')
plt.show()

[ ]: plt.plot(x, y, 'g.')
plt.show()

[ ]: plt.plot(x, y, 'rx')
plt.show()

[ ]: x = np.linspace(-5,5,50)
plt.plot(x, np.sin(x), 'g-')
plt.plot(x, np.cos(x), 'r--')
plt.show()

Bar Plot: provides a clear and concise way to compare categorical data. Its simplicity allows for easy interpretation, making it accessible to a wide audience, even those without extensive statistical knowledge.
[ ]: fig = plt.figure() # This line initializes a new figure object. A figure is the entire window or page that the plot is drawn on.

ax = fig.add_axes([0,0,1,1]) # This line adds an axes object to the figure.
# The list [0,0,1,1] specifies the position and size of the axes within the figure.
# Here, it means the axes spans the entire figure from left (0) to right (1) and from bottom (0) to top (1).

languages = ['English','French','Spanish','Latin','German']
people = [100, 50, 150, 40, 70]
ax.bar(languages, people) # This line creates a bar plot on the axes ax.
# It takes the list of languages as the x-axis values and the list of people as the corresponding y-axis values.
# Matplotlib automatically creates bars for each pair of x and y values.

plt.xlabel('LANGUAGES')
plt.ylabel('NUMBER OF PEOPLE')
plt.show()

Pie Chart: very useful for showing how the data is distributed across an entire dataset.
[ ]: fig1 = plt.figure()
ax = fig1.add_axes([0,0,1,1])
languages = ['English','French','Spanish','Latin','German']
people = [100, 50, 150, 40, 70]
ax.pie(people, labels=languages, autopct='%1.1f%%') # This line creates a pie chart.
# It takes the list of people as the data to be plotted,
# 'labels=languages' assigns labels to each slice of the pie chart based on the languages list,
# and autopct='%1.1f%%' formats the percentage display on each slice of the pie chart.

plt.show()

Scatter Plot
[ ]: x = np.linspace(0,10,30)
y = np.sin(x)
z = np.cos(x)
fig2 = plt.figure()
ax = fig2.add_axes([0,0,1,1])
ax.scatter(x,y,color='g')
ax.scatter(x,z,color='b')
plt.show()

In a scatter plot the data points are drawn individually, so there is no line joining them; this makes it very useful in clustering applications.
3D Scatter Plot
[ ]: fig3 = plt.figure()
ax = plt.axes(projection='3d') # This line creates a 3D axes object using the projection='3d' parameter, indicating that the plot will be in 3D space.

z = 20 * np.random.random(100)
x = np.sin(z)
y = np.cos(z)
ax.scatter(x,y,z,c=z,cmap='Blues')
# This line creates a scatter plot in 3D space.
# It takes the x, y, and z coordinates of each point as inputs,
# along with the c=z parameter to specify the color of each point based on its z-coordinate.
# The cmap='Blues' parameter sets the color map used for mapping scalar values to colors.

plt.show()

seaborn-tutorial

March 28, 2024

Seaborn: - Data Visualization Library


Importing the Libraries
[ ]: import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

Note : Seaborn has some built-in datasets
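To see which ones are available, one option (assuming an internet connection, since the list is fetched from the seaborn-data repository) is:

[ ]: import seaborn as sns

print(sns.get_dataset_names()) # includes 'tips', 'iris', 'titanic', among others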


[ ]: # total bill vs tip dataset
tips = sns.load_dataset('tips') # this will be imported in the form of a pandas DataFrame

[ ]: tips.head()

[ ]: total_bill tip sex smoker day time size


0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

[ ]: # setting a theme for the plots


sns.set_theme()

[ ]: # visualize the data


sns.relplot(data=tips, x='total_bill', y='tip', col='time', hue='smoker', style='smoker', size='size')

'''
This code utilizes Seaborn's relplot() function to create a relational plot (relplot) based on the provided data:

1. data=tips : This specifies the dataset to be used for plotting.

2. x='total_bill', y='tip': These parameters specify which columns from the dataset will be used for the x-axis and the y-axis.

3. col='time' : This parameter indicates that the plots will be organized into columns based on the values in the time column of the dataset.
   This means that a separate plot will be generated for each unique value in the time column.

4. hue='smoker' : This parameter assigns colors to the data points based on the values in the smoker column.
   Each unique value in the smoker column will be represented by a different color.

5. style='smoker': This parameter determines the style of the markers used in the plot,
   meaning that different marker styles will be used for data points based on their values in the smoker column.

6. size='size': This parameter specifies the size of the markers based on the values in the size column of the dataset.
'''

[ ]: "\nThis code utilizes Seaborn's relplot() function to create a relational plot


(relplot) based on the provided data:\n\n1. data=tips : This specifies the
dataset to be used for plotting.\n\n2. x='total_bill', y='tip': These parameters
specify which columns from the dataset will be used for the x-axis and y-axis
(tip) .\n\n3. col='time' : This parameter indicates that the plots will be
organized into columns based on the values in the time column of the dataset.\n
This means that separate plots will be generated for each unique value in the
time column.\n\n4. hue='smoker' : This parameter assigns colors to the data
points based on the values in the smoker column. \n Each unique
value in the smoker column will be represented by a different color.\n\n5.
style='smoker': This parameter determines the style of the markers used in the
plot.\n meaning that different marker styles will be used for
data points based on their values in the smoker column.\n\n6. size='size': This
parameter specifies the size of the markers based on the values in the size
column of the dataset.\n"

This is the advantage of Seaborn over Matplotlib: in Matplotlib you need to specify all of these steps manually, whereas Seaborn automatically finds the groupings and plots them for you (a comparison sketch follows).
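As a rough sketch of that difference, here is the same grouped scatter plot done manually with Matplotlib and in a single Seaborn call (the tips dataset is reloaded here so the snippet stands alone):

[ ]: import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset('tips')

# matplotlib: we split the groups, choose colors/markers and build the legend ourselves
fig, ax = plt.subplots()
for smoker, marker, color in [('Yes', 'x', 'red'), ('No', 'o', 'blue')]:
    subset = tips[tips['smoker'] == smoker]
    ax.scatter(subset['total_bill'], subset['tip'], marker=marker, color=color, label=smoker)
ax.set_xlabel('total_bill')
ax.set_ylabel('tip')
ax.legend(title='smoker')
plt.show()

# seaborn: the grouping, colors, markers and legend are handled automatically
sns.scatterplot(data=tips, x='total_bill', y='tip', hue='smoker', style='smoker')
plt.show()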
[ ]: # load the iris dataset
iris = sns.load_dataset('iris')

[ ]: iris.head()
'''
There are three species in the iris dataset: Iris setosa, Iris virginica and Iris versicolor.
The idea here is to predict which species a particular iris flower belongs to,
based on its sepal_length, sepal_width, petal_length and petal_width.
This is the problem statement for this particular dataset.
'''


Scatter Plot
[ ]: sns.scatterplot(x='sepal_length',y='petal_length',hue='species',data=iris)

[ ]: <Axes: xlabel='sepal_length', ylabel='petal_length'>

[ ]: sns.scatterplot(x='sepal_length',y='petal_width',hue='species',data=iris)

[ ]: <Axes: xlabel='sepal_length', ylabel='petal_width'>

[ ]: # loading the titanic dataset
titanic = sns.load_dataset('titanic')
'''
The idea behind this dataset is to predict whether a person survived the Titanic based on these features:
we will try to build predictive models that predict survival outcomes from the passenger attributes.
These models are trained on historical data with known outcomes and then used to make predictions on new, unseen data.
'''

[ ]: titanic.head()

[ ]: survived pclass sex age sibsp parch fare embarked class \


0 0 3 male 22.0 1 0 7.2500 S Third
1 1 1 female 38.0 1 0 71.2833 C First
2 1 3 female 26.0 0 0 7.9250 S Third
3 1 1 female 35.0 1 0 53.1000 S First
4 0 3 male 35.0 0 0 8.0500 S Third

who adult_male deck embark_town alive alone
0 man True NaN Southampton no False
1 woman False C Cherbourg yes False
2 woman False NaN Southampton yes True
3 woman False C Southampton yes False
4 man True NaN Southampton no True

[ ]: titanic.shape

[ ]: (891, 15)

Count Plot
[ ]: sns.countplot(x='class',data=titanic)

[ ]: <Axes: xlabel='class', ylabel='count'>

[ ]: sns.countplot(x='survived',data=titanic)

[ ]: <Axes: xlabel='survived', ylabel='count'>

Bar Chart
[ ]: sns.barplot(x='sex',y='survived',hue='class',data=titanic)

[ ]: <Axes: xlabel='sex', ylabel='survived'>

[ ]: # house price dataset
from sklearn.datasets import fetch_california_housing
house_california = fetch_california_housing()

house = pd.DataFrame(house_california.data, columns=house_california.feature_names)

house['PRICE'] = house_california.target

[ ]: print(house_california)

{'data': array([[ 8.3252 , 41. , 6.98412698, …,


2.55555556,
37.88 , -122.23 ],
[ 8.3014 , 21. , 6.23813708, …, 2.10984183,
37.86 , -122.22 ],
[ 7.2574 , 52. , 8.28813559, …, 2.80225989,
37.85 , -122.24 ],
…,
[ 1.7 , 17. , 5.20554273, …, 2.3256351 ,
39.43 , -121.22 ],
[ 1.8672 , 18. , 5.32951289, …, 2.12320917,
39.43 , -121.32 ],
[ 2.3886 , 16. , 5.25471698, …, 2.61698113,
39.37 , -121.24 ]]), 'target': array([4.526, 3.585, 3.521,
…, 0.923, 0.847, 0.894]), 'frame': None, 'target_names': ['MedHouseVal'],
'feature_names': ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population',
'AveOccup', 'Latitude', 'Longitude'], 'DESCR': '..
_california_housing_dataset:\n\nCalifornia Housing
dataset\n--------------------------\n\n**Data Set Characteristics:**\n\n
:Number of Instances: 20640\n\n :Number of Attributes: 8 numeric, predictive
attributes and the target\n\n :Attribute Information:\n - MedInc
median income in block group\n - HouseAge median house age in block
group\n - AveRooms average number of rooms per household\n -
AveBedrms average number of bedrooms per household\n - Population
block group population\n - AveOccup average number of household
members\n - Latitude block group latitude\n - Longitude
block group longitude\n\n :Missing Attribute Values: None\n\nThis dataset was
obtained from the StatLib
repository.\nhttps://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html\n\nThe
target variable is the median house value for California districts,\nexpressed
in hundreds of thousands of dollars ($100,000).\n\nThis dataset was derived from
the 1990 U.S. census, using one row per census\nblock group. A block group is
the smallest geographical unit for which the U.S.\nCensus Bureau publishes
sample data (a block group typically has a population\nof 600 to 3,000
people).\n\nA household is a group of people residing within a home. Since the
average\nnumber of rooms and bedrooms in this dataset are provided per
household, these\ncolumns may take surprisingly large values for block groups
with few households\nand many empty houses, such as vacation resorts.\n\nIt can
be downloaded/loaded using
the\n:func:`sklearn.datasets.fetch_california_housing` function.\n\n.. topic::
References\n\n - Pace, R. Kelley and Ronald Barry, Sparse Spatial
Autoregressions,\n Statistics and Probability Letters, 33 (1997)
291-297\n'}

[ ]: house.head()

[ ]: MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude \


0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85

Longitude PRICE
0 -122.23 4.526
1 -122.22 3.585
2 -122.24 3.521
3 -122.25 3.413
4 -122.25 3.422

Distribution Plot
[ ]: sns.displot(house['PRICE'])

[ ]: <seaborn.axisgrid.FacetGrid at 0x79892e092410>

[ ]: sns.distplot(house['PRICE']) # this shows the distribution of the values and gives us a probability (density) curve

<ipython-input-20-2d26162c18b9>:1: UserWarning:

`distplot` is a deprecated function and will be removed in seaborn v0.14.0.

Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `histplot` (an axes-level function for histograms).

For a guide to updating your code to use the new functions, please see

https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751

sns.distplot(house['PRICE'])

[ ]: <Axes: xlabel='PRICE', ylabel='Density'>
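Since distplot is deprecated, here is a hedged sketch of the modern equivalent (assuming the imports and the house DataFrame from above) that still overlays a density curve:

[ ]: sns.histplot(house['PRICE'], kde=True) # histogram of PRICE with a kernel density estimate overlaid
plt.show()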

Correlation: - Positive Correlation - Negative Correlation


Heat Map (correlation matrix)

[ ]: correlation = house.corr()

[ ]: # constructing a Heat Map

plt.figure(figsize=(10,10))
sns.heatmap(correlation, cbar=True, square=True, fmt='.1f', annot=True, annot_kws={'size':8}, cmap='Blues')

'''
This code utilizes Seaborn's heatmap() function to create a heatmap visualization of a correlation matrix.

1. correlation: This parameter represents the correlation matrix that will be visualized as a heatmap.

2. cbar=True: This parameter specifies whether to include a color bar (or color legend) alongside the heatmap.

3. square=True: This parameter ensures that the aspect ratio of the heatmap cells is set to be square.

4. fmt='.1f': This parameter specifies the format of the values displayed on the heatmap.

5. annot=True: This parameter determines whether to annotate the cells of the heatmap with the numeric values.
   When set to True, each cell will display the corresponding value from the correlation matrix.

6. annot_kws={'size':8}: This parameter specifies additional keyword arguments for controlling the appearance of the annotations.
   In this case, it sets the font size of the annotations to 8 points.

7. cmap='Blues': This parameter sets the colormap used to color the heatmap.
   The 'Blues' colormap ranges from light to dark blue, with darker shades indicating higher values and lighter shades indicating lower values.
'''

[ ]: <Axes: >

The correlation matrix is very important because it tells us which columns are important for our prediction and which are not (see the sketch below).
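For example, one simple way to read feature relevance from this matrix is to sort each column's correlation with the target (a sketch, rebuilding the house DataFrame so it runs on its own):

[ ]: import pandas as pd
from sklearn.datasets import fetch_california_housing

data = fetch_california_housing()
house = pd.DataFrame(data.data, columns=data.feature_names)
house['PRICE'] = data.target

correlation = house.corr()
# rank the features by the absolute value of their correlation with the target
print(correlation['PRICE'].drop('PRICE').abs().sort_values(ascending=False))
# MedInc comes out on top (matching the 0.7 entry in the heatmap above);
# features with values near 0 carry little linear information about PRICE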
