Machine Learning Using scikit-learn

01 Machine Learning Project

Course - Introduction

Welcome to the course on scikit-learn.

In this course, you will understand all the practical aspects of fitting a Machine Learning Model.

You will learn:

The different steps involved in this process such as data acquisition, data transformation, data cleaning and model fitting.
How to perform each step using the Python scikit-learn package.

Introduction to Machine Learning

According to Arthur Samuel,

Machine learning is a field of computer science that gives computers the ability to learn without being explicitly
programmed.

A Machine learning project is typically classified into two categories, depending on its learning system.

Supervised Learning
Unsupervised Learning.

Supervised Learning


Here, we are illustrating a supervised learning approach: the model learns a mapping from input features to known target labels.

Unsupervised Learning


Here, we are illustrating an unsupervised learning approach: the model discovers structure, such as clusters, in unlabeled data without target labels.

Steps in a Machine Learning Project

A Machine Learning Project involves the following steps:

Defining the Problem :

Define a problem statement, which addresses a business problem.


Obtaining the Source Data :

The raw data required to build a model can come from one or more sources, such as relational databases and
social networking sites.

Understanding Data Through Visualization :

Explore the data and understand its important characteristics, such as its mean and spread.
Preparing Data for Machine Learning Algorithms :

Often, the captured raw data cannot be used directly to train a Machine Learning algorithm. The raw datasets have to be
manipulated or transformed through one or more pre-processing steps.

Choosing an algorithm :

Based on the features of the dataset, pick a suitable algorithm.


Building the Model :

Train the algorithm on the chosen training dataset and verify its performance through a metric.
Fine-tuning the Model :

Identify the values of key parameters of the chosen model that give better performance.
Use the best model :

Use the model with the best performance to address the defined problem.


02 Introduction to scikit-learn

Introduction to scikit-learn

scikit-learn is a Machine learning toolkit in Python . The package contains efficient tools used for Data Mining and
Data Analysis .

It is built on the NumPy, SciPy, and matplotlib packages. It is open source and commercially usable under the BSD license.

scikit-learn Utilities

scikit-learn library has many utilities that can be used to perform the following tasks involved in Machine Learning.

Preprocessing
Model Selection
Classification
Regression
Clustering
Dimensionality Reduction

Steps with scikit-learn

Typically, one performs the following steps while working on a Machine Learning problem with scikit-learn :

1. Cleaning the raw data set.
2. Further transforming it with the many scikit-learn pre-processing utilities.
3. Splitting data into train and test sets with the train_test_split utility.
4. Creating a suitable model with default parameters.
5. Training the Model using the fit function.
6. Evaluating the Model and fine-tuning it.
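
As a minimal sketch of this workflow (the iris dataset and the KNN classifier here are illustrative choices, not prescribed by the course):

import sklearn.datasets as datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# steps 1-3: load an already clean dataset and split it into train and test sets
iris = datasets.load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(iris.data, iris.target, random_state=42)

# steps 4-5: create a model with default parameters and train it with fit
model = KNeighborsClassifier()
model = model.fit(X_train, Y_train)

# step 6: evaluate the model; fine-tuning repeats this with different parameter values
print('Test accuracy :', model.score(X_test, Y_test))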

03 Gathering Data from Multiple Sources

About the Topic

In this topic, you will learn how scikit-learn library can be used to get public datasets.

You will also understand how this library simplifies the tasks required in fitting Machine Learning models.

Reading Data for ML

Any Machine Learning Algorithm requires data for building a model.

The data can be obtained from multiple sources such as HTTP or FTP repositories, databases, local repositories, etc.

Many times, raw data read from a source cannot be used directly by an ML algorithm for building a Model.

So, raw data always has to be cleaned, processed, and transformed (if required) before it is passed to an ML algorithm.

Example Data - Breast Cancer Dataset

The Breast Cancer data set is a popular one, which contains 30 features obtained from 569 cancer patients.

We will perform the following tasks to make the cancer data set ready for ML:

Reading raw data from the UCI archive
Extracting features from the raw data
Naming or labelling the features
Extracting target values from the raw data
Naming or labelling the target values

Reading Data from UCI Archive

The raw data set from UCI archive can be read with the following code snippet.

import pandas as pd


cancer_set = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data',
    header=None)
print(cancer_set.shape)

Output

(569, 32)

The raw dataset read in this way contains 32 columns.

The 1st column has patient ID details, and the 2nd one has the tumor type, i.e., malignant or benign .
The remaining 30 columns represent various features obtained from each patient.

Extracting Features from Raw Set

All columns representing features are extracted with the following code snippet.

cancer_features = cancer_set.iloc[:,2:]

print(cancer_features.shape)
print(type(cancer_features))

Output

(569, 30)
<class 'pandas.core.frame.DataFrame'>

cancer_features is a DataFrame . It is converted to a NumPy array with the code below.

cancer_features = cancer_features.values
print(type(cancer_features))
print(cancer_features.shape)

Output

<class 'numpy.ndarray'>
(569, 30)

Naming features

The 30 features associated with the cancer_features dataset are labeled with the following names.

cancer_features_names = ['mean radius',
'mean texture', 'mean perimeter',
'mean area', 'mean smoothness',
'mean compactness', 'mean concavity',
'mean concave points', 'mean symmetry',
'mean fractal dimension','radius error',
'texture error','perimeter error',
'area error', 'smoothness error',
'compactness error','concavity error',
'concave points error','symmetry error',
'fractal dimension error','worst radius',
'worst texture', 'worst perimeter',
'worst area','worst smoothness',
'worst compactness', 'worst concavity',
'worst concave points','worst symmetry',
'worst fractal dimension']

Extracting target values from Raw data

Target values of each patient are extracted with the code snippet below.


cancer_target = cancer_set.iloc[:, 1]

Replacing 'M' with 0 and 'B' with 1


cancer_target = cancer_target.replace(['M', 'B'], [0, 1])

Converting to numpy array


cancer_target = cancer_target.values

print(type(cancer_target))
print(cancer_target.shape)

Output

<class 'numpy.ndarray'>
(569,)

The cancer_features and cancer_target arrays obtained in this way can be used by an ML algorithm.

scikit-learn Datasets

scikit-learn, by default, comes with a few popular datasets.

They can be loaded into your working environment and used.


Reading Cancer Data from scikit-learn

Previously, you have read breast cancer data from UCI archive and derived cancer_features and cancer_target arrays.

The same processed data is available in scikit-learn . The below code snippet illustrates accessing features and target
arrays.

import sklearn.datasets as datasets

cancer = datasets.load_breast_cancer()

print(cancer.data.shape)
print(cancer.target.shape)

The multiple steps explained earlier are simplified using the above set of commands.

Output

(569, 30)
(569,)

For example, loading the iris dataset returns a Bunch object:

from sklearn import datasets

iris = datasets.load_iris()
type(iris)

sklearn.utils.Bunch
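
A Bunch is a dictionary-like object whose fields can also be accessed as attributes; a small sketch of its commonly used fields (shown for the iris dataset loaded above):

print(iris.keys())        # includes 'data', 'target', 'target_names', 'feature_names', ...
print(iris.data.shape)    # (150, 4) feature matrix
print(iris.target_names)  # ['setosa' 'versicolor' 'virginica']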

04 Preprocessing with scikit-learn

Preprocessing - Introduction

Preprocessing is a step in which raw data is modified or transformed into a format suitable for further downstream processing.

scikit-learn provides many preprocessing utilities such as,

Standardization (mean removal)


Scaling


Normalization
Binarization
One Hot Encoding
Label Encoding
Imputation

Standardization

Standardization or Mean Removal is the process of transforming each feature so that it has mean 0 and
variance 1.

This can be achieved using StandardScaler .


An example with its output is shown in the next two cards. It requires the following imports and the breast cancer dataset:

import sklearn.datasets as datasets
import sklearn.preprocessing as preprocessing

breast_cancer = datasets.load_breast_cancer()

Standardization - Example

standardizer = preprocessing.StandardScaler()
standardizer = standardizer.fit(breast_cancer.data)
breast_cancer_standardized = standardizer.transform(breast_cancer.data)

print('Mean of each feature after Standardization :\n\n')
print(breast_cancer_standardized.mean(axis=0))
print('\nStd. of each feature after Standardization :\n\n')
print(breast_cancer_standardized.std(axis=0))

Scaling

Scaling transforms existing data values to lie between a minimum and maximum value.

MinMaxScaler transforms data to range 0 and 1.

MaxAbsScaler transforms data to range -1 and 1.

Transforming the breast_cancer dataset through scaling is shown in the next three cards.

Using MinMaxScaler

Example for MinMaxScaler

min_max_scaler = preprocessing.MinMaxScaler().fit(breast_cancer.data)

breast_cancer_minmaxscaled = min_max_scaler.transform(breast_cancer.data)

By default, the transformation maps data to the range 0 to 1. The range can be customized with the feature_range argument, as shown in the next example.

MinMaxScaler with specified range

min_max_scaler = preprocessing.MinMaxScaler(feature_range=(0, 10)).fit(breast_cancer.data)

breast_cancer_minmaxscaled10 = min_max_scaler.transform(breast_cancer.data)

In the above example, data is transformed to the range 0 to 10.

Using MaxAbsScaler

Using MaxAbsScaler , the maximum absolute value of each feature is scaled to unit size, i.e., 1. It is meant for data that is already
centered at zero, or for sparse data.

Example for MaxAbsScaler

max_abs_scaler = preprocessing.MaxAbsScaler().fit(breast_cancer.data)

breast_cancer_maxabsscaled = max_abs_scaler.transform(breast_cancer.data)

By default, MaxAbsScaler transforms data to the range -1 and 1.

Normalization

Normalization scales each sample to have a unit norm.


Normalization can be achieved with 'l1', 'l2', and 'max' norms.
'l1' norm makes the sum of absolute values of each row as 1, and 'l2' norm makes the sum of squares of each row as 1.

The 'l1' norm is insensitive to outliers.

By default, the 'l2' norm is used. Hence, removing outliers is recommended before applying the 'l2' norm .

Normalization - Example

normalizer = preprocessing.Normalizer(norm='l1').fit(breast_cancer.data)

breast_cancer_normalized = normalizer.transform(breast_cancer.data)

In the above example, the 'l1' norm is specified with the norm parameter.

Binarization

Binarization is the process of transforming data points to 0 or 1 based on a given threshold.

Any value above the threshold is transformed to 1, and any value at or below the threshold is transformed to 0.
By default, a threshold of 0 is used.

Binarization - Example

binarizer = preprocessing.Binarizer(threshold=3.0).fit(breast_cancer.data)
breast_cancer_binarized = binarizer.transform(breast_cancer.data)
print(breast_cancer_binarized[:5,:5])

Output

[[ 1. 1. 1. 1. 0.]
[ 1. 1. 1. 1. 0.]
[ 1. 1. 1. 1. 0.]
[ 1. 1. 1. 1. 0.]
[ 1. 1. 1. 1. 0.]]

OneHotEncoder

OneHotEncoder converts categorical integer values into one-hot vectors. In a one-hot vector, every category is transformed into a
binary attribute having only 0 and 1 values.

An example creating two binary attributes for the categorical integers 1 and 2 is shown in the next cards.

OneHotEncoder - Example

onehotencoder = preprocessing.OneHotEncoder()
onehotencoder = onehotencoder.fit([[1], [1], [1], [2], [2], [1]])

Transforming category values 1 and 2 to one-hot vectors


print(onehotencoder.transform([[1]]).toarray())
print(onehotencoder.transform([[2]]).toarray())

Output

[[ 1. 0.]]
[[ 0. 1.]]

Imputation

Imputation replaces missing values with either the median, mean, or most common value of the column or row in which the missing
values exist.

The example below replaces missing values, represented by np.nan , with the mean of the respective column (axis 0).

Example

imputer = preprocessing.Imputer(missing_values='NaN', strategy='mean')


imputer = imputer.fit(breast_cancer.data)
breast_cancer_imputed = imputer.transform(breast_cancer.data)
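
Note: in recent scikit-learn releases the preprocessing.Imputer class has been removed. A minimal equivalent sketch, assuming scikit-learn 0.22 or later, uses SimpleImputer from sklearn.impute:

import numpy as np
import sklearn.datasets as datasets
from sklearn.impute import SimpleImputer

breast_cancer = datasets.load_breast_cancer()

# replaces np.nan entries with the mean of the corresponding column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
breast_cancer_imputed = imputer.fit_transform(breast_cancer.data)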

Label Encoding

Label Encoding is a step in which categorical features are represented as categorical integers. An example of transforming the
categorical values ['benign', 'malignant'] into [0, 1] is shown below.

Example

labels = ['malignant', 'benign', 'malignant', 'benign']

labelencoder = preprocessing.LabelEncoder()

labelencoder = labelencoder.fit(labels)

bc_labelencoded = labelencoder.transform(breast_cancer.target_names)
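
A small usage sketch of the fitted encoder; the printed values assume the alphabetical class ordering that LabelEncoder applies, i.e. 'benign' -> 0 and 'malignant' -> 1:

encoded = labelencoder.transform(labels)            # array([1, 0, 1, 0])
decoded = labelencoder.inverse_transform(encoded)   # back to the string labels
print(encoded, list(decoded))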


05 Preprocessing Exercises

Hands-On - Machine Learning Using Scikit-Learn | 1 | Preprocessing

Installation
Let's get the installations done, prior to creating tasks.

Run the command given below.

pip install --user numpy scipy scikit-learn

Task 1
Import two modules sklearn.datasets and sklearn.preprocessing .

Load popular iris data set from sklearn.datasets module and assign it to variable 'iris' .

Perform Normalization on iris.data with l2 norm and save the transformed data in variable iris_normalized .

Hint: Use Normalizer API.

Print the mean of every column using the below command. print(iris_normalized.mean(axis=0))

import sklearn.datasets as datasets


import sklearn.preprocessing as preprocessing
iris = datasets.load_iris()
normalizer = preprocessing.Normalizer(norm='l2').fit(iris.data)
iris_normalized = normalizer.transform(iris.data)
print(iris_normalized.mean(axis=0))

[0.75140029 0.40517418 0.45478362 0.14107142]


Task 2
Convert the categorical integer list iris.target into three binary attribute representation and store the result in variable
iris_target_onehot .

Hint: Use reshape(-1,1) on iris.target and OneHotEncoder .

Execute the following print statement print(iris_target_onehot.toarray()[[0,50,100]])

binarizer = preprocessing.Binarizer(threshold=3.0).fit(iris.target.reshape(-1,1))
iris_binarized = binarizer.transform(iris.target.reshape(-1,1))
print(iris_binarized)

onehotencoder = preprocessing.OneHotEncoder()
iris_target_onehot = onehotencoder.fit_transform(iris.target.reshape(-1,1))
print(iris_target_onehot.toarray()[[0,50,100]])

Task 3
Set first 50 row values of iris.data to Null values. Use numpy.nan

Perform Imputation on 'iris.data' and save the transformed data in variable 'iris_imputed' .

Hint : use Imputer API, Replace numpy.NaN values with mean of corresponding data.

Print the mean of every column using the below command. print(iris_imputed.mean(axis=0))

import numpy as np

iris.data[:50] = np.nan
imputer = preprocessing.Imputer(missing_values='NaN', strategy='mean')
imputer = imputer.fit(iris.data)
iris_imputed = imputer.transform(iris.data)
print(iris_imputed.mean(axis=0))

[6.262 2.872 4.906 1.676]


/usr/local/lib/python3.6/dist-packages/sklearn/utils/deprecation.py:66: DeprecationWarning: Class Imputer is deprecated
warnings.warn(msg, category=DeprecationWarning)

import sklearn.preprocessing as preprocessing

x = [[0, 0], [0, 1], [2,0]]


enc = preprocessing.OneHotEncoder()
print(enc.fit(x).transform([[1, 1]]).toarray())

[[0. 0. 0. 1.]]
/usr/local/lib/python3.6/dist-packages/sklearn/preprocessing/_encoders.py:415: FutureWarning: The handling of integer d
If you want the future behaviour and silence this warning, you can specify "categories='auto'".
In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use t
warnings.warn(msg, FutureWarning)

import sklearn.preprocessing as preprocessing

regions = ['HYD', 'CHN', 'MUM', 'HYD', 'KOL', 'CHN']


print(preprocessing.LabelEncoder().fit(regions).transform(regions))

[1 0 3 1 2 0]

06 Nearest Neighbors Technique

About the Topic

From this topic, you will understand how to implement various Machine Learning Algorithms using scikit-learn.

You will be learning some supervised and unsupervised learning algorithms.

Nearest Neighbors

The nearest neighbors method determines a predefined number of data points closest to a sample point and uses them to predict its
label.

sklearn.neighbors provides utilities for unsupervised and supervised neighbors-based learning methods.

scikit-learn implements two different nearest neighbors classifiers:

KNeighborsClassifier


RadiusNeighborsClassifier

Nearest Neighbor Classifiers

KNeighborsClassifier classifies based on k nearest neighbors of every query point, where k is an integer value specified by
the user.

RadiusNeighborsClassifier classifies based on the number of neighbors present in a fixed radius r of every training point.

Nearest Neighbors Regression

scikit-learn implements the following two regressors:

KNeighborsRegressor predicts based on the k nearest neighbors of each query point.


RadiusNeighborsRegressor predicts based on the neighbors present in a fixed radius r of the query point.
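
Neither regressor is demonstrated later in the course, so here is a minimal KNeighborsRegressor sketch; the diabetes dataset and the variable names are illustrative assumptions:

import sklearn.datasets as datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

diabetes = datasets.load_diabetes()
Xr_train, Xr_test, Yr_train, Yr_test = train_test_split(diabetes.data, diabetes.target, random_state=42)

# the prediction for a query point is the mean target value of its k nearest neighbors
knn_regressor = KNeighborsRegressor()
knn_regressor = knn_regressor.fit(Xr_train, Yr_train)
print('R-squared on Test Data :', knn_regressor.score(Xr_test, Yr_test))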

Demo of KNeighborsClassifier

The following code snippet illustrates importing required modules and loading cancer dataset.

import sklearn.datasets as datasets

from sklearn.model_selection import train_test_split

from sklearn.neighbors import KNeighborsClassifier

cancer = datasets.load_breast_cancer() # Loading the data set

Building a Model of KNN classifier

The following code creates training and test data sets, initializes a KNN classifier, and fits it with training data.

X_train, X_test, Y_train, Y_test = train_test_split(cancer.data, cancer.target,
                                                     stratify=cancer.target, random_state=42)

knn_classifier = KNeighborsClassifier()

knn_classifier = knn_classifier.fit(X_train, Y_train)

Determining Accuracy of the Model

The following code determines the accuracy of model on train and test data sets.

print('Accuracy of Train Data :', knn_classifier.score(X_train,Y_train))


print('Accuracy of Test Data :', knn_classifier.score(X_test,Y_test))

Output

Accuracy of Train Data : 0.946009389671


Accuracy of Test Data : 0.93006993007

Hands-On - KNN

Installation
Let's get the installations done, prior to creating tasks.

Run the command given below.

pip install --user numpy scipy scikit-learn

Task 1

Import two modules sklearn.datasets, and sklearn.model_selection.

Load popular iris data set from sklearn.datasets module and assign it to variable iris.

Split iris.data into two sets names X_train and X_test. Also, split iris.target into two sets Y_train and Y_test.

Hint: Use train_test_split method from sklearn.model_selection; set random_state to 30 and perform stratified sampling.

Print the shape of X_train dataset.

Print the shape of X_test dataset.

import sklearn.datasets as datasets


from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(iris.data, iris.target, stratify=iris.target, random_state=30)
print(X_train.shape)
print(X_test.shape)

(112, 4)
(38, 4)

Task 2
Import required module from sklearn.neighbors

Fit K nearest neighbors model on X_train data and Y_train labels, with default parameters. Name the model as knn_clf .

Evaluate the model accuracy on the training data set and print its score.

Evaluate the model accuracy on the testing data set and print its score.

from sklearn.neighbors import KNeighborsClassifier


knn_clf = KNeighborsClassifier()
knn_clf = knn_clf.fit(X_train, Y_train)
print('Accuracy of Train Data :', knn_clf.score(X_train,Y_train))
print('Accuracy of Test Data :', knn_clf.score(X_test,Y_test))

Accuracy of Train Data : 0.9821428571428571


Accuracy of Test Data : 0.9473684210526315

Task 3
Fit multiple K nearest neighbors models on X_train data and Y_train labels with n_neighbors parameter value changing from
3 to 10.

Evaluate each model accuracy on testing data set.

Hint: Make use of for loop

Print the n_neighbors value of the model with highest accuracy.

best_k, best_score = None, 0.0
for i in range(3,10):
    knn_clf = KNeighborsClassifier(n_neighbors=i)
    knn_clf = knn_clf.fit(X_train, Y_train)
    score = knn_clf.score(X_test, Y_test)
    print('Accuracy of Test Data :', score)
    if score > best_score:
        best_k, best_score = i, score

print(best_k)

Accuracy of Test Data : 0.9473684210526315


Accuracy of Test Data : 0.9473684210526315
Accuracy of Test Data : 0.9473684210526315
Accuracy of Test Data : 0.9736842105263158
Accuracy of Test Data : 0.9473684210526315
Accuracy of Test Data : 0.9473684210526315
Accuracy of Test Data : 0.9473684210526315
6

07 Decision Trees Technique

Decision Trees

Decision Trees are another Supervised Learning method used for Classification and Regression .

Decision Trees learn simple decision rules from training data and build a Model.


DecisionTreeClassifier and DecisionTreeRegressor are the two utilities from sklearn.tree , which can be used for
classification and regression respectively.
Advantages of Decision Trees

Advantages

Decision Trees are easy to understand.


They often do not require any preprocessing.
Decision Trees can learn from both numerical and categorical data.

Disadvantages of Decision Trees

Decision Trees sometimes become complex; such trees do not generalize well and lead to overfitting . Overfitting can be
addressed by setting the minimum number of samples required at a leaf node or by limiting the maximum depth of the tree.

A small variation in data can result in a completely different tree . This problem can be addressed by using decision
trees within an ensemble.

Building a Decision Tree Classifier Model

The subsequent code represents the building of a Decision Tree Classifier model.

Before executing this code, import the required modules, load the cancer dataset, and create the train and test data sets as
shown in the Nearest Neighbors classifier example.

from sklearn.tree import DecisionTreeClassifier

dt_classifier = DecisionTreeClassifier()

dt_classifier = dt_classifier.fit(X_train, Y_train)

Determining Accuracy of the Model

Further, the code below determines the model accuracy. You can observe that the model is overfitted .

print('Accuracy of Train Data :', dt_classifier.score(X_train,Y_train))

print('Accuracy of Test Data :', dt_classifier.score(X_test,Y_test))

Output

Accuracy of Train Data : 1.0


Accuracy of Test Data : 0.895104895105

Fine Tuning the Model

The model is further improved by changing the max_depth value to 2.

dt_classifier = DecisionTreeClassifier(max_depth=2)

dt_classifier = dt_classifier.fit(X_train, Y_train)

print('Accuracy of Train Data :', dt_classifier.score(X_train,Y_train))

print('Accuracy of Test Data :', dt_classifier.score(X_test,Y_test))

Output

Accuracy of Train Data : 0.946009389671


Accuracy of Test Data : 0.923076923077

Hands-On

Installation
Let's get the installations done, prior to creating tasks.

Run the command given below.

pip install --user numpy scipy scikit-learn

Task 1
Import two modules sklearn.datasets , and sklearn.model_selection .

Import numpy and set random seed to 100 .

Load popular Boston dataset from sklearn.datasets module and assign it to variable boston .

Split boston.data into two sets names X_train and X_test . Also, split boston.target into two sets Y_train and Y_test .

Hint: Use train_test_split method from sklearn.model_selection ; set random_state to 30 .

Print the shape of X_train dataset.

Print the shape of X_test dataset.

import sklearn.datasets as datasets


from sklearn.model_selection import train_test_split
import numpy as np
np.random.seed(100)

boston = datasets.load_boston()
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, random_state=30)
print(X_train.shape)
print(X_test.shape)

(379, 13)
(127, 13)

Task 2
Import required module from sklearn.tree .

Build a Decision tree Regressor model from X_train set and Y_train labels, with default parameters. Name the model as
dt_reg .

Evaluate the model accuracy on the training data set and print its score.

Evaluate the model accuracy on the testing data set and print its score.

Predict the housing price for the first two samples of the X_test set and print them. (Hint: Use the predict() function)

from sklearn.tree import DecisionTreeRegressor


dt_reg = DecisionTreeRegressor()
dt_reg = dt_reg.fit(X_train, Y_train)
print('Accuracy of Train Data :', dt_reg.score(X_train,Y_train))
print('Accuracy of Test Data :', dt_reg.score(X_test,Y_test))
print(dt_reg.predict(X_test[:1]))

Accuracy of Train Data : 1.0


Accuracy of Test Data : 0.8317506437122799
[18.2]

Task 3
Fit multiple Decision tree regressors on X_train data and Y_train labels with max_depth parameter value changing from 2 to
5.

Evaluate each model accuracy on testing data set.

Hint: Make use of for loop

Print the max_depth value of the model with highest accuracy.

best_depth, best_score = None, 0.0
for i in range(2,5):
    dt_reg = DecisionTreeRegressor(max_depth=i)
    dt_reg = dt_reg.fit(X_train, Y_train)
    score = dt_reg.score(X_test, Y_test)
    print('Accuracy of Test Data :', score)
    if score > best_score:
        best_depth, best_score = i, score

print(best_depth)

Accuracy of Test Data : 0.6876109752166819


Accuracy of Test Data : 0.6962264524668584
Accuracy of Test Data : 0.7646580016271207
4

08 Ensemble Methods

Ensemble Methods

Ensemble methods combine the predictions of several learning algorithms to improve generalization.

Ensemble methods are of two types:

Averaging Methods : They build several base estimators independently and finally average their predictions.

E.g.: Bagging Methods, Forests of randomised trees


Boosting Methods : They build base estimators sequentially and try to reduce the bias of the combined estimator.

E.g.: Adaboost, Gradient Tree Boosting

Bagging Methods

Bagging Methods draw random subsets of the original dataset, build an estimator on each subset, and aggregate the individual results to form a final prediction.

BaggingClassifier and BaggingRegressor are the utilities from sklearn.ensemble to deal with Bagging.
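
A minimal BaggingClassifier sketch, assuming the cancer train/test split created in the earlier Nearest Neighbors demo (the n_estimators value is illustrative):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# each tree is trained on a bootstrap sample drawn from the training set,
# and the individual predictions are aggregated by voting
bagging_classifier = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10, random_state=42)
bagging_classifier = bagging_classifier.fit(X_train, Y_train)
print('Accuracy of Test Data :', bagging_classifier.score(X_test, Y_test))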

Randomized Trees

sklearn.ensemble offers two types of algorithms based on randomized trees: Random Forests and extremely randomized (Extra-Trees) forests.

RandomForestClassifier and RandomForestRegressor classes are used to deal with random forests.
In random forests, each estimator is built from a sample drawn with replacement from the training set.

ExtraTreesClassifier and ExtraTreesRegressor classes are used to deal with extremely randomized forests.

In extremely randomized forests, more randomness is introduced, which further reduces the variance of the model.

Boosting Methods

Boosting Methods combine several weak models to create an improved ensemble.

sklearn.ensemble also provides the following boosting algorithms:

AdaBoostClassifier
GradientBoostingClassifier
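
A minimal AdaBoostClassifier sketch, again assuming the cancer train/test split from the earlier demos (the n_estimators value is illustrative):

from sklearn.ensemble import AdaBoostClassifier

# each successive weak learner gives more weight to the samples the previous ones misclassified
ada_classifier = AdaBoostClassifier(n_estimators=50, random_state=42)
ada_classifier = ada_classifier.fit(X_train, Y_train)
print('Accuracy of Test Data :', ada_classifier.score(X_test, Y_test))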

Demo of Random Forest Classifier

Example of creating a Random forest model is shown below.

from sklearn.ensemble import RandomForestClassifier

rf_classifier = RandomForestClassifier()

rf_classifier = rf_classifier.fit(X_train, Y_train)

print('Accuracy of Train Data :', rf_classifier.score(X_train,Y_train))

print('Accuracy of Test Data :', rf_classifier.score(X_test,Y_test))

Output

Accuracy of Train Data: 0.995305164319


Accuracy of Test Data : 0.951048951049

Hands-On - Ensemble Methods



Installation
Let's get the installations done, prior to creating tasks.

Run the command given below.

pip install --user numpy scipy scikit-learn

Task 1
Import two modules sklearn.datasets , and sklearn.model_selection .

Import numpy and set random seed to 100

Load popular Boston dataset from sklearn.datasets module and assign it to variable boston .

Split boston.data into two sets names X_train and X_test . Also, split boston.target into two sets Y_train and Y_test .

Hint: Use train_test_split method from sklearn.model_selection ; set random_state to 30 .

Print the shape of X_train dataset.

Print the shape of X_test dataset.

import sklearn.datasets as datasets


from sklearn.model_selection import train_test_split
import numpy as np
np.random.seed(100)

boston = datasets.load_boston()
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, random_state=30)
print(X_train.shape)
print(X_test.shape)

(379, 13)
(127, 13)

Task 2
Import required module from sklearn.ensemble .

Build a Random Forest Regressor model from X_train set and Y_train labels, with default parameters. Name the model as
rf_reg .

Evaluate the model accuracy on the training data set and print its score.

Evaluate the model accuracy on the testing data set and print its score.

Predict the housing price for the first two samples of the X_test set and print them.

from sklearn.ensemble import RandomForestRegressor


rf_reg = RandomForestRegressor()
rf_reg = rf_reg.fit(X_train, Y_train)
print('Accuracy of Train Data :', rf_reg.score(X_train,Y_train))
print('Accuracy of Test Data :', rf_reg.score(X_test,Y_test))
print(rf_reg.predict(X_test[:2]))

Accuracy of Train Data : 0.9800942972322105


Accuracy of Test Data : 0.8959231849959377
[19.302 9.397]

Task 3
Build multiple Random forest regressor on X_train set and Y_train labels with max_depth parameter value changing from 3 to
5 and also setting n_estimators to one of 50, 100, 200 values.

Evaluate each model accuracy on testing data set.

Hint: Make use of for loop

Print the max_depth and n_estimators values of the model with highest accuracy.

Note: Print the parameter values in the form of tuple (a, b) . a refers to max_depth value and b refers to n_estimators

all_scores = {}

n = 100

for m in range(3,6):
    rf_reg = RandomForestRegressor(n_estimators=n, max_depth=m)
    rf_reg = rf_reg.fit(X_train, Y_train)
    print(m, n, rf_reg.score(X_test,Y_test))
    all_scores[(m,n)] = rf_reg.score(X_test,Y_test)

max_score = max(all_scores, key=all_scores.get)


print(max_score)

09 Support Vector Machines Technique

Understanding SVM

Support Vector Machines (SVMs) separate data points using decision planes, which separate objects belonging to different
classes in a higher-dimensional space.

The SVM algorithm uses the most suitable kernel, which is capable of separating the data points into two or more classes.

Commonly used kernels are:

linear
polynomial
rbf
sigmoid

Support Vector Classification

scikit-learn provides the following three utilities for performing Support Vector Classification.

SVC
NuSVC : Same as SVC but uses a parameter to control the number of support vectors.
LinearSVC : Similar to SVC with parameter kernel taking linear value.

Support Vector Regression

scikit-learn provides the following three utilities for performing Support Vector Regression.

SVR
NuSVR
LinearSVR
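
Support Vector Regression is not demonstrated later, so here is a minimal SVR sketch; the diabetes dataset, the rbf kernel, and the variable names are illustrative assumptions:

import sklearn.datasets as datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

diabetes = datasets.load_diabetes()
Xr_train, Xr_test, Yr_train, Yr_test = train_test_split(diabetes.data, diabetes.target, random_state=42)

svm_regressor = SVR(kernel='rbf')
svm_regressor = svm_regressor.fit(Xr_train, Yr_train)
print('R-squared on Test Data :', svm_regressor.score(Xr_test, Yr_test))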

Advantages of SVMs

SVM can distinguish the classes in a higher dimensional space.

SVM algorithms are memory efficient.

SVMs are versatile: different kernel functions can be used for the decision function.

Disadvantages of SVMs

SVMs do not scale well to datasets with a very large number of samples.

SVMs work well only with preprocessed (scaled) data.

They are harder to visualize.

Demo of Support Vector Classification

An example of creating an SVM classifier is shown below.

The shown model overfits the training data.

from sklearn.svm import SVC

svm_classifier = SVC()

svm_classifier = svm_classifier.fit(X_train, Y_train)

print('Accuracy of Train Data :', svm_classifier.score(X_train,Y_train))


print('Accuracy of Test Data :', svm_classifier.score(X_test,Y_test))

Output

Accuracy of Train Data : 1.0


Accuracy of Test Data : 0.629370629371

Improving Accuracy Using Scaled Data

In the following example, scaled input data is used to improve the accuracy of SVM classifier.

import sklearn.preprocessing as preprocessing
from sklearn.model_selection import train_test_split

standardizer = preprocessing.StandardScaler()
standardizer = standardizer.fit(cancer.data)
cancer_standardized = standardizer.transform(cancer.data)

# re-split using the standardized data so the classifier is trained on scaled inputs
X_train, X_test, Y_train, Y_test = train_test_split(cancer_standardized, cancer.target,
                                                    stratify=cancer.target, random_state=42)

svm_classifier = SVC()

svm_classifier = svm_classifier.fit(X_train, Y_train)

Determining Accuracy of New Model

print('Accuracy of Train Data :', svm_classifier.score(X_train,Y_train))

print('Accuracy of Test Data :', svm_classifier.score(X_test,Y_test))

Output

Accuracy of Train Data : 0.992957746479


Accuracy of Test Data : 0.979020979021

Viewing the Classification Report

from sklearn import metrics

Y_pred = svm_classifier.predict(X_test)

print('Classification report : \n',metrics.classification_report(Y_test, Y_pred))

Output

Classification report :
              precision    recall  f1-score   support

          0       0.96      0.98      0.97        53
          1       0.99      0.98      0.98        90

avg / total       0.98      0.98      0.98       143

Hands-On - SVM

Installation
Let's get the installations done, prior to creating tasks.

Run the command given below.


pip install --user numpy scipy scikit-learn

Task 1
Import two modules sklearn.datasets , and sklearn.model_selection .

Load popular digits dataset from sklearn.datasets module and assign it to variable digits .

Split digits.data into two sets names X_train and X_test . Also, split digits.target into two sets Y_train and Y_test .

Hint: Use train_test_split method from sklearn.model_selection ; set random_state to 30 ; and perform stratified sampling.

Print the shape of X_train dataset.

Print the shape of X_test dataset.

import sklearn.datasets as datasets


from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X_train, X_test, Y_train, Y_test = train_test_split(digits.data, digits.target, stratify=digits.target, random_state=30)
print(X_train.shape)
print(X_test.shape)

(1347, 64)
(450, 64)

Task 2
Import required module from sklearn.svm .

Build an SVM classifier from X_train set and Y_train labels, with default parameters. Name the model as svm_clf .

Evaluate the model accuracy on the testing data set and print its score.

from sklearn.svm import SVC

svm_clf = SVC()
svm_clf = svm_clf.fit(X_train, Y_train)
print('Accuracy of Test Data :', svm_clf.score(X_test,Y_test))

/usr/local/lib/python3.6/dist-packages/sklearn/svm/base.py:193: FutureWarning: The default value of gamma will change f


"avoid this warning.", FutureWarning)
Accuracy of Test Data : 0.6022222222222222

Task 3
Perform Standardization of digits.data and store the transformed data in variable digits_standardized .

Hint : Use required utility from sklearn.preprocessing .

Once again, split digits_standardized into two sets names X_train and X_test . Also, split digits.target into two sets
Y_train and Y_test .

Hint: Use train_test_split method from sklearn.model_selection ; set random_state to 30 ; and perform stratified sampling.

Build another SVM classifier from X_train set and Y_train labels, with default parameters. Name the model as svm_clf2 .

Evaluate the model accuracy on the testing data set and print its score.

import sklearn.preprocessing as preprocessing

standardizer = preprocessing.StandardScaler()
standardizer = standardizer.fit(digits.data)
digits_standardized = standardizer.transform(digits.data)

X_train, X_test, Y_train, Y_test = train_test_split(digits_standardized, digits.target, stratify=digits.target, random_state=30)

svm_clf2 = SVC()

svm_clf2 = svm_clf2.fit(X_train, Y_train)

print('Accuracy of Test Data :', svm_clf2.score(X_test,Y_test))

Accuracy of Test Data : 0.9733333333333334


/usr/local/lib/python3.6/dist-packages/sklearn/svm/base.py:193: FutureWarning: The default value of gamma will change f

"avoid this warning.", FutureWarning)

10 Clustering Technique

Introduction to Clustering

Clustering is one of the unsupervised learning techniques.

The technique is typically used to group data points into clusters based on a specific algorithm.

Major clustering algorithms that can be implemented using scikit-learn are:

K-means Clustering
Agglomerative clustering
DBSCAN clustering
Mean-shift clustering
Affinity propagation
Spectral clustering

K-Means Clustering

In K-means clustering, the entire data set is grouped into k clusters.

Steps involved are:

k centroids are chosen randomly.


The distance of each data point from k centroids is calculated. A data point is assigned to the nearest cluster.
Centroids of k clusters are recomputed.
The above steps are iterated till the cluster assignments of the data points reach convergence.

KMeans from sklearn.cluster can be used for K-means clustering.

Agglomerative Hierarchical Clustering

Agglomerative Hierarchical Clustering is a bottom-up approach.

Steps involved are:

Each data point is treated as a single cluster at the beginning.

The distance between each cluster is computed, and the two nearest clusters are merged together.

The above step is iterated till a single cluster is formed.

AgglomerativeClustering from sklearn.cluster can be used for achieving this.

Merging of two clusters can use any of the following linkage types: ward , complete or average .
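
A minimal sketch on the iris data showing the linkage parameter; the choice of 'average' linkage here is illustrative:

import sklearn.datasets as datasets
from sklearn.cluster import AgglomerativeClustering

iris = datasets.load_iris()
agg_clustering = AgglomerativeClustering(n_clusters=3, linkage='average')
labels = agg_clustering.fit_predict(iris.data)
print(labels[:10])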

Density Based Clustering

Now let's understand how density-based clustering is performed. DBSCAN from sklearn.cluster is used for this purpose.
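
A minimal DBSCAN sketch on synthetic blob data; the eps and min_samples values are illustrative assumptions:

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

# points with at least min_samples neighbours within distance eps form dense regions
dbscan = DBSCAN(eps=0.9, min_samples=5)
labels = dbscan.fit_predict(X)
print(len(set(labels) - {-1}), 'clusters found; label -1 marks noise points')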

Mean Shift Clustering

Mean Shift Clustering aims at discovering dense areas.

Steps Involved:

Identify blob areas with randomly guessed centroids.


Calculate the centroid of each blob area and shift to a new one, if there is a difference.
Repeat the above step till the centroids converge.

make_blobs from sklearn.datasets can be used to generate sample blob data. MeanShift from sklearn.cluster can be used to
perform Mean Shift clustering.
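
A minimal Mean Shift sketch; the blob parameters below are illustrative assumptions:

from sklearn.cluster import MeanShift
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)
mean_shift = MeanShift()
labels = mean_shift.fit_predict(X)
print(len(set(labels)), 'clusters found')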

Affinity Propagation

Affinity Propagation generates clusters by passing messages between pairs of data points, until convergence.

AffinityPropagation class from sklearn.cluster can be used.

The above class can be controlled with two major parameters:

preference : It controls the number of exemplars to be chosen by the algorithm.


damping : It controls numerical oscillations while updating messages.
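
A minimal sketch on the iris data; the damping value is an illustrative assumption:

import sklearn.datasets as datasets
from sklearn.cluster import AffinityPropagation

iris = datasets.load_iris()
af = AffinityPropagation(damping=0.9)
labels = af.fit_predict(iris.data)
print(len(af.cluster_centers_indices_), 'exemplars chosen')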


Spectral Clustering

Spectral Clustering is ideal for clustering data that is connected but may not lie in a compact space.

In general, the following steps are followed:

Build an affinity matrix of data points.


Embed data points in a lower dimensional space.
Use a clustering method like k-means to partition the points on lower dimensional space.

spectral_clustering from sklearn.cluster can be used for achieving this.
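
A minimal sketch using the SpectralClustering estimator class (the functional spectral_clustering interface expects a precomputed affinity matrix instead); the blob data is illustrative:

from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)
spectral = SpectralClustering(n_clusters=3, random_state=42)
labels = spectral.fit_predict(X)
print(labels[:10])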

Demo of KMeans

An example of performing KMeans clustering is shown below

from sklearn.cluster import KMeans

kmeans_cluster = KMeans(n_clusters=2)

kmeans_cluster = kmeans_cluster.fit(X_train)

kmeans_cluster.predict(X_test)

Output

array([0, 1, 0, ... , 0, 0, 0])

Evaluating a Clustering algorithm

A clustering algorithm is mainly evaluated using the following scores:

Homogeneity : Evaluates if each cluster contains only members of a single class.

Completeness : All members of a given class are assigned to the same cluster.

V-measure : Harmonic mean of Homogeneity and Completeness .

Adjusted Rand index : Measures similarity of two assignments.

Evaluation with scikit-learn

from sklearn import metrics

print(metrics.homogeneity_score(kmeans_cluster.predict(X_test), Y_test))

print(metrics.completeness_score(kmeans_cluster.predict(X_test), Y_test))

print(metrics.v_measure_score(kmeans_cluster.predict(X_test), Y_test))

print(metrics.adjusted_rand_score(kmeans_cluster.predict(X_test), Y_test))

Output

0.573236466834
0.483862796607
0.524771531969
0.54983994112

Hands-On - Clustering

Installation


Let's get the installations done, prior to creating tasks.

Run the command given below.

pip install --user numpy scipy scikit-learn

Task 1
Import three modules sklearn.datasets , sklearn.cluster , and sklearn.metrics .

Load popular iris dataset from sklearn.datasets module and assign it to variable iris .

Cluster iris.data set into 3 clusters using K-means with default parameters. Name the model as km_cls .

Hint : Import required utility from sklearn.cluster

Determine the homogeneity score of the model and print it.

Hint : Import required utility from sklearn.metrics

import sklearn.datasets as datasets


from sklearn.cluster import KMeans
from sklearn import metrics
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

X_train, X_test, Y_train, Y_test = train_test_split(iris.data, iris.target)

km_cls = KMeans(n_clusters=3)

km_cls = km_cls.fit(X_train)

print(metrics.homogeneity_score(km_cls.predict(X_test), Y_test))

0.7519188912745023

Task 2
Cluster iris.data set into 3 clusters using Agglomerative clustering. Name the model as agg_cls .

Hint : Import required utility from sklearn.cluster

Determine the homogeneity score of the model and print it.

Hint : Import required utility from sklearn.metrics

from sklearn.cluster import AgglomerativeClustering


agg_cls = AgglomerativeClustering(n_clusters=3)
print(metrics.homogeneity_score(agg_cls.fit_predict(X_test), Y_test))

0.829701532311426

Task 3
Cluster iris.data set using Affinity Propagation clustering method with default parameters. Name the model as af_cls .

Hint : Import required utility from sklearn.cluster

Determine the homogeneity score of the model and print it.

Hint : Import required utility from sklearn.metrics

from sklearn.cluster import AffinityPropagation


af_cls = AffinityPropagation()
print(metrics.homogeneity_score(af_cls.fit_predict(X_test), Y_test))

11 Course Summary

Scikit Learn Course Summary

In this course, you have studied the following,

Got introduced to scikit-learn .


Used scikit-learn datasets for learning Machine Learning concepts.

Used scikit-learn to build models based on Supervised Learning techniques like Nearest Neighbors , Decision Trees , Random
Forests , and SVMs .

Used scikit-learn to build models based on Unsupervised Learning techniques such as K-means , Agglomerative clustering ,
and Density-based clustering .

import sklearn.preprocessing as preprocessing

x = [[7.8], [1.3], [4.5], [0.9]]


print(preprocessing.Binarizer().fit(x).transform(x))

[[1.]
[1.]
[1.]
[1.]]
