
Scikit-Learn Cheat Sheet

Comprehensive Guide to Scikit-Learn Functions and Features


This notebook provides a comprehensive guide to various scikit-learn functions with examples.

Cheat Sheet Table of Scikit-Learn Functions


| Feature/Function | Brief Explanation | Example |
| --- | --- | --- |
| train_test_split | Splitting data into training and testing sets | Splitting the Iris dataset into training and testing sets |
| StandardScaler | Standardizing features by removing the mean and scaling to unit variance | Standardizing the Iris dataset |
| MinMaxScaler | Transforming features by scaling each feature to a given range | Scaling the Iris dataset to range [0, 1] |
| LabelEncoder | Encoding target labels with value between 0 and n_classes-1 | Encoding the target variable in the Iris dataset |
| OneHotEncoder | Encoding categorical features as a one-hot numeric array | One-hot encoding categorical features in a sample dataset |
| PolynomialFeatures | Generating polynomial and interaction features | Generating polynomial features for the California housing dataset |
| LinearRegression | Linear regression model | Predicting house prices using Linear Regression |
| LogisticRegression | Logistic regression model | Classifying digits using Logistic Regression |
| Ridge | Ridge regression model | Predicting house prices using Ridge Regression |
| Lasso | Lasso regression model | Predicting house prices using Lasso Regression |
| SVC | Support Vector Classification | Classifying Iris species using Support Vector Classification |
| RandomForestClassifier | Random Forest Classification | Classifying Iris species using Random Forest Classification |
| GradientBoostingClassifier | Gradient Boosting Classification | Classifying Iris species using Gradient Boosting Classification |
| KNeighborsClassifier | K-Nearest Neighbors Classification | Classifying Iris species using K-Nearest Neighbors Classification |
| DecisionTreeClassifier | Decision Tree Classification | Classifying Iris species using Decision Tree Classification |
| accuracy_score | Computing the accuracy score | Calculating the accuracy score for a classification model |
| confusion_matrix | Computing the confusion matrix | Generating a confusion matrix for a classification model |
| classification_report | Building a text report of classification metrics | Generating a classification report for a classification model |
| mean_squared_error | Mean squared error | Calculating the mean squared error for a regression model |
| r2_score | R^2 (coefficient of determination) regression score function | Calculating the R^2 score for a regression model |
| PCA | Principal Component Analysis | Reducing dimensionality of the Iris dataset using PCA |
| cross_val_score | Evaluate a score by cross-validation | Performing cross-validation on the Iris dataset |
| GridSearchCV | Exhaustive search over specified parameter values for an estimator | Performing grid search to find the best hyperparameters for a classification model |
| Pipeline | Pipeline of transforms with a final estimator | Creating a pipeline for scaling and classifying the Iris dataset |

Introduction
Scikit-Learn is a powerful Python library for machine learning, providing a wide range of tools for data analysis and modeling. This guide will walk you through the
key functions and features of Scikit-Learn with practical examples.

In [ ]: # Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import load_iris, fetch_california_housing, load_digits
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder, PolynomialFeatures
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, mean_squared_error, r2_score
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

train_test_split
Splitting data into training and testing sets.
In [ ]: # train_test_split
# Parameters: X (array-like), y (array-like), test_size (float), random_state (int)
# Returns: X_train, X_test, y_train, y_test (arrays)

# Example: Splitting the Iris dataset into training and testing sets
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(f'X_train shape: {X_train.shape}, X_test shape: {X_test.shape}')
print(f'y_train shape: {y_train.shape}, y_test shape: {y_test.shape}')

X_train shape: (105, 4), X_test shape: (45, 4)
y_train shape: (105,), y_test shape: (45,)
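
For classification data it is often worth preserving the class distribution across the split. A minimal sketch, reusing the same X and y as above, with the stratify argument (the variable names here are illustrative):

In [ ]: # Stratified split keeps class proportions equal in train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
print(f'Train class counts: {np.bincount(y_tr)}')
print(f'Test class counts: {np.bincount(y_te)}')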

StandardScaler
Standardizing features by removing the mean and scaling to unit variance.

In [ ]: # StandardScaler
# Key parameters: with_mean (bool), with_std (bool)
# Returns: fitted scaler; fit_transform returns the scaled data (array)

# Example: Standardizing the Iris dataset


scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(f'Mean: {X_scaled.mean(axis=0)}, Std Dev: {X_scaled.std(axis=0)}')

Mean: [-1.69031455e-15 -1.84297022e-15 -1.69864123e-15 -1.40924309e-15], Std Dev: [1. 1. 1. 1.]
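
Note that fitting the scaler on the full dataset, as above, leaks test-set statistics into training. A safer sketch, reusing the X_train/X_test split from earlier:

In [ ]: # Fit scaling statistics on the training set only, then apply them to the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # transform only: no fitting, so no leakage
print(f'Train means (should be ~0): {X_train_scaled.mean(axis=0).round(2)}')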

MinMaxScaler
Transforming features by scaling each feature to a given range.

In [ ]: # MinMaxScaler
# Key parameters: feature_range (tuple, default (0, 1))
# Returns: fitted scaler; fit_transform returns the scaled data (array)

# Example: Scaling the Iris dataset to range [0, 1]


min_max_scaler = MinMaxScaler()
X_min_max_scaled = min_max_scaler.fit_transform(X)
print(f'Min: {X_min_max_scaled.min(axis=0)}, Max: {X_min_max_scaled.max(axis=0)}')

Min: [0. 0. 0. 0.], Max: [1. 1. 1. 1.]

LabelEncoder
Encoding target labels with value between 0 and n_classes-1.

In [ ]: # LabelEncoder
# Parameters: None
# Returns: Encoded labels (array)

# Example: Encoding the target variable in the Iris dataset


label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)
print(f'Classes: {label_encoder.classes_}')
print(f'Encoded labels: {y_encoded[:10]}')

Classes: [0 1 2]
Encoded labels: [0 0 0 0 0 0 0 0 0 0]
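
Because the Iris target is already numeric, the encoding above is an identity mapping. A sketch with string labels shows the typical use, including inverse_transform; the species array below is constructed just for illustration:

In [ ]: # Encode string species names, then map encoded values back
species = iris.target_names[y]  # strings like 'setosa', 'versicolor', 'virginica'
le = LabelEncoder()
species_encoded = le.fit_transform(species)
print(f'Classes: {le.classes_}')
print(f'Round trip: {le.inverse_transform(species_encoded[:3])}')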

OneHotEncoder
Encoding categorical features as a one-hot numeric array.

In [ ]: # OneHotEncoder
# Key parameters: categories, handle_unknown (string)
# Returns: fitted encoder; fit_transform returns a one-hot encoded (sparse) array

# Example: One-hot encoding categorical features in a sample dataset


data = pd.DataFrame({
'feature': ['A', 'B', 'C', 'A', 'B', 'C']
})
one_hot_encoder = OneHotEncoder()
one_hot_encoded = one_hot_encoder.fit_transform(data)
print(f'One-hot encoded array:\n{one_hot_encoded}')
print(f'Categories: {one_hot_encoder.categories_}')

One-hot encoded array:
(0, 0) 1.0
(1, 1) 1.0
(2, 2) 1.0
(3, 0) 1.0
(4, 1) 1.0
(5, 2) 1.0
Categories: [array(['A', 'B', 'C'], dtype=object)]
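
The output above is a SciPy sparse matrix, which is why it prints as (row, column) coordinate pairs. A sketch converting it to a dense array; handle_unknown='ignore' is an optional safeguard for categories unseen during fit:

In [ ]: # Dense view of the one-hot encoding
dense_encoder = OneHotEncoder(handle_unknown='ignore')
dense = dense_encoder.fit_transform(data).toarray()
print(dense)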

PolynomialFeatures
Generating polynomial and interaction features.
In [ ]: # PolynomialFeatures
# Parameters: degree (int)
# Returns: Polynomial features (array)

# Example: Generating polynomial features for the California housing dataset


housing = fetch_california_housing()
X_housing = housing.data
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X_housing)
print(f'Original shape: {X_housing.shape}, Polynomial shape: {X_poly.shape}')

Original shape: (20640, 8), Polynomial shape: (20640, 45)
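
The 45 columns break down as 1 bias term, 8 original features, 8 squared terms, and 28 pairwise interactions (8 choose 2). A quick check of that arithmetic:

In [ ]: # Verify the degree-2 feature count: 1 + 8 + 8 + C(8, 2) = 45
from math import comb
n_features = X_housing.shape[1]
print(1 + n_features + n_features + comb(n_features, 2))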

LinearRegression
Linear regression model.

In [ ]: # LinearRegression
# Key parameters: fit_intercept (bool)
# Returns: Fitted model

# Example: Predicting house prices using Linear Regression


X_train_housing, X_test_housing, y_train_housing, y_test_housing = train_test_split(X_housing, housing.target, test_size=0.3, random_state=42)
lr = LinearRegression()
lr.fit(X_train_housing, y_train_housing)
y_pred_housing = lr.predict(X_test_housing)
print(f'Mean squared error: {mean_squared_error(y_test_housing, y_pred_housing)}')
print(f'R^2 score: {r2_score(y_test_housing, y_pred_housing)}')

Mean squared error: 0.5305677824766757
R^2 score: 0.595770232606166
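
The fitted coefficients can be inspected directly; pairing them with the dataset's feature names makes the model easier to read:

In [ ]: # Inspect the learned linear coefficients
for name, coef in zip(housing.feature_names, lr.coef_):
    print(f'{name}: {coef:.4f}')
print(f'Intercept: {lr.intercept_:.4f}')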

Linear Regression Model Performance


In [ ]: # Plotting Linear Regression Model Performance
plt.figure(figsize=(8, 4))
plt.scatter(y_test_housing, y_pred_housing, color='blue', edgecolor='k', alpha=0.6)
plt.plot([min(y_test_housing), max(y_test_housing)], [min(y_test_housing), max(y_test_housing)], color='red', linewidth=2)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Linear Regression Model Performance')
plt.show()

LogisticRegression
Logistic regression model.

In [ ]: # LogisticRegression
# Key parameters: max_iter (int), C (float)
# Returns: Fitted model

# Example: Classifying digits using Logistic Regression

digits = load_digits()
X_digits, y_digits = digits.data, digits.target
X_train_digits, X_test_digits, y_train_digits, y_test_digits = train_test_split(X_digits, y_digits, test_size=0.3, random_state=42)
log_reg = LogisticRegression(max_iter=1000)  # distinct name keeps the LinearRegression model lr available
log_reg.fit(X_train_digits, y_train_digits)
y_pred_digits = log_reg.predict(X_test_digits)
print(f'Accuracy: {accuracy_score(y_test_digits, y_pred_digits)}')

Accuracy: 0.9685185185185186
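
Beyond hard labels, the fitted model exposes per-class probabilities via predict_proba, which is often more informative than predictions alone:

In [ ]: # Class probabilities for the first three test digits
proba = log_reg.predict_proba(X_test_digits[:3])
print(proba.round(3))
print(f'Predicted classes: {proba.argmax(axis=1)}')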

Logistic Regression Model Performance


In [ ]: # Plotting Logistic Regression Model Performance
conf_matrix = confusion_matrix(y_test_digits, y_pred_digits)
plt.figure(figsize=(8, 4))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Logistic Regression Confusion Matrix')
plt.show()

Ridge
Ridge regression model.

In [ ]: # Ridge
# Parameters: alpha (float)
# Returns: Fitted model

# Example: Predicting house prices using Ridge Regression


ridge = Ridge(alpha=1.0)
ridge.fit(X_train_housing, y_train_housing)
y_pred_ridge = ridge.predict(X_test_housing)
print(f'Mean squared error: {mean_squared_error(y_test_housing, y_pred_ridge)}')
print(f'R^2 score: {r2_score(y_test_housing, y_pred_ridge)}')

Mean squared error: 0.5305052690933701
R^2 score: 0.5958178603951635
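
The alpha parameter controls how strongly the coefficients are shrunk toward zero. A sketch comparing coefficient norms at two alpha values; alpha=100.0 is an arbitrary illustrative choice:

In [ ]: # Stronger regularization shrinks the L2 norm of the coefficients
ridge_strong = Ridge(alpha=100.0)
ridge_strong.fit(X_train_housing, y_train_housing)
print(f'||coef|| at alpha=1.0:   {np.linalg.norm(ridge.coef_):.4f}')
print(f'||coef|| at alpha=100.0: {np.linalg.norm(ridge_strong.coef_):.4f}')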

Ridge Regression Model Performance


In [ ]: # Plotting Ridge Regression Model Performance
plt.figure(figsize=(8, 4))
plt.scatter(y_test_housing, y_pred_ridge, color='blue', edgecolor='k', alpha=0.6)
plt.plot([min(y_test_housing), max(y_test_housing)], [min(y_test_housing), max(y_test_housing)], color='red', linewidth=2)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Ridge Regression Model Performance')
plt.show()

Lasso
Lasso regression model.

In [ ]: # Lasso
# Parameters: alpha (float)
# Returns: Fitted model
# Example: Predicting house prices using Lasso Regression
lasso = Lasso(alpha=0.1)
lasso.fit(X_train_housing, y_train_housing)
y_pred_lasso = lasso.predict(X_test_housing)
print(f'Mean squared error: {mean_squared_error(y_test_housing, y_pred_lasso)}')
print(f'R^2 score: {r2_score(y_test_housing, y_pred_lasso)}')

Mean squared error: 0.5970512258509186
R^2 score: 0.545117728367666
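
Unlike Ridge, Lasso can drive coefficients exactly to zero, which acts as built-in feature selection. A quick check on the model fitted above:

In [ ]: # Count how many coefficients Lasso zeroed out
n_zero = np.sum(lasso.coef_ == 0)
print(f'{n_zero} of {lasso.coef_.size} coefficients are exactly zero')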

Lasso Regression Model Performance


In [ ]: # Plotting Lasso Regression Model Performance
plt.figure(figsize=(8, 4))
plt.scatter(y_test_housing, y_pred_lasso, color='blue', edgecolor='k', alpha=0.6)
plt.plot([min(y_test_housing), max(y_test_housing)], [min(y_test_housing), max(y_test_housing)], color='red', linewidth=2)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Lasso Regression Model Performance')
plt.show()

SVC
Support Vector Classification.

In [ ]: # SVC
# Parameters: kernel (string), C (float)
# Returns: Fitted model

# Example: Classifying Iris species using Support Vector Classification


svc = SVC(kernel='rbf', C=1.0)
svc.fit(X_train, y_train)
y_pred_svc = svc.predict(X_test)
print(f'Accuracy: {accuracy_score(y_test, y_pred_svc)}')

Accuracy: 1.0
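
SVC is sensitive to feature scales, so it is commonly combined with a scaler. On this Iris split the score may not change, but the pattern is worth sketching:

In [ ]: # Scale features before the SVC inside a single estimator
svc_scaled = Pipeline([('scaler', StandardScaler()), ('svc', SVC(kernel='rbf', C=1.0))])
svc_scaled.fit(X_train, y_train)
print(f'Accuracy with scaling: {accuracy_score(y_test, svc_scaled.predict(X_test))}')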

SVC Model Performance


In [ ]: # Plotting SVC Model Performance
conf_matrix = confusion_matrix(y_test, y_pred_svc)
plt.figure(figsize=(8, 4))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('SVC Confusion Matrix')
plt.show()

RandomForestClassifier
Random Forest Classification.

In [ ]: # RandomForestClassifier
# Parameters: n_estimators (int), random_state (int)
# Returns: Fitted model

# Example: Classifying Iris species using Random Forest Classification


rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
print(f'Accuracy: {accuracy_score(y_test, y_pred_rf)}')

Accuracy: 1.0
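
Random forests expose per-feature importances, which help explain what drives the predictions:

In [ ]: # Feature importances from the fitted forest
for name, importance in zip(iris.feature_names, rf.feature_importances_):
    print(f'{name}: {importance:.3f}')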

Random Forest Model Performance


In [ ]: # Plotting Random Forest Model Performance
conf_matrix = confusion_matrix(y_test, y_pred_rf)
plt.figure(figsize=(8, 4))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Random Forest Confusion Matrix')
plt.show()

GradientBoostingClassifier
Gradient Boosting Classification.

In [ ]: # GradientBoostingClassifier
# Parameters: n_estimators (int), learning_rate (float)
# Returns: Fitted model

# Example: Classifying Iris species using Gradient Boosting Classification


gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
gb.fit(X_train, y_train)
y_pred_gb = gb.predict(X_test)
print(f'Accuracy: {accuracy_score(y_test, y_pred_gb)}')

Accuracy: 1.0
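
Because boosting adds trees sequentially, staged_predict shows how accuracy evolves as stages accumulate, a cheap way to judge whether n_estimators is too high or too low:

In [ ]: # Test accuracy after each boosting stage
staged_acc = [accuracy_score(y_test, pred) for pred in gb.staged_predict(X_test)]
print(f'After 10 trees: {staged_acc[9]:.3f}, after 100 trees: {staged_acc[-1]:.3f}')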

Gradient Boosting Model Performance


In [ ]: # Plotting Gradient Boosting Model Performance
conf_matrix = confusion_matrix(y_test, y_pred_gb)
plt.figure(figsize=(8, 4))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Gradient Boosting Confusion Matrix')
plt.show()

KNeighborsClassifier
K-Nearest Neighbors Classification.

In [ ]: # KNeighborsClassifier
# Parameters: n_neighbors (int)
# Returns: Fitted model

# Example: Classifying Iris species using K-Nearest Neighbors Classification


knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)
print(f'Accuracy: {accuracy_score(y_test, y_pred_knn)}')

Accuracy: 1.0
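
The choice of n_neighbors is the main hyperparameter. A small sweep over a few values of k (the particular values are illustrative):

In [ ]: # Accuracy as a function of k
for k in (1, 3, 5, 7, 9):
    y_pred_k = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).predict(X_test)
    print(f'k={k}: accuracy={accuracy_score(y_test, y_pred_k):.3f}')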

K-Nearest Neighbors Model Performance


In [ ]: # Plotting K-Nearest Neighbors Model Performance
conf_matrix = confusion_matrix(y_test, y_pred_knn)
plt.figure(figsize=(8, 4))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('K-Nearest Neighbors Confusion Matrix')
plt.show()

DecisionTreeClassifier
Decision Tree Classification.

In [ ]: # DecisionTreeClassifier
# Parameters: criterion (string), max_depth (int)
# Returns: Fitted model
# Example: Classifying Iris species using Decision Tree Classification
dt = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
print(f'Accuracy: {accuracy_score(y_test, y_pred_dt)}')

Accuracy: 1.0
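
A fitted tree can be printed as human-readable rules with export_text, which makes the effect of max_depth=3 concrete:

In [ ]: # Print the learned decision rules
from sklearn.tree import export_text
print(export_text(dt, feature_names=list(iris.feature_names)))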

Decision Tree Model Performance


In [ ]: # Plotting Decision Tree Model Performance
conf_matrix = confusion_matrix(y_test, y_pred_dt)
plt.figure(figsize=(8, 4))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Decision Tree Confusion Matrix')
plt.show()
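
classification_report
Building a text report showing the main classification metrics. The table above lists this function; a minimal sketch is shown here, reusing the decision tree predictions (the choice of y_pred_dt is illustrative).

In [ ]: # classification_report
# Parameters: y_true (array-like), y_pred (array-like)
# Returns: Text report (string)

# Example: Generating a classification report for a classification model
print(classification_report(y_test, y_pred_dt, target_names=iris.target_names))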

mean_squared_error
Mean squared error.

In [ ]: # mean_squared_error
# Parameters: y_true (array-like), y_pred (array-like)
# Returns: Mean squared error (float)

# Example: Calculating the mean squared error for a regression model


mse = mean_squared_error(y_test_housing, y_pred_housing)
print(f'Mean Squared Error: {mse}')

Mean Squared Error: 0.5305677824766757

r2_score
R^2 (coefficient of determination) regression score function.

In [ ]: # r2_score
# Parameters: y_true (array-like), y_pred (array-like)
# Returns: R^2 score (float)

# Example: Calculating the R^2 score for a regression model


r2 = r2_score(y_test_housing, y_pred_housing)
print(f'R^2 Score: {r2}')

R^2 Score: 0.595770232606166

PCA
Principal Component Analysis.

In [ ]: # PCA
# Parameters: n_components (int)
# Returns: Transformed data (array)

# Example: Reducing dimensionality of the Iris dataset using PCA


pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print(f'Shape before PCA: {X.shape}, Shape after PCA: {X_pca.shape}')

Shape before PCA: (150, 4), Shape after PCA: (150, 2)
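
The explained variance ratio shows how much information the retained components keep; for Iris the first component typically dominates:

In [ ]: # Variance explained by each principal component
print(f'Explained variance ratio: {pca.explained_variance_ratio_}')
print(f'Total retained: {pca.explained_variance_ratio_.sum():.3f}')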

cross_val_score
Evaluate a score by cross-validation.
In [ ]: # cross_val_score
# Parameters: estimator (object), X (array-like), y (array-like), cv (int)
# Returns: Array of scores

# Example: Performing cross-validation on the Iris dataset


cross_val_scores = cross_val_score(rf, X, y, cv=5)
print(f'Cross-validation scores: {cross_val_scores}')

Cross-validation scores: [0.96666667 0.96666667 0.93333333 0.96666667 1. ]
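
Summarizing the fold scores as a mean and standard deviation is the usual way to report them:

In [ ]: # Aggregate the cross-validation scores
print(f'Mean CV accuracy: {cross_val_scores.mean():.3f} +/- {cross_val_scores.std():.3f}')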

GridSearchCV
Exhaustive search over specified parameter values for an estimator.

In [ ]: # GridSearchCV
# Parameters: estimator (object), param_grid (dict), cv (int)
# Returns: fitted search object exposing best_estimator_, best_params_, best_score_

# Example: Performing grid search to find the best hyperparameters for a classification model
param_grid = {'n_estimators': [50, 100, 150], 'max_depth': [None, 3, 5, 10, 20, 30]}
grid_search = GridSearchCV(rf, param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(f'Best parameters: {grid_search.best_params_}')
print(f'Best score: {grid_search.best_score_}')

Best parameters: {'max_depth': None, 'n_estimators': 100}
Best score: 0.9428571428571428
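
The refitted best model is available as best_estimator_ and can be evaluated on the held-out test set:

In [ ]: # Evaluate the best model found by the grid search
best_rf = grid_search.best_estimator_
print(f'Test accuracy: {accuracy_score(y_test, best_rf.predict(X_test))}')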

Pipeline
Pipeline of transforms with a final estimator.

In [ ]: # Pipeline
# Parameters: steps (list of tuples)
# Returns: Fitted pipeline

# Example: Creating a pipeline for scaling and classifying the Iris dataset
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
pipeline.fit(X_train, y_train)
y_pred_pipeline = pipeline.predict(X_test)
print(f'Accuracy: {accuracy_score(y_test, y_pred_pipeline)}')

Accuracy: 1.0
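
Pipelines also combine cleanly with GridSearchCV: a step's hyperparameters are addressed as <step name>__<parameter>. A sketch tuning the classifier inside the pipeline above (the grid values are illustrative):

In [ ]: # Grid search over a pipeline step's hyperparameters
pipe_param_grid = {'classifier__max_depth': [None, 3, 5]}
pipe_search = GridSearchCV(pipeline, pipe_param_grid, cv=5)
pipe_search.fit(X_train, y_train)
print(f'Best parameters: {pipe_search.best_params_}')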
