AI Basics: Scikit-Learn Cheat Sheet
Comprehensive Guide to Scikit-Learn Functions and Features
This notebook walks through the most commonly used scikit-learn functions, each with a brief explanation and a runnable example.
Cheat Sheet Table of Scikit-Learn Functions
| Feature/Function | Brief Explanation | Example |
| --- | --- | --- |
| train_test_split | Splitting data into training and testing sets | Splitting the Iris dataset into training and testing sets |
| StandardScaler | Standardizing features by removing the mean and scaling to unit variance | Standardizing the Iris dataset |
| MinMaxScaler | Transforming features by scaling each feature to a given range | Scaling the Iris dataset to range [0, 1] |
| LabelEncoder | Encoding target labels with value between 0 and n_classes-1 | Encoding the target variable in the Iris dataset |
| OneHotEncoder | Encoding categorical features as a one-hot numeric array | One-hot encoding categorical features in a sample dataset |
| PolynomialFeatures | Generating polynomial and interaction features | Generating polynomial features for the California housing dataset |
| LinearRegression | Linear regression model | Predicting house prices using Linear Regression |
| LogisticRegression | Logistic regression model | Classifying digits using Logistic Regression |
| Ridge | Ridge regression model | Predicting house prices using Ridge Regression |
| Lasso | Lasso regression model | Predicting house prices using Lasso Regression |
| SVC | Support Vector Classification | Classifying Iris species using Support Vector Classification |
| RandomForestClassifier | Random Forest Classification | Classifying Iris species using Random Forest Classification |
| GradientBoostingClassifier | Gradient Boosting Classification | Classifying Iris species using Gradient Boosting Classification |
| KNeighborsClassifier | K-Nearest Neighbors Classification | Classifying Iris species using K-Nearest Neighbors Classification |
| DecisionTreeClassifier | Decision Tree Classification | Classifying Iris species using Decision Tree Classification |
| accuracy_score | Computing accuracy score | Calculating the accuracy score for a classification model |
| confusion_matrix | Computing confusion matrix | Generating a confusion matrix for a classification model |
| classification_report | Building classification report | Generating a classification report for a classification model |
| mean_squared_error | Mean squared error | Calculating the mean squared error for a regression model |
| r2_score | R^2 (coefficient of determination) regression score function | Calculating the R^2 score for a regression model |
| PCA | Principal Component Analysis | Reducing dimensionality of the Iris dataset using PCA |
| cross_val_score | Evaluate a score by cross-validation | Performing cross-validation on the Iris dataset |
| GridSearchCV | Exhaustive search over specified parameter values for an estimator | Performing grid search to find the best hyperparameters for a classification model |
| Pipeline | Pipeline of transforms with a final estimator | Creating a pipeline for scaling and classifying the Iris dataset |
Introduction
Scikit-Learn is a powerful Python library for machine learning, providing a wide range of tools for data analysis and modeling. This guide will walk you through the
key functions and features of Scikit-Learn with practical examples.
In [ ]: # Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris, fetch_california_housing, load_digits
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder, PolynomialFeatures
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso, SGDRegressor
from sklearn.svm import SVC, SVR, LinearSVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, RandomForestRegressor, GradientBoostingRegressor
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, mean_squared_error, r2_score
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
train_test_split
Splitting data into training and testing sets.
In [ ]: # train_test_split
# Parameters: X (array-like), y (array-like), test_size (float), random_state (int)
# Returns: X_train, X_test, y_train, y_test (arrays)
# Example: Splitting the Iris dataset into training and testing sets
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(f'X_train shape: {X_train.shape}, X_test shape: {X_test.shape}')
print(f'y_train shape: {y_train.shape}, y_test shape: {y_test.shape}')
X_train shape: (105, 4), X_test shape: (45, 4)
y_train shape: (105,), y_test shape: (45,)
StandardScaler
Standardizing features by removing the mean and scaling to unit variance.
In [ ]: # StandardScaler
# Parameters: None
# Returns: Scaled data (array)
# Example: Standardizing the Iris dataset
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(f'Mean: {X_scaled.mean(axis=0)}, Std Dev: {X_scaled.std(axis=0)}')
Mean: [-1.69031455e-15 -1.84297022e-15 -1.69864123e-15 -1.40924309e-15], Std Dev: [1. 1. 1. 1.]
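Note that in a real modeling workflow the scaler should be fit on the training split only and then reused to transform the test split, so that test-set statistics never leak into preprocessing. A minimal sketch using the Iris split from above:
In [ ]: # Fit the scaler on training data only, then apply the same transform to test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean and std from X_train
X_test_scaled = scaler.transform(X_test)        # reuses the training mean and std
print(f'Train mean (per feature): {X_train_scaled.mean(axis=0).round(2)}')
print(f'Test mean (per feature): {X_test_scaled.mean(axis=0).round(2)}')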
MinMaxScaler
Transforming features by scaling each feature to a given range.
In [ ]: # MinMaxScaler
# Parameters: feature_range (tuple, default (0, 1))
# Returns: Scaled data (array)
# Example: Scaling the Iris dataset to range [0, 1]
min_max_scaler = MinMaxScaler()
X_min_max_scaled = min_max_scaler.fit_transform(X)
print(f'Min: {X_min_max_scaled.min(axis=0)}, Max: {X_min_max_scaled.max(axis=0)}')
Min: [0. 0. 0. 0.], Max: [1. 1. 1. 1.]
LabelEncoder
Encoding target labels with value between 0 and n_classes-1.
In [ ]: # LabelEncoder
# Parameters: None
# Returns: Encoded labels (array)
# Example: Encoding the target variable in the Iris dataset
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)
print(f'Classes: {label_encoder.classes_}')
print(f'Encoded labels: {y_encoded[:10]}')
Classes: [0 1 2]
Encoded labels: [0 0 0 0 0 0 0 0 0 0]
OneHotEncoder
Encoding categorical features as a one-hot numeric array.
In [ ]: # OneHotEncoder
# Parameters: None
# Returns: One-hot encoded array
# Example: One-hot encoding categorical features in a sample dataset
data = pd.DataFrame({
'feature': ['A', 'B', 'C', 'A', 'B', 'C']
})
one_hot_encoder = OneHotEncoder()
one_hot_encoded = one_hot_encoder.fit_transform(data)
print(f'One-hot encoded array:\n{one_hot_encoded}')
print(f'Categories: {one_hot_encoder.categories_}')
One-hot encoded array:
(0, 0) 1.0
(1, 1) 1.0
(2, 2) 1.0
(3, 0) 1.0
(4, 1) 1.0
(5, 2) 1.0
Categories: [array(['A', 'B', 'C'], dtype=object)]
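The output above is a SciPy sparse matrix, which is OneHotEncoder's default return type. When a dense NumPy array is needed (for inspection, or for estimators that do not accept sparse input), convert it explicitly:
In [ ]: # Convert the sparse one-hot matrix to a dense array
print(one_hot_encoded.toarray())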
PolynomialFeatures
Generating polynomial and interaction features.
In [ ]: # PolynomialFeatures
# Parameters: degree (int)
# Returns: Polynomial features (array)
# Example: Generating polynomial features for the California housing dataset
housing = fetch_california_housing()
X_housing = housing.data
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X_housing)
print(f'Original shape: {X_housing.shape}, Polynomial shape: {X_poly.shape}')
Original shape: (20640, 8), Polynomial shape: (20640, 45)
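To see which terms the transformer generated (original features, squares, and pairwise interactions), recent scikit-learn versions expose get_feature_names_out:
In [ ]: # Inspect the generated polynomial terms (scikit-learn >= 1.0)
feature_names = poly.get_feature_names_out(housing.feature_names)
print(feature_names[:10])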
LinearRegression
Linear regression model.
In [ ]: # LinearRegression
# Parameters: None
# Returns: Fitted model
# Example: Predicting house prices using Linear Regression
X_train_housing, X_test_housing, y_train_housing, y_test_housing = train_test_split(X_housing, housing.target, test_size=0.3, random_state=42)
lr = LinearRegression()
lr.fit(X_train_housing, y_train_housing)
y_pred_housing = lr.predict(X_test_housing)
print(f'Mean squared error: {mean_squared_error(y_test_housing, y_pred_housing)}')
print(f'R^2 score: {r2_score(y_test_housing, y_pred_housing)}')
Mean squared error: 0.5305677824766757
R^2 score: 0.595770232606166
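The fitted model exposes its learned parameters, which can help interpret the regression. A quick look at the intercept and per-feature coefficients:
In [ ]: # Inspect the fitted intercept and coefficients
print(f'Intercept: {lr.intercept_:.4f}')
for name, coef in zip(housing.feature_names, lr.coef_):
    print(f'{name}: {coef:.4f}')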
Linear Regression Model Performance
In [ ]: # Plotting Linear Regression Model Performance
plt.figure(figsize=(8, 4))
plt.scatter(y_test_housing, y_pred_housing, color='blue', edgecolor='k', alpha=0.6)
plt.plot([min(y_test_housing), max(y_test_housing)], [min(y_test_housing), max(y_test_housing)], color='red', linewidth=2)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Linear Regression Model Performance')
plt.show()
LogisticRegression
Logistic regression model.
In [ ]: # LogisticRegression
# Parameters: max_iter (int)
# Returns: Fitted model
# Example: Classifying digits using Logistic Regression
digits = load_digits()
X_digits, y_digits = digits.data, digits.target
X_train_digits, X_test_digits, y_train_digits, y_test_digits = train_test_split(X_digits, y_digits, test_size=0.3, random_state=42)
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train_digits, y_train_digits)
y_pred_digits = lr.predict(X_test_digits)
print(f'Accuracy: {accuracy_score(y_test_digits, y_pred_digits)}')
Accuracy: 0.9685185185185186
Logistic Regression Model Performance
In [ ]: # Plotting Logistic Regression Model Performance
conf_matrix = confusion_matrix(y_test_digits, y_pred_digits)
plt.figure(figsize=(8, 4))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Logistic Regression Confusion Matrix')
plt.show()
Ridge
Ridge regression model.
In [ ]: # Ridge
# Parameters: alpha (float)
# Returns: Fitted model
# Example: Predicting house prices using Ridge Regression
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_housing, y_train_housing)
y_pred_ridge = ridge.predict(X_test_housing)
print(f'Mean squared error: {mean_squared_error(y_test_housing, y_pred_ridge)}')
print(f'R^2 score: {r2_score(y_test_housing, y_pred_ridge)}')
Mean squared error: 0.5305052690933701
R^2 score: 0.5958178603951635
Ridge Regression Model Performance
In [ ]: # Plotting Ridge Regression Model Performance
plt.figure(figsize=(8, 4))
plt.scatter(y_test_housing, y_pred_ridge, color='blue', edgecolor='k', alpha=0.6)
plt.plot([min(y_test_housing), max(y_test_housing)], [min(y_test_housing), max(y_test_housing)], color='red', linewidth=2)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Ridge Regression Model Performance')
plt.show()
Lasso
Lasso regression model.
In [ ]: # Lasso
# Parameters: alpha (float)
# Returns: Fitted model
# Example: Predicting house prices using Lasso Regression
lasso = Lasso(alpha=0.1)
lasso.fit(X_train_housing, y_train_housing)
y_pred_lasso = lasso.predict(X_test_housing)
print(f'Mean squared error: {mean_squared_error(y_test_housing, y_pred_lasso)}')
print(f'R^2 score: {r2_score(y_test_housing, y_pred_lasso)}')
Mean squared error: 0.5970512258509186
R^2 score: 0.545117728367666
Lasso Regression Model Performance
In [ ]: # Plotting Lasso Regression Model Performance
plt.figure(figsize=(8, 4))
plt.scatter(y_test_housing, y_pred_lasso, color='blue', edgecolor='k', alpha=0.6)
plt.plot([min(y_test_housing), max(y_test_housing)], [min(y_test_housing), max(y_test_housing)], color='red', linewidth=2)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Lasso Regression Model Performance')
plt.show()
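A practical difference from Ridge: the L1 penalty can drive coefficients exactly to zero, so Lasso doubles as a feature selector. A quick check on the fitted model:
In [ ]: # L1 regularization zeroes out some coefficients entirely
print(f'Lasso coefficients: {lasso.coef_}')
print(f'Features zeroed out: {np.sum(lasso.coef_ == 0)} of {len(lasso.coef_)}')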
SVC
Support Vector Classification.
In [ ]: # SVC
# Parameters: kernel (string), C (float)
# Returns: Fitted model
# Example: Classifying Iris species using Support Vector Classification
svc = SVC(kernel='rbf', C=1.0)
svc.fit(X_train, y_train)
y_pred_svc = svc.predict(X_test)
print(f'Accuracy: {accuracy_score(y_test, y_pred_svc)}')
Accuracy: 1.0
SVC Model Performance
In [ ]: # Plotting SVC Model Performance
conf_matrix = confusion_matrix(y_test, y_pred_svc)
plt.figure(figsize=(8, 4))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('SVC Confusion Matrix')
plt.show()
RandomForestClassifier
Random Forest Classification.
In [ ]: # RandomForestClassifier
# Parameters: n_estimators (int), random_state (int)
# Returns: Fitted model
# Example: Classifying Iris species using Random Forest Classification
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
print(f'Accuracy: {accuracy_score(y_test, y_pred_rf)}')
Accuracy: 1.0
Random Forest Model Performance
In [ ]: # Plotting Random Forest Model Performance
conf_matrix = confusion_matrix(y_test, y_pred_rf)
plt.figure(figsize=(8, 4))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Random Forest Confusion Matrix')
plt.show()
GradientBoostingClassifier
Gradient Boosting Classification.
In [ ]: # GradientBoostingClassifier
# Parameters: n_estimators (int), learning_rate (float)
# Returns: Fitted model
# Example: Classifying Iris species using Gradient Boosting Classification
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
gb.fit(X_train, y_train)
y_pred_gb = gb.predict(X_test)
print(f'Accuracy: {accuracy_score(y_test, y_pred_gb)}')
Accuracy: 1.0
Gradient Boosting Model Performance
In [ ]: # Plotting Gradient Boosting Model Performance
conf_matrix = confusion_matrix(y_test, y_pred_gb)
plt.figure(figsize=(8, 4))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Gradient Boosting Confusion Matrix')
plt.show()
KNeighborsClassifier
K-Nearest Neighbors Classification.
In [ ]: # KNeighborsClassifier
# Parameters: n_neighbors (int)
# Returns: Fitted model
# Example: Classifying Iris species using K-Nearest Neighbors Classification
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)
print(f'Accuracy: {accuracy_score(y_test, y_pred_knn)}')
Accuracy: 1.0
K-Nearest Neighbors Model Performance
In [ ]: # Plotting K-Nearest Neighbors Model Performance
conf_matrix = confusion_matrix(y_test, y_pred_knn)
plt.figure(figsize=(8, 4))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('K-Nearest Neighbors Confusion Matrix')
plt.show()
DecisionTreeClassifier
Decision Tree Classification.
In [ ]: # DecisionTreeClassifier
# Parameters: criterion (string), max_depth (int)
# Returns: Fitted model
# Example: Classifying Iris species using Decision Tree Classification
dt = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
print(f'Accuracy: {accuracy_score(y_test, y_pred_dt)}')
Accuracy: 1.0
Decision Tree Model Performance
In [ ]: # Plotting Decision Tree Model Performance
conf_matrix = confusion_matrix(y_test, y_pred_dt)
plt.figure(figsize=(8, 4))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Decision Tree Confusion Matrix')
plt.show()
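classification_report
Building a text report showing the main classification metrics (precision, recall, F1-score) per class. A minimal example using the Decision Tree predictions from the previous cell:
In [ ]: # classification_report
# Parameters: y_true (array-like), y_pred (array-like), target_names (list, optional)
# Returns: Text report (string)
# Example: Generating a classification report for a classification model
print(classification_report(y_test, y_pred_dt, target_names=iris.target_names))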
mean_squared_error
Mean squared error.
In [ ]: # mean_squared_error
# Parameters: y_true (array-like), y_pred (array-like)
# Returns: Mean squared error (float)
# Example: Calculating the mean squared error for a regression model
mse = mean_squared_error(y_test_housing, y_pred_housing)
print(f'Mean Squared Error: {mse}')
Mean Squared Error: 0.5305677824766757
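Because MSE is expressed in squared target units, its square root (RMSE) is often easier to interpret, since it is in the same units as the target itself:
In [ ]: # RMSE: square root of MSE, in the target's own units
rmse = np.sqrt(mse)
print(f'Root Mean Squared Error: {rmse}')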
r2_score
R^2 (coefficient of determination) regression score function.
In [ ]: # r2_score
# Parameters: y_true (array-like), y_pred (array-like)
# Returns: R^2 score (float)
# Example: Calculating the R^2 score for a regression model
r2 = r2_score(y_test_housing, y_pred_housing)
print(f'R^2 Score: {r2}')
R^2 Score: 0.595770232606166
PCA
Principal Component Analysis.
In [ ]: # PCA
# Parameters: n_components (int)
# Returns: Transformed data (array)
# Example: Reducing dimensionality of the Iris dataset using PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print(f'Shape before PCA: {X.shape}, Shape after PCA: {X_pca.shape}')
Shape before PCA: (150, 4), Shape after PCA: (150, 2)
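The fitted PCA object reports how much of the total variance each retained component explains, which is the usual way to judge whether two components are enough:
In [ ]: # Check the variance retained by the two principal components
print(f'Explained variance ratio: {pca.explained_variance_ratio_}')
print(f'Total variance retained: {pca.explained_variance_ratio_.sum():.4f}')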
cross_val_score
Evaluate a score by cross-validation.
In [ ]: # cross_val_score
# Parameters: estimator (object), X (array-like), y (array-like), cv (int)
# Returns: Array of scores
# Example: Performing cross-validation on the Iris dataset
cross_val_scores = cross_val_score(rf, X, y, cv=5)
print(f'Cross-validation scores: {cross_val_scores}')
Cross-validation scores: [0.96666667 0.96666667 0.93333333 0.96666667 1. ]
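The per-fold scores are usually summarized as a mean and standard deviation:
In [ ]: # Summarize cross-validation results
print(f'Mean CV accuracy: {cross_val_scores.mean():.4f} (+/- {cross_val_scores.std():.4f})')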
GridSearchCV
Exhaustive search over specified parameter values for an estimator.
In [ ]: # GridSearchCV
# Parameters: estimator (object), param_grid (dict), cv (int)
# Returns: Best estimator
# Example: Performing grid search to find the best hyperparameters for a classification model
param_grid = {'n_estimators': [50, 100, 150], 'max_depth': [None, 3, 5, 10, 20, 30]}
grid_search = GridSearchCV(rf, param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(f'Best parameters: {grid_search.best_params_}')
print(f'Best score: {grid_search.best_score_}')
Best parameters: {'max_depth': None, 'n_estimators': 100}
Best score: 0.9428571428571428
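With the default refit=True, GridSearchCV retrains the best configuration on the full training set and exposes it as best_estimator_, ready for evaluation on held-out data:
In [ ]: # Evaluate the refitted best model on the test set
best_model = grid_search.best_estimator_
y_pred_best = best_model.predict(X_test)
print(f'Test accuracy of best model: {accuracy_score(y_test, y_pred_best)}')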
Pipeline
Pipeline of transforms with a final estimator.
In [ ]: # Pipeline
# Parameters: steps (list of tuples)
# Returns: Fitted pipeline
# Example: Creating a pipeline for scaling and classifying the Iris dataset
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
pipeline.fit(X_train, y_train)
y_pred_pipeline = pipeline.predict(X_test)
print(f'Accuracy: {accuracy_score(y_test, y_pred_pipeline)}')
Accuracy: 1.0
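Pipelines also compose with GridSearchCV: a step's hyperparameters are addressed as <step_name>__<parameter>, so the whole preprocessing-plus-model chain is tuned together without test-set leakage. A minimal sketch reusing the pipeline above:
In [ ]: # Tune a pipeline step via the <step>__<param> naming convention
pipe_grid = GridSearchCV(pipeline, {'classifier__n_estimators': [50, 100]}, cv=5)
pipe_grid.fit(X_train, y_train)
print(f'Best parameters: {pipe_grid.best_params_}')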