BCSL606 | Machine Learning Lab
Visvesvaraya Technological University (VTU)
Subject Code: BCSL606
Subject: Machine Learning Laboratory
Laboratory Components
1. Histograms and Boxplots Analysis (California Housing)
2. Correlation Matrix and Pair Plot (California Housing)
3. PCA Dimensionality Reduction (Iris Dataset)
4. Find-S Algorithm for Hypothesis Generation
5. k-Nearest Neighbors Classification (Generated Data)
6. Locally Weighted Regression Algorithm
7. Linear and Polynomial Regression (Boston Housing & Auto MPG)
8. Decision Tree Classifier (Breast Cancer Dataset)
9. Naive Bayes Classifier (Olivetti Face Dataset)
10. K-Means Clustering (Breast Cancer Dataset)
Experiment-01
Develop a program to create histograms for all numerical features and analyze the
distribution of each feature. Generate box plots for all numerical features and identify
any outliers. Use California Housing dataset.
Code:
!pip install pandas numpy matplotlib seaborn scikit-learn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing
# Set the style for better visualization
plt.style.use('tableau-colorblind10') # Using a built-in matplotlib style
def load_and_prepare_data():
"""Load California Housing dataset and convert to pandas DataFrame"""
housing = fetch_california_housing()
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df['PRICE'] = housing.target
return df
def create_distribution_plots(df, save_plots=False):
"""Create histograms and box plots for all numerical features"""
numerical_features = df.columns
# Calculate number of rows needed for subplot grid
n_features = len(numerical_features)
n_rows = (n_features + 1) // 2 # 2 plots per row
# Create histograms
plt.figure(figsize=(15, 5*n_rows))
for idx, feature in enumerate(numerical_features, 1):
plt.subplot(n_rows, 2, idx)
sns.histplot(data=df, x=feature, kde=True)
plt.title(f'Distribution of {feature}')
plt.xlabel(feature)
plt.ylabel('Count')
plt.tight_layout()
if save_plots:
plt.savefig('histograms.png')
plt.show()
# Create box plots
plt.figure(figsize=(15, 5*n_rows))
for idx, feature in enumerate(numerical_features, 1):
plt.subplot(n_rows, 2, idx)
sns.boxplot(data=df[feature])
plt.title(f'Box Plot of {feature}')
plt.tight_layout()
if save_plots:
plt.savefig('boxplots.png')
plt.show()
def analyze_distributions(df):
"""Generate statistical summary and identify outliers"""
stats_summary = df.describe()
# Calculate IQR and identify outliers for each feature
outlier_summary = {}
for column in df.columns:
Q1 = df[column].quantile(0.25)
Q3 = df[column].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)][column]
outlier_summary[column] = {
'number_of_outliers': len(outliers),
'percentage_of_outliers': (len(outliers) / len(df)) * 100,
'outlier_range': f"< {lower_bound:.2f} or > {upper_bound:.2f}"
}
return stats_summary, outlier_summary
def main():
# Load the data
df = load_and_prepare_data()
# Create visualization plots
create_distribution_plots(df)
# Analyze distributions and outliers
stats_summary, outlier_summary = analyze_distributions(df)
# Print statistical summary
print("\nStatistical Summary:")
print(stats_summary)
# Print outlier analysis
print("\nOutlier Analysis:")
for feature, summary in outlier_summary.items():
print(f"\n{feature}:")
print(f"Number of outliers: {summary['number_of_outliers']}")
print(f"Percentage of outliers: {summary['percentage_of_outliers']:.2f}%")
print(f"Outlier range: {summary['outlier_range']}")
if __name__ == "__main__":
main()
Output
Explanation
Understanding California Housing Data Analysis
Introduction
The code performs an exploratory data analysis (EDA) on California housing data. EDA is a
crucial first step in understanding your dataset before performing any advanced analysis or
modeling. This analysis focuses on understanding the distribution of housing features and
prices across California.
Theory Behind Each Component
Data Loading and Preparation
The California Housing dataset is a standard dataset in scikit-learn containing housing prices
and related features. The data preparation step converts this into a pandas DataFrame, which is
a table-like structure where:
Each row represents a different location in California
Each column represents a different feature (like house price, income, population)
The target variable (house price) is added as an additional column
Distribution Analysis
The code analyzes distributions through two main approaches:
1. Visual Analysis The distribution plots help understand how values are spread across
each feature:
o Histograms show the frequency distribution of values, revealing if data is
normally distributed, skewed, or has multiple peaks
o Kernel Density Estimation (KDE) smooths the histogram to show the
continuous probability distribution
o Box plots reveal the median, quartiles, and potential outliers in the data
2. Statistical Analysis The code calculates key statistical measures:
o Descriptive statistics (mean, median, standard deviation) summarize central
tendency and spread
o Interquartile Range (IQR) measures variability by finding the range between
the 25th and 75th percentiles
o Outlier detection uses the 1.5 × IQR rule: any point beyond 1.5 times the IQR
from the quartiles is considered an outlier
Visualization System
The visualization system uses matplotlib and seaborn libraries because:
Matplotlib provides the foundation for creating plots
Seaborn adds statistical plotting functions and improves plot aesthetics
The tableau-colorblind10 style ensures accessibility and professional appearance
Statistical Methods Used
1. Descriptive Statistics
o Mean: Average value of each feature
o Standard deviation: Measure of data spread
o Quartiles: Values that divide data into four equal parts
o Min/Max: Range of values for each feature
2. Outlier Detection The IQR method is used because:
o It's resistant to extreme values
o Doesn't assume normal distribution
o Identifies values that are unusually high or low
o Formula: [Q1 - 1.5×IQR, Q3 + 1.5×IQR] defines the normal range
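As a quick illustration of the 1.5 × IQR rule, the following minimal sketch (using a small made-up array, not the housing data) computes the normal range and flags the value that falls outside it:

import numpy as np

values = np.array([1.0, 1.2, 1.1, 0.9, 1.3, 5.0])  # 5.0 is an obvious outlier
Q1, Q3 = np.percentile(values, [25, 75])
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
outliers = values[(values < lower) | (values > upper)]
print(f"Normal range: [{lower:.2f}, {upper:.2f}], outliers: {outliers}")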
Significance of Each Feature
The dataset includes these meaningful features:
Median Income: Indicates area's economic status
House Age: Represents property age
Average Rooms/Bedrooms: Indicates house size
Population and Occupancy: Shows area density
Location (Latitude/Longitude): Captures geographical factors
Price: Target variable showing house values
Purpose of Analysis Components
1. Distribution Plots
o Help identify patterns in data
o Show if variables are normally distributed
o Reveal potential data quality issues
o Highlight relationships between features
2. Statistical Summary
o Provides numerical understanding of data
o Helps identify unusual patterns
o Supports data-driven decisions
o Validates visual observations
3. Outlier Analysis
o Identifies unusual cases
o Helps understand extreme values
o Supports data cleaning decisions
o Reveals potential data errors
Expected Insights
This analysis helps understand:
Typical housing prices in California
How features vary across locations
Unusual patterns or anomalies
Relationships between features
Data quality and reliability
The combination of visual and statistical analysis provides a comprehensive understanding of
California's housing market characteristics, essential for further modeling or decision-making
processes.
Experiment-02
Develop a program to Compute the correlation matrix to understand the relationships
between pairs of features. Visualize the correlation matrix using a heatmap to know
which variables have strong positive/negative correlations. Create a pair plot to
visualize pairwise relationships between features. Use California Housing dataset.
Code:
!pip install pandas numpy matplotlib seaborn scikit-learn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing
def load_and_prepare_data():
"""Load California Housing dataset and convert to pandas DataFrame"""
housing = fetch_california_housing()
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df['PRICE'] = housing.target
return df
def compute_correlation_matrix(df):
"""Compute and return the correlation matrix"""
correlation_matrix = df.corr()
return correlation_matrix
def plot_correlation_heatmap(correlation_matrix):
"""Create a heatmap visualization of the correlation matrix"""
plt.figure(figsize=(12, 10))
# Create heatmap with correlation values
sns.heatmap(correlation_matrix,
annot=True, # Show correlation values
cmap='coolwarm', # Red for positive, blue for negative correlations
vmin=-1, vmax=1, # Fix the range of correlation values
center=0, # Center the colormap at 0
square=True, # Make the plot square-shaped
fmt='.2f') # Round correlation values to 2 decimal places
plt.title('Correlation Matrix Heatmap')
plt.tight_layout()
plt.show()
def create_pair_plot(df):
"""Create a pair plot to show relationships between all features"""
# Create pair plot
sns.pairplot(df, diag_kind='kde', plot_kws={'alpha': 0.6})
plt.tight_layout()
plt.show()
def analyze_correlations(correlation_matrix):
"""Analyze and print notable correlations"""
# Get upper triangle of the correlation matrix
upper_tri = correlation_matrix.where(np.triu(np.ones(correlation_matrix.shape), k=1).astype(bool))
# Find strong correlations (absolute value > 0.5)
strong_correlations = []
for col in upper_tri.columns:
for idx, value in upper_tri[col].items():
if pd.notna(value) and abs(value) > 0.5:
strong_correlations.append({
'features': (idx, col),
'correlation': value
})
# Sort by absolute correlation value
strong_correlations.sort(key=lambda x: abs(x['correlation']), reverse=True)
return strong_correlations
def main():
# Load the data
print("Loading California Housing dataset...")
df = load_and_prepare_data()
# Compute correlation matrix
print("\nComputing correlation matrix...")
correlation_matrix = compute_correlation_matrix(df)
# Plot correlation heatmap
print("\nCreating correlation heatmap...")
plot_correlation_heatmap(correlation_matrix)
# Create pair plot
print("\nCreating pair plot (this may take a moment)...")
create_pair_plot(df)
# Analyze and print notable correlations
print("\nAnalyzing strong correlations...")
strong_correlations = analyze_correlations(correlation_matrix)
# Print results
print("\nStrong correlations found (|correlation| > 0.5):")
for corr in strong_correlations:
feature1, feature2 = corr['features']
correlation = corr['correlation']
correlation_type = "positive" if correlation > 0 else "negative"
print(f"{feature1} and {feature2}: {correlation:.3f} ({correlation_type}
correlation)")
if __name__ == "__main__":
main()
Output
Explanation
This code analyzes the California Housing dataset to understand how different
features in houses are related to each other.
The main purpose is to find correlations between different housing features. A
correlation shows how strongly two features are related. For example, it can tell
us if house prices tend to go up when the number of rooms increases.
Correlation values range from -1 to +1:
+1 means perfect positive correlation (when one goes up, the other goes
up)
0 means no correlation (no relationship)
-1 means perfect negative correlation (when one goes up, the other goes
down)
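A minimal sketch of this idea, using a tiny made-up DataFrame rather than the housing data, shows how df.corr() and an upper-triangle mask yield the strong pairs:

import numpy as np
import pandas as pd

df_small = pd.DataFrame({
    'rooms': [3, 4, 5, 6, 7],
    'price': [150, 200, 260, 310, 370],  # rises with rooms -> strong positive correlation
    'age': [40, 35, 20, 15, 5]           # falls as price rises -> strong negative correlation
})
corr = df_small.corr()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
strong = upper.stack()                   # (feature pair) -> correlation, NaNs dropped
print(strong[strong.abs() > 0.5])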
The code creates two main visualizations:
1. A Correlation Heatmap:
Shows all correlations in a color-coded matrix
Red colors show positive correlations
Blue colors show negative correlations
Darker colors mean stronger relationships
Numbers in each cell show the exact correlation value
2. A Pair Plot:
Shows scatter plots for every pair of features
Helps visualize relationships between variables
Shows distribution of each feature on the diagonal
The code also automatically finds strong correlations (values above 0.5 or
below -0.5) and prints them, telling you which features are strongly related and
whether the relationship is positive or negative.
This analysis helps understand patterns in the housing market, like:
Which features most strongly affect house prices
Which features tend to occur together
Whether features have expected or surprising relationships
1. Function: load_and_prepare_data()
o Purpose: Loads California Housing dataset
o Steps:
Fetches data using sklearn's fetch_california_housing()
Converts to pandas DataFrame
Adds house prices as a target column
Returns complete dataset
2. Function: compute_correlation_matrix(df)
o Purpose: Calculates correlations between all features
o Uses pandas' df.corr() to compute Pearson correlation coefficients
o Returns a matrix where values range from -1 to 1
1: Perfect positive correlation
0: No correlation
-1: Perfect negative correlation
3. Function: plot_correlation_heatmap(correlation_matrix)
o Purpose: Creates visual heatmap of correlations
o Settings:
Figure size: 12x10
Shows actual correlation values (annot=True)
Uses coolwarm color scheme (red=positive, blue=negative)
Range: -1 to 1
Formats numbers to 2 decimal places
4. Function: create_pair_plot(df)
o Purpose: Shows relationships between all pairs of features
o Uses seaborn's pairplot
o Settings:
Diagonal: Kernel Density Estimation (kde)
Alpha: 0.6 for transparency
Shows scatter plots for all feature combinations
5. Function: analyze_correlations(correlation_matrix)
o Purpose: Identifies strong correlations
o Steps:
Gets upper triangle of correlation matrix
Finds correlations stronger than ±0.5
Sorts results by correlation strength
Returns list of strong correlations
6. Function: main()
o Purpose: Orchestrates the analysis workflow
o Process:
1. Loads housing data
2. Computes correlation matrix
3. Creates heatmap visualization
4. Generates pair plot
5. Analyzes strong correlations
6. Prints findings
7. Output Format
o Visual outputs:
Correlation heatmap
Pair plot matrix
o Text output:
Lists strong correlations
Shows correlation strength
Indicates if correlation is positive/negative
Experiment-03
Develop a program to implement Principal Component Analysis (PCA) for
reducing the dimensionality of the Iris dataset from 4 features to 2
Code:
!pip install pandas numpy matplotlib scikit-learn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
def load_and_prepare_data():
"""Load Iris dataset and prepare it for PCA"""
# Load the iris dataset
iris = load_iris()
# Create a DataFrame with feature names
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
# Add target variable
df['target'] = iris.target
df['target_names'] = pd.Categorical.from_codes(iris.target, iris.target_names)
return df, iris.feature_names
def perform_pca(data, feature_names):
"""Perform PCA on the dataset"""
# Separate features
X = data[feature_names]
# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
# Calculate explained variance ratio
explained_variance_ratio = pca.explained_variance_ratio_
# Get component loadings
loadings = pca.components_
return X_pca, explained_variance_ratio, loadings, pca
def plot_pca_results(X_pca, data, explained_variance_ratio):
"""Plot the PCA results"""
# Create figure
plt.figure(figsize=(10, 8))
# Create scatter plot for each class
targets = sorted(data['target'].unique())
target_names = sorted(data['target_names'].unique())
for target, target_name in zip(targets, target_names):
mask = data['target'] == target
plt.scatter(X_pca[mask, 0], X_pca[mask, 1],
label=target_name, alpha=0.8)
# Add labels and title
plt.xlabel(f'First Principal Component (Explains {explained_variance_ratio[0]:.2%} of variance)')
plt.ylabel(f'Second Principal Component (Explains {explained_variance_ratio[1]:.2%} of variance)')
plt.title('PCA of Iris Dataset')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
def plot_explained_variance(pca):
"""Plot cumulative explained variance ratio"""
plt.figure(figsize=(10, 6))
cumsum = np.cumsum(pca.explained_variance_ratio_)
plt.plot(range(1, len(cumsum) + 1), cumsum, 'bo-')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.title('Explained Variance vs. Number of Components')
plt.grid(True, alpha=0.3)
plt.show()
def visualize_feature_importance(loadings, feature_names):
"""Visualize feature importance in each principal component"""
plt.figure(figsize=(12, 6))
# Plot for PC1
plt.subplot(1, 2, 1)
plt.bar(feature_names, loadings[0])
plt.title('Feature Weights in First Principal Component')
plt.xticks(rotation=45)
# Plot for PC2
plt.subplot(1, 2, 2)
plt.bar(feature_names, loadings[1])
plt.title('Feature Weights in Second Principal Component')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
def main():
# Load and prepare data
print("Loading Iris dataset...")
data, feature_names = load_and_prepare_data()
# Perform PCA
print("\nPerforming PCA...")
X_pca, explained_variance_ratio, loadings, pca = perform_pca(data,
feature_names)
# Print explained variance
print("\nExplained Variance Ratio:")
print(f"PC1: {explained_variance_ratio[0]:.2%}")
print(f"PC2: {explained_variance_ratio[1]:.2%}")
print(f"Total: {sum(explained_variance_ratio):.2%}")
# Plot results
print("\nCreating visualizations...")
plot_pca_results(X_pca, data, explained_variance_ratio)
plot_explained_variance(pca)
visualize_feature_importance(loadings, feature_names)
# Print feature importance
print("\nFeature Weights in Principal Components:")
for i, component in enumerate(loadings):
print(f"\nPrincipal Component {i+1}:")
for fname, weight in zip(feature_names, component):
print(f"{fname}: {weight:.3f}")
if __name__ == "__main__":
main()
Output
Explanation
Basic Theory:
PCA is a technique that reduces the dimensionality of data while preserving as
much important information as possible. It transforms high-dimensional data
into a new set of features called principal components.
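Conceptually, the principal components are the eigenvectors of the covariance matrix of the standardized data. The short sketch below (an illustration of the idea, not part of the experiment code) reproduces the first two components' explained-variance ratios by hand:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)
cov = np.cov(X, rowvar=False)            # 4 x 4 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]        # sort descending by explained variance
explained_ratio = eigvals[order] / eigvals.sum()
X_2d = X @ eigvecs[:, order[:2]]         # project data onto the top-2 components
print("Explained variance ratio of PC1, PC2:", explained_ratio[:2])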
Code Functions:
1. load_and_prepare_data()
o Loads the famous Iris dataset (contains measurements of different
iris flowers)
o Creates a DataFrame with flower measurements and their species
names
o Each row represents one flower with its features and species type
2. perform_pca()
o Standardizes the data (makes all features have same scale)
o Applies PCA to reduce data to 2 dimensions
o Returns:
Transformed data
How much information each component preserves
Feature weights in each component
3. plot_pca_results()
o Creates a scatter plot showing flowers in the new 2D space
o Different colors for different iris species
o Shows how well species are separated after PCA
o Labels show how much variance each component explains
4. plot_explained_variance()
o Shows how much total information is preserved as we add
components
o Helps decide how many components to keep
5. visualize_feature_importance()
o Creates bar plots showing which original features contribute most
to each principal component
o Helps understand what each new component means
What the Code Does:
1. Takes 4-dimensional iris flower measurements
2. Reduces them to 2 dimensions while keeping most important patterns
3. Shows how well different iris species can be distinguished
4. Tells us which original measurements are most important
Why This is Useful:
Helps visualize high-dimensional data
Finds most important patterns in the data
Shows which original features matter most
Can help classify different types of iris flowers using fewer
measurements
Experiment-04
For a given set of training data examples stored in a .CSV file, implement and
demonstrate the Find-S algorithm to output a description of the set of all hypotheses
consistent with the training examples
Code:
import pandas as pd
import numpy as np
class FindS:
def __init__(self):
self.hypothesis = None
self.features = None
def initialize_hypothesis(self, num_features):
"""Initialize the most specific hypothesis"""
return ['ϕ'] * num_features
def is_positive_example(self, target):
"""Check if the example is positive"""
return target == 'Yes'
def generalize_hypothesis(self, example, current_hypothesis):
"""
Generalize the hypothesis to be consistent with the positive example
"""
new_hypothesis = []
for ex_val, hyp_val in zip(example, current_hypothesis):
# If hypothesis value is 'ϕ' (null), use the example value
if hyp_val == 'ϕ':
new_hypothesis.append(ex_val)
# If values match, keep the value
elif ex_val == hyp_val:
new_hypothesis.append(hyp_val)
# If values don't match, generalize to '?'
else:
new_hypothesis.append('?')
return new_hypothesis
def fit(self, data, target_column):
"""
Find the most specific hypothesis consistent with the training examples
Parameters:
data: pandas DataFrame containing the training examples
target_column: name of the target column
"""
# Separate features and target
X = data.drop(columns=[target_column])
y = data[target_column]
# Store feature names
self.features = X.columns.tolist()
# Initialize hypothesis
self.hypothesis = self.initialize_hypothesis(len(self.features))
# Process each training example
for index, row in X.iterrows():
# Only consider positive examples
if self.is_positive_example(y[index]):
self.hypothesis = self.generalize_hypothesis(
row.values.tolist(),
self.hypothesis
)
return self.hypothesis
def print_hypothesis(self):
"""Print the current hypothesis in a readable format"""
if self.hypothesis and self.features:
print("\nFinal Hypothesis:")
print("〈", end='')
for feature, value in zip(self.features, self.hypothesis):
print(f"{feature} = {value}, ", end='')
print("〉")
else:
print("No hypothesis found. Please run fit() first.")
def load_data(filename):
"""Load data from CSV file"""
try:
return pd.read_csv(filename)
except FileNotFoundError:
print(f"Error: File '{filename}' not found.")
return None
except Exception as e:
print(f"Error loading data: {str(e)}")
return None
def main():
# Example usage with sample data
print("Creating sample training data...")
# Create sample data if no file is provided
sample_data = {
'Sky': ['Sunny', 'Sunny', 'Rainy', 'Sunny'],
'Temperature': ['Warm', 'Warm', 'Cold', 'Warm'],
'Humidity': ['High', 'High', 'High', 'High'],
'Wind': ['Weak', 'Strong', 'Weak', 'Weak'],
'PlayTennis': ['Yes', 'Yes', 'No', 'Yes']
}
df = pd.DataFrame(sample_data)
print("\nTraining Data:")
print(df)
# Initialize and run Find-S algorithm
print("\nRunning Find-S algorithm...")
find_s = FindS()
find_s.fit(df, target_column='PlayTennis')
# Print results
find_s.print_hypothesis()
print("\nHypothesis Interpretation:")
print("- '?' means any value is acceptable for that attribute")
print("- 'ϕ' means no value has been observed (null)")
print("- Specific values indicate required values for that attribute")
if __name__ == "__main__":
main()
Output
Explanation
Key Concepts of Find-S Algorithm:
1. Purpose
Find-S aims to find the most specific hypothesis that is consistent with
training examples
It particularly focuses on positive training examples while ignoring
negative ones
The algorithm tries to identify essential patterns in features that lead to
positive outcomes
2. Hypothesis Space
Starts with the most specific hypothesis possible (null values)
Gradually generalizes this hypothesis as it processes positive examples
Uses three types of values in hypothesis:
o Specific values (required conditions)
o '?' (any value allowed)
o 'ϕ' (null/initial state)
3. Working Principle
Only processes positive examples in the training data
When a positive example is encountered, compares each attribute with
current hypothesis
Generalizes hypothesis only when necessary to accommodate new
positive examples
Never becomes more specific once generalized
4. Generalization Rules
If attribute matches current hypothesis: Keep current value
If current hypothesis is null (ϕ): Use the example's value
If mismatch occurs: Generalize to '?' (any value acceptable)
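A minimal trace of these rules on two hypothetical positive examples (the same attribute style as the sample data in the code) looks like this:

def generalize(hypothesis, example):
    # 'ϕ' -> take the example value; equal values -> keep; mismatch -> '?'
    return [e if h in ('ϕ', e) else '?' for h, e in zip(hypothesis, example)]

h = ['ϕ', 'ϕ', 'ϕ', 'ϕ']                                # most specific hypothesis
h = generalize(h, ['Sunny', 'Warm', 'High', 'Weak'])    # first positive example
print(h)                                                # ['Sunny', 'Warm', 'High', 'Weak']
h = generalize(h, ['Sunny', 'Warm', 'High', 'Strong'])  # second positive example
print(h)                                                # ['Sunny', 'Warm', 'High', '?']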
5. Advantages
Simple to understand and implement
Computationally efficient
Works well with consistent data
Provides clear, interpretable results
6. Limitations
Ignores negative examples completely
Cannot handle inconsistent training data
May not find the most general hypothesis
Assumes noise-free training data
7. Applications
Concept learning problems
Pattern recognition
Simple classification tasks
Educational purposes to understand basic machine learning concepts
8. Example Scenario
Consider learning when to play tennis based on weather conditions
Features might include sky condition, temperature, humidity, wind
Algorithm learns which conditions must be present for playing tennis
Gradually generalizes conditions that aren't strictly necessary
Experiment-05
Develop a program to implement k-Nearest Neighbour algorithm to classify the
randomly generated 100 values of x in the range of [0,1]. Perform the following based
on dataset generated.
a. Label the first 50 points {x1, …, x50} as follows: if (xi ≤ 0.5), then xi ∊ Class1,
else xi ∊ Class2
b. Classify the remaining points, x51, …, x100, using KNN. Perform this for
k = 1, 2, 3, 4, 5, 20, 30
Code:
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
class KNN:
def __init__(self, k):
self.k = k
self.X_train = None
self.y_train = None
def fit(self, X, y):
"""Store training data"""
self.X_train = X
self.y_train = y
def predict(self, X):
"""Predict class for each input value"""
predictions = []
for x in X:
# Calculate distances to all training points
distances = np.abs(self.X_train - x)
# Get indices of k nearest neighbors
k_nearest_indices = np.argsort(distances)[:self.k]
# Get classes of k nearest neighbors
k_nearest_labels = self.y_train[k_nearest_indices]
# Perform majority voting
most_common = Counter(k_nearest_labels).most_common(1)
predictions.append(most_common[0][0])
return np.array(predictions)
def generate_data():
"""Generate and label the dataset"""
# Generate 100 random points in [0,1]
np.random.seed(42) # For reproducibility
X = np.random.rand(100)
# Label first 50 points
y = np.zeros(100)
y[:50] = np.where(X[:50] <= 0.5, 1, 2)
return X, y
def plot_results(X_train, y_train, X_test, y_pred, k):
"""Plot the results for a given k value"""
plt.figure(figsize=(12, 4))
# Plot training data
plt.scatter(X_train[y_train == 1], np.zeros_like(X_train[y_train == 1]),
c='blue', label='Class 1 (Training)', marker='o')
plt.scatter(X_train[y_train == 2], np.zeros_like(X_train[y_train == 2]),
c='red', label='Class 2 (Training)', marker='o')
# Plot test data predictions
plt.scatter(X_test[y_pred == 1], np.ones_like(X_test[y_pred == 1])*0.1,
c='lightblue', label='Class 1 (Predicted)', marker='^')
plt.scatter(X_test[y_pred == 2], np.ones_like(X_test[y_pred == 2])*0.1,
c='lightcoral', label='Class 2 (Predicted)', marker='^')
plt.title(f'KNN Classification Results (k={k})')
plt.xlabel('x')
plt.yticks([])
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
def analyze_boundary_points(X_test, y_pred, k):
"""Analyze and print details about boundary points"""
boundary_points = []
# Find points where predictions change
for i in range(1, len(y_pred)):
if y_pred[i] != y_pred[i-1]:
boundary_points.append(X_test[i])
if boundary_points:
print(f"\nDecision boundaries for k={k}:")
for point in sorted(boundary_points):
print(f"x = {point:.3f}")
else:
print(f"\nNo clear decision boundaries found for k={k}")
def main():
# Generate data
print("Generating dataset...")
X, y = generate_data()
# Split into training and test sets
X_train, y_train = X[:50], y[:50]
X_test, y_test = X[50:], y[50:]
# Sort test data for better visualization
sort_idx = np.argsort(X_test)
X_test = X_test[sort_idx]
# Try different k values
k_values = [1, 2, 3, 4, 5, 20, 30]
for k in k_values:
print(f"\nPerforming classification with k={k}")
# Create and train KNN classifier
knn = KNN(k=k)
knn.fit(X_train, y_train)
# Make predictions
y_pred = knn.predict(X_test)
# Plot results
plot_results(X_train, y_train, X_test, y_pred, k)
# Analyze decision boundaries
analyze_boundary_points(X_test, y_pred, k)
# Calculate and print summary statistics
class1_pred = np.sum(y_pred == 1)
class2_pred = np.sum(y_pred == 2)
print(f"\nPrediction Summary for k={k}:")
print(f"Class 1: {class1_pred} points ({class1_pred/len(y_pred)*100:.1f}
%)")
print(f"Class 2: {class2_pred} points ({class2_pred/len(y_pred)*100:.1f}
%)")
if __name__ == "__main__":
main()
Output
Explanation
1. Core KNN Implementation
The KNN class implements the K-Nearest Neighbors algorithm with two
main methods:
o fit: Stores training data and labels
o predict: Makes predictions by finding k nearest neighbors and
using majority voting
The algorithm uses absolute distance (np.abs) to measure proximity
between points
For each test point, it finds k closest training points and takes a majority
vote
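The core prediction step can be sketched in a few lines for a single hypothetical query point (k = 3, 1-D data as in the experiment):

import numpy as np
from collections import Counter

X_train = np.array([0.1, 0.2, 0.4, 0.6, 0.9])
y_train = np.array([1, 1, 1, 2, 2])
x, k = 0.45, 3

distances = np.abs(X_train - x)                  # distance to every training point
nearest = np.argsort(distances)[:k]              # indices of the k closest points
vote = Counter(y_train[nearest]).most_common(1)  # majority vote among the neighbours
print("Predicted class:", vote[0][0])            # -> 1 (neighbours at 0.4, 0.6, 0.2)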
2. Data Generation
Creates a synthetic dataset with 100 random points in range [0,1]
First 50 points are labeled based on a simple rule:
o Points ≤ 0.5 get label 1
o Points > 0.5 get label 2
Data is split into training (first 50 points) and testing (remaining 50
points)
3. Visualization Components
plot_results function creates visual representation showing:
o Training data points (blue for class 1, red for class 2)
o Predicted classifications (light blue/coral triangles)
o Clear legend and grid for better readability
o Uses different markers for training (circles) vs predictions
(triangles)
4. Decision Boundary Analysis
analyze_boundary_points function:
o Identifies points where predictions change from one class to
another
o Prints the x-coordinates of these boundary points
o Helps understand where the algorithm switches between classes
5. Main Execution Flow
Tests multiple k values: [1, 2, 3, 4, 5, 20, 30]
For each k value:
o Creates and trains KNN classifier
o Makes predictions on test data
o Visualizes results
o Analyzes decision boundaries
o Prints summary statistics (percentage of each class)
6. Key Features
Uses numpy for efficient numerical computations
Implements Counter for majority voting
Includes comprehensive visualization
Provides detailed analysis of classification boundaries
Shows impact of different k values on predictions
7. Insights from Implementation
Smaller k values lead to more complex decision boundaries
Larger k values create smoother, more generalized boundaries
The choice of k significantly impacts classification results
Visualization helps understand algorithm behavior
Experiment-06
Implement the non-parametric Locally Weighted Regression algorithm in order to fit
data points. Select appropriate data set for your experiment and draw graphs
Code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
def generate_sample_data(n_samples=100, noise=10):
"""Generate sample data with non-linear pattern"""
X = np.linspace(0, 10, n_samples)
y = 2 * np.sin(X) + X/2 + np.random.normal(0, noise/10, n_samples)
return X, y
def kernel(x, x_i, tau=0.5):
"""Gaussian kernel function for weight calculation"""
return np.exp(-(x - x_i)**2 / (2 * tau**2))
def lowess(X, y, x_pred, tau=0.5):
"""
Locally Weighted Regression implementation
Parameters:
-----------
X : array-like
Training input features
y : array-like
Target values
x_pred : array-like
Points at which to make predictions
tau : float
Bandwidth parameter controlling smoothness
Returns:
--------
array-like
Predicted values at x_pred points
"""
# Ensure arrays are 1D
X = np.ravel(X)
y = np.ravel(y)
x_pred = np.ravel(x_pred)
y_pred = []
for x in x_pred:
# Calculate weights for all points
weights = kernel(x, X, tau)
# Weighted least squares matrices
W = np.diag(weights)
X_aug = np.column_stack([np.ones_like(X), X]) # Add bias term
# Calculate weighted least squares parameters
theta = np.linalg.inv(X_aug.T @ W @ X_aug) @ X_aug.T @ W @ y
# Make prediction
x_aug = np.array([1, x])
y_pred.append(float(x_aug @ theta))
return np.array(y_pred)
# Generate sample data
np.random.seed(42)
X, y = generate_sample_data(n_samples=100, noise=10)
# Generate points for prediction
X_pred = np.linspace(0, 10, 200)
# Fit LOWESS with different bandwidth parameters
y_pred_smooth = lowess(X, y, X_pred, tau=0.3) # More local fitting
y_pred_medium = lowess(X, y, X_pred, tau=0.8) # Medium smoothing
y_pred_rough = lowess(X, y, X_pred, tau=2.0) # More global fitting
# Plotting
plt.figure(figsize=(12, 6))
plt.scatter(X, y, color='blue', alpha=0.5, label='Data points')
plt.plot(X_pred, y_pred_smooth, 'r-', label='τ = 0.3 (More local)', linewidth=2)
plt.plot(X_pred, y_pred_medium, 'g-', label='τ = 0.8 (Medium)', linewidth=2)
plt.plot(X_pred, y_pred_rough, 'y-', label='τ = 2.0 (More global)', linewidth=2)
plt.xlabel('X')
plt.ylabel('y')
plt.title('Locally Weighted Regression with Different Bandwidth Parameters')
plt.legend()
plt.grid(True)
plt.show()
Output
Explanation
1. Data Generation
def generate_sample_data(n_samples=100, noise=10):
Creates non-linear sample data using sine function
Adds random noise to make it more realistic
Pattern is: y = 2 * sin(x) + x/2 + noise
2. Kernel Function
def kernel(x, x_i, tau=0.5):
return np.exp(-(x - x_i)**2 / (2 * tau**2))
Implements Gaussian kernel for weight calculation
Gives higher weights to nearby points
tau (bandwidth) controls how quickly weight decreases with distance
Smaller tau = more local fitting
Larger tau = more global smoothing
3. LOWESS Implementation
def lowess(X, y, x_pred, tau=0.5):
Key steps:
For each prediction point:
o Calculate weights for all training points using kernel function
o Create weight matrix (W) and augmented feature matrix (X_aug)
o Solve weighted least squares: θ = (X^T W X)^(-1) X^T W y
o Make prediction using calculated parameters
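The weighted least-squares step can be illustrated at a single query point x0 (a minimal sketch with made-up data, following the same θ = (XᵀWX)⁻¹XᵀWy formula):

import numpy as np

X = np.linspace(0, 10, 50)
y = 2 * np.sin(X) + X / 2
x0, tau = 5.0, 0.8

w = np.exp(-(X - x0) ** 2 / (2 * tau ** 2))    # Gaussian weights centred on x0
W = np.diag(w)
X_aug = np.column_stack([np.ones_like(X), X])  # add bias column
theta = np.linalg.inv(X_aug.T @ W @ X_aug) @ X_aug.T @ W @ y
y0 = np.array([1.0, x0]) @ theta               # local linear prediction at x0
print(f"Prediction at x0 = {x0}: {y0:.3f}")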
4. Visualization Setup
Generates 100 sample points with noise
Creates 200 evenly spaced points for prediction curve
Tests three different bandwidth (tau) values:
o τ = 0.3: More local fitting (follows data closely)
o τ = 0.8: Medium smoothing
o τ = 2.0: More global fitting (smoother curve)
5. Key Characteristics of LOWESS
Non-parametric regression technique
Adapts to local structure of data
Bandwidth parameter controls smoothness:
o Small tau: More flexible, might overfit
o Large tau: Smoother, might underfit
Computationally intensive (calculates weights for each prediction)
6. Main Differences in Results
Red line (τ = 0.3): Follows local variations closely
Green line (τ = 0.8): Balanced between local and global
Yellow line (τ = 2.0): Shows general trend, ignores local variations
7. Advantages and Disadvantages
Advantages:
No assumption about global function shape
Handles non-linear relationships well
Flexible local fitting
Disadvantages:
Computationally expensive
Sensitive to bandwidth parameter
Can perform poorly at boundaries
Experiment-07
Develop a program to demonstrate the working of Linear Regression and Polynomial
Regression. Use Boston Housing Dataset for Linear Regression and Auto MPG Dataset
(for vehicle fuel efficiency prediction) for Polynomial Regression.
Code:
import pandas as pd
# Load Boston Housing dataset
url = "https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv"
boston_df = pd.read_csv(url)
# Print column names
print("Available columns in the dataset:")
print(boston_df.columns.tolist())
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
import warnings
warnings.filterwarnings('ignore')
# Part 1: Linear Regression with Boston Housing Dataset
print("Part 1: Linear Regression - Boston Housing Dataset")
print("-" * 50)
# Load Boston Housing dataset
url = "https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv"
boston_df = pd.read_csv(url)
# Features and target (using correct column names)
X_boston = boston_df.drop('medv', axis=1) # All columns except target
y_boston = boston_df['medv'] # median house value
# Print dataset info
print("\nDataset Information:")
print(f"Number of samples: {len(X_boston)}")
print(f"Number of features: {len(X_boston.columns)}")
print("\nFeatures:")
for name in X_boston.columns:
print(f"- {name}")
# Split the data
X_train_boston, X_test_boston, y_train_boston, y_test_boston = train_test_split(
X_boston, y_boston, test_size=0.2, random_state=42)
# Scale the features
scaler = StandardScaler()
X_train_boston_scaled = scaler.fit_transform(X_train_boston)
X_test_boston_scaled = scaler.transform(X_test_boston)
# Train Linear Regression model
lr_model = LinearRegression()
lr_model.fit(X_train_boston_scaled, y_train_boston)
# Make predictions
y_pred_boston = lr_model.predict(X_test_boston_scaled)
# Calculate metrics
mse_boston = mean_squared_error(y_test_boston, y_pred_boston)
rmse_boston = np.sqrt(mse_boston)
r2_boston = r2_score(y_test_boston, y_pred_boston)
print("\nLinear Regression Results:")
print(f"Mean Squared Error: {mse_boston:.2f}")
print(f"Root Mean Squared Error: {rmse_boston:.2f}")
print(f"R² Score: {r2_boston:.2f}")
# Feature importance analysis
feature_importance = pd.DataFrame({
'Feature': X_boston.columns,
'Coefficient': lr_model.coef_
})
feature_importance['Abs_Coefficient'] = abs(feature_importance['Coefficient'])
feature_importance = feature_importance.sort_values('Abs_Coefficient',
ascending=False)
print("\nFeature Importance:")
print(feature_importance[['Feature', 'Coefficient']].to_string(index=False))
# Visualize feature importance
plt.figure(figsize=(12, 6))
plt.bar(feature_importance['Feature'], feature_importance['Coefficient'])
plt.xticks(rotation=45)
plt.title('Feature Importance in Boston Housing Price Prediction')
plt.xlabel('Features')
plt.ylabel('Coefficient Value')
plt.tight_layout()
plt.show()
# Plot actual vs predicted values
plt.figure(figsize=(10, 6))
plt.scatter(y_test_boston, y_pred_boston, alpha=0.5)
plt.plot([y_test_boston.min(), y_test_boston.max()], [y_test_boston.min(),
y_test_boston.max()], 'r--', lw=2)
plt.xlabel('Actual Prices ($1000s)')
plt.ylabel('Predicted Prices ($1000s)')
plt.title('Actual vs Predicted Housing Prices')
plt.tight_layout()
plt.show()
# Part 2: Polynomial Regression with Auto MPG Dataset
print("\nPart 2: Polynomial Regression - Auto MPG Dataset")
print("-" * 50)
# Load Auto MPG dataset
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
'Acceleration', 'Model Year', 'Origin', 'Car Name']
df = pd.read_csv(url, names=column_names, delim_whitespace=True)
# Clean the data
df = df.replace('?', np.nan)
df = df.dropna()
df['Horsepower'] = df['Horsepower'].astype(float)
# Select features for polynomial regression
X_mpg = df[['Horsepower']].values
y_mpg = df['MPG'].values
# Scale features for polynomial regression
scaler_mpg = StandardScaler()
X_mpg_scaled = scaler_mpg.fit_transform(X_mpg)
# Split the data
X_train_mpg, X_test_mpg, y_train_mpg, y_test_mpg = train_test_split(
X_mpg_scaled, y_mpg, test_size=0.2, random_state=42)
# Create and train models with different polynomial degrees
degrees = [1, 2, 3]
plt.figure(figsize=(15, 5))
for i, degree in enumerate(degrees, 1):
# Create polynomial features
poly_features = PolynomialFeatures(degree=degree)
X_train_poly = poly_features.fit_transform(X_train_mpg)
X_test_poly = poly_features.transform(X_test_mpg)
# Train model
poly_model = LinearRegression()
poly_model.fit(X_train_poly, y_train_mpg)
# Make predictions
y_pred_poly = poly_model.predict(X_test_poly)
# Calculate metrics
mse_poly = mean_squared_error(y_test_mpg, y_pred_poly)
rmse_poly = np.sqrt(mse_poly)
r2_poly = r2_score(y_test_mpg, y_pred_poly)
print(f"\nPolynomial Regression (degree {degree}) Results:")
print(f"Mean Squared Error: {mse_poly:.2f}")
print(f"Root Mean Squared Error: {rmse_poly:.2f}")
print(f"R² Score: {r2_poly:.2f}")
# Plot results
plt.subplot(1, 3, i)
plt.scatter(X_test_mpg, y_test_mpg, color='blue', alpha=0.5, label='Actual')
# Sort points for smooth curve
X_sort = np.sort(X_test_mpg, axis=0)
X_sort_poly = poly_features.transform(X_sort)
y_sort_pred = poly_model.predict(X_sort_poly)
plt.plot(X_sort, y_sort_pred, color='red', label='Predicted')
plt.xlabel('Horsepower (scaled)')
plt.ylabel('MPG')
plt.title(f'Polynomial Regression (degree {degree})')
plt.legend()
plt.tight_layout()
plt.show()
Output
Explanation
1. Part 1: Linear Regression with Boston Housing Dataset
Key Components:
Uses the Boston Housing dataset to predict house prices
Features include various neighborhood characteristics
Target variable is 'medv' (median house value)
Implementation Steps:
# Data Preparation
- Loads dataset from URL
- Splits features (X) and target (y)
- Uses train_test_split for data division
- Applies StandardScaler for feature normalization
# Model Training
- Creates LinearRegression model
- Fits model on scaled training data
- Makes predictions on test set
# Evaluation
- Calculates MSE, RMSE, and R² metrics
- Analyzes feature importance through coefficients
- Visualizes feature importance with bar plot
- Creates actual vs predicted scatter plot
2. Part 2: Polynomial Regression with Auto MPG Dataset
Key Components:
Uses Auto MPG dataset to predict fuel efficiency
Focuses on Horsepower as main feature
Tests three polynomial degrees (1, 2, 3)
Implementation Steps:
# Data Preparation
- Loads and cleans MPG dataset
- Handles missing values ('?')
- Scales features using StandardScaler
# Model Training
- Creates polynomial features for each degree
- Trains separate models for each degree
- Makes predictions using each model
# Evaluation
- Calculates metrics for each polynomial degree
- Creates subplots showing fit for each degree
- Compares performance across degrees
3. Key Visualizations:
Feature importance bar chart for Boston Housing
Actual vs predicted scatter plot for house prices
Three subplots showing polynomial fits of different degrees
4. Important Metrics Tracked:
Mean Squared Error (MSE)
Root Mean Squared Error (RMSE)
R² Score (coefficient of determination)
5. Key Insights:
Shows how feature scaling improves model performance
Demonstrates overfitting risk with higher polynomial degrees (see the sketch after this list)
Illustrates importance of different features in housing prices
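To make the overfitting point concrete, the following minimal sketch (on synthetic data, not the Auto MPG set) compares training and test R² as the polynomial degree grows; a large gap between the two indicates overfitting:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 3, 60)).reshape(-1, 1)
y = np.sin(2 * X).ravel() + rng.normal(0, 0.2, 60)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 3, 10):
    poly = PolynomialFeatures(degree=degree)
    model = LinearRegression().fit(poly.fit_transform(X_tr), y_tr)
    r2_tr = r2_score(y_tr, model.predict(poly.transform(X_tr)))
    r2_te = r2_score(y_te, model.predict(poly.transform(X_te)))
    print(f"degree={degree:2d}  train R2={r2_tr:.3f}  test R2={r2_te:.3f}")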
Experiment-08
Develop a program to demonstrate the working of the decision tree algorithm. Use
Breast Cancer Data set for building the decision tree and apply this knowledge to
classify a new sample.
Code:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
# Load the breast cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
# Create and train the decision tree classifier
dt_classifier = DecisionTreeClassifier(max_depth=4, random_state=42)
dt_classifier.fit(X_train, y_train)
# Make predictions on the test set
y_pred = dt_classifier.predict(X_test)
# Print model performance metrics
print("Model Performance Metrics:")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Malignant',
'Benign']))
# Create confusion matrix visualization
plt.figure(figsize=(10, 8))
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
# Visualize the decision tree
plt.figure(figsize=(20,10))
plot_tree(dt_classifier, feature_names=data.feature_names,
class_names=['Malignant', 'Benign'], filled=True, rounded=True)
plt.title('Decision Tree Visualization')
# Function to classify a new sample
def classify_new_sample(sample, feature_names=data.feature_names):
"""
Classify a new sample using the trained decision tree model.
Parameters:
sample (list or array): List of feature values in the same order as the training
data
feature_names (list): List of feature names for reference
Returns:
tuple: (prediction, probability)
"""
sample = np.array(sample).reshape(1, -1)
prediction = dt_classifier.predict(sample)
probability = dt_classifier.predict_proba(sample)
print("\nClassification Results:")
print(f"Prediction: {'Benign' if prediction[0] == 1 else 'Malignant'}")
print(f"Probability: Malignant: {probability[0][0]:.2f}, Benign:
{probability[0][1]:.2f}")
# Print feature importance for this prediction
print("\nTop 5 Most Important Features:")
importances = dict(zip(feature_names, dt_classifier.feature_importances_))
sorted_importances = sorted(importances.items(), key=lambda x: x[1],
reverse=True)[:5]
for feature, importance in sorted_importances:
print(f"{feature}: {importance:.4f}")
return prediction[0], probability[0]
# Example of using the classifier with a new sample
# Using mean values from the dataset as an example
example_sample = X_train.mean(axis=0)
print("\nExample Classification:")
classify_new_sample(example_sample)
Output
Explanation
1. Data Preparation and Model Setup
# Loads breast cancer dataset from sklearn
# Features: Various cell nucleus measurements
# Target: Binary (Malignant/Benign)
# Splits data: 80% training, 20% testing
2. Model Configuration
Uses DecisionTreeClassifier with:
o max_depth=4 (prevents overfitting)
o random_state=42 (reproducibility)
Fits model using training data
3. Performance Evaluation Components:
Classification Report shows:
o Precision: Accuracy of positive predictions
o Recall: Ability to find all positive cases
o F1-score: Balance between precision and recall
o Support: Number of samples per class
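These metrics come directly from the confusion matrix. A minimal sketch with hypothetical counts:

import numpy as np

# rows = true label, columns = predicted label: [[TN, FP], [FN, TP]]
cm = np.array([[40, 3],
               [2, 69]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)                     # accuracy of positive predictions
recall = tp / (tp + fn)                        # ability to find all positive cases
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.2f}  recall={recall:.2f}  f1={f1:.2f}")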
4. Visualization Elements:
Confusion Matrix Heatmap:
o Shows true vs predicted labels
o Blue intensity indicates number of cases
o Numbers show exact count of predictions
Decision Tree Visualization:
o Shows complete tree structure
o max_depth=4 keeps it interpretable
o Color-coded nodes show class distribution
o Shows feature splits and thresholds
5. Sample Classification Function
def classify_new_sample(sample, feature_names):
Provides:
Binary prediction (Malignant/Benign)
Probability scores for each class
Top 5 most influential features
Feature importance scores
6. Key Features:
Binary Classification Task
Interpretable Model Structure
Feature Importance Analysis
Probability Estimates
Visual Decision Path
7. Use Cases:
Medical Diagnosis Support
Feature Importance Understanding
Risk Assessment
Decision Process Visualization
Experiment-09
Develop a program to implement the Naive Bayesian classifier considering Olivetti Face
Data set for training.
Compute the accuracy of the classifier, considering a few test data sets.
Code:
import numpy as np
from sklearn.datasets import fetch_olivetti_faces
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
# Load the Olivetti faces dataset
faces = fetch_olivetti_faces()
X = faces.data
y = faces.target
# Function to display sample faces
def display_sample_faces(X, y, num_samples=5):
"""Display sample faces from the dataset"""
fig, axes = plt.subplots(1, num_samples, figsize=(12, 3))
for i, ax in enumerate(axes):
ax.imshow(X[i].reshape(64, 64), cmap='gray')
ax.set_title(f'Person {y[i]}')
ax.axis('off')
plt.tight_layout()
plt.show()
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
# Initialize and train the Naive Bayes classifier
nb_classifier = GaussianNB()
nb_classifier.fit(X_train, y_train)
# Make predictions
y_pred = nb_classifier.predict(X_test)
# Calculate accuracy
accuracy = nb_classifier.score(X_test, y_test)
# Perform cross-validation
cv_scores = cross_val_score(nb_classifier, X, y, cv=5)
# Print performance metrics
print("Performance Metrics:")
print(f"\nAccuracy on test set: {accuracy:.4f}")
print("\nCross-validation scores:")
for i, score in enumerate(cv_scores, 1):
print(f"Fold {i}: {score:.4f}")
print(f"Mean CV Score: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
# Create confusion matrix visualization
plt.figure(figsize=(12, 8))
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
# Function to test the classifier on specific samples
def test_specific_samples(classifier, X_test, y_test, num_samples=5):
"""Test the classifier on specific samples and display results"""
# Randomly select samples
indices = np.random.choice(len(X_test), num_samples, replace=False)
X_samples = X_test[indices]
y_true = y_test[indices]
# Make predictions
y_pred = classifier.predict(X_samples)
probabilities = classifier.predict_proba(X_samples)
# Display results
fig, axes = plt.subplots(2, num_samples, figsize=(15, 6))
for i in range(num_samples):
# Display the face
axes[0, i].imshow(X_samples[i].reshape(64, 64), cmap='gray')
axes[0, i].axis('off')
# Display prediction information
axes[1, i].axis('off')
prediction_text = f'True: {y_true[i]}\nPred: {y_pred[i]}\n'
prediction_text += f'Prob: {probabilities[i][y_pred[i]]:.2f}'
axes[1, i].text(0.5, 0.5, prediction_text,
ha='center', va='center')
# Add color coding for correct/incorrect predictions
if y_true[i] == y_pred[i]:
axes[0, i].set_title('Correct', color='green')
else:
axes[0, i].set_title('Incorrect', color='red')
plt.tight_layout()
plt.show()
# Display sample faces from the dataset
print("\nDisplaying sample faces from the dataset:")
display_sample_faces(X, y)
# Test the classifier on specific samples
print("\nTesting classifier on specific samples:")
test_specific_samples(nb_classifier, X_test, y_test)
# Function to analyze misclassifications
def analyze_misclassifications(X_test, y_test, y_pred):
"""Analyze and display misclassified samples"""
misclassified = X_test[y_test != y_pred]
true_labels = y_test[y_test != y_pred]
pred_labels = y_pred[y_test != y_pred]
print(f"\nTotal misclassifications: {len(misclassified)}")
# Display some misclassified examples
num_display = min(5, len(misclassified))
if num_display > 0:
fig, axes = plt.subplots(1, num_display, figsize=(12, 3))
for i in range(num_display):
if num_display == 1:
ax = axes
else:
ax = axes[i]
ax.imshow(misclassified[i].reshape(64, 64), cmap='gray')
ax.set_title(f'True: {true_labels[i]}\nPred: {pred_labels[i]}')
ax.axis('off')
plt.tight_layout()
plt.show()
# Analyze misclassifications
print("\nAnalyzing misclassifications:")
analyze_misclassifications(X_test, y_test, y_pred)
Output
Explanation
1. Dataset and Setup
Uses Olivetti faces dataset (400 images of 40 people)
Each image is 64x64 pixels in grayscale
Features are flattened pixel values
Target is person identifier (0-39)
2. Key Functions:
a) Display Sample Faces:
def display_sample_faces(X, y, num_samples=5):
Shows sample faces from dataset
Displays grayscale images with person ID
Helps visualize input data
b) Test Specific Samples:
def test_specific_samples(classifier, X_test, y_test, num_samples=5):
Tests classifier on random samples
Shows both image and predictions
Color codes correct (green) vs incorrect (red) predictions
Displays prediction probabilities
c) Analyze Misclassifications:
def analyze_misclassifications(X_test, y_test, y_pred):
Identifies misclassified faces
Shows true vs predicted labels
Helps understand where model fails
3. Model Implementation
Uses GaussianNB (Gaussian Naive Bayes)
Performs 80-20 train-test split
Includes cross-validation (5 folds)
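Under the hood, GaussianNB models each feature per class with a Gaussian and picks the class with the highest log-prior plus log-likelihood. A minimal one-feature, two-class sketch of that idea (toy numbers, not the Olivetti data):

import numpy as np

def log_gauss(x, mu, sd):
    # log of the Gaussian density N(mu, sd^2) evaluated at x
    return -0.5 * np.log(2 * np.pi * sd ** 2) - (x - mu) ** 2 / (2 * sd ** 2)

stats = {0: (0.2, 0.05), 1: (0.6, 0.10)}   # per-class (mean, std) learned from training data
priors = {0: 0.5, 1: 0.5}
x = 0.35                                   # feature value of a new sample

scores = {c: np.log(priors[c]) + log_gauss(x, mu, sd) for c, (mu, sd) in stats.items()}
print("Predicted class:", max(scores, key=scores.get))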
4. Performance Evaluation:
Accuracy on test set
Cross-validation scores
Detailed classification report
Confusion matrix visualization
Misclassification analysis
5. Visualization Components:
Sample face display
Confusion matrix heatmap
Test results with probability scores
Misclassified examples
6. Key Features:
Face recognition capability
Probability estimation
Error analysis
Visual result presentation
Cross-validation performance
7. Notable Aspects:
Handles high-dimensional data (4096 pixels)
Provides probability estimates
Visual feedback for predictions
Comprehensive error analysis
Experiment-10
Develop a program to implement k-means clustering using Wisconsin Breast Cancer
data set and visualize the clustering result.
Code:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import seaborn as sns
# Load the Wisconsin Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target # We'll use this only for evaluation
# Create a DataFrame with feature names
df = pd.DataFrame(X, columns=data.feature_names)
# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Function to determine optimal k using elbow method
def plot_elbow_curve(X, max_k=10):
    inertias = []
    silhouette_scores = []
    k_values = range(2, max_k + 1)

    for k in k_values:
        kmeans = KMeans(n_clusters=k, random_state=42)
        kmeans.fit(X)
        inertias.append(kmeans.inertia_)
        silhouette_scores.append(silhouette_score(X, kmeans.labels_))

    # Plot elbow curve
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

    # Inertia plot
    ax1.plot(k_values, inertias, 'bo-')
    ax1.set_xlabel('Number of Clusters (k)')
    ax1.set_ylabel('Inertia')
    ax1.set_title('Elbow Method')

    # Silhouette score plot
    ax2.plot(k_values, silhouette_scores, 'ro-')
    ax2.set_xlabel('Number of Clusters (k)')
    ax2.set_ylabel('Silhouette Score')
    ax2.set_title('Silhouette Analysis')

    plt.tight_layout()
    plt.show()

    return k_values[np.argmax(silhouette_scores)]
# Find optimal k
optimal_k = plot_elbow_curve(X_scaled)
print(f"\nOptimal number of clusters based on silhouette score: {optimal_k}")
# Perform k-means clustering with optimal k
kmeans = KMeans(n_clusters=optimal_k, random_state=42)
cluster_labels = kmeans.fit_predict(X_scaled)
# Perform PCA for visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
# Create visualization of clusters
plt.figure(figsize=(12, 8))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=cluster_labels, cmap='viridis')
plt.title('K-means Clustering Results (PCA-reduced data)')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.colorbar(scatter, label='Cluster')
plt.show()
# Compare clustering results with actual diagnosis
comparison_df = pd.DataFrame({
    'Cluster': cluster_labels,
    'Actual_Diagnosis': y
})
print("\nCluster vs Actual Diagnosis Distribution:")
print(pd.crosstab(comparison_df['Cluster'], comparison_df['Actual_Diagnosis']))
# Analyze cluster characteristics
def analyze_clusters(X, labels, feature_names):
    """Analyze and visualize characteristics of each cluster"""
    # Create DataFrame with features and cluster labels
    df_analysis = pd.DataFrame(X, columns=feature_names)
    df_analysis['Cluster'] = labels

    # Calculate mean values for each feature in each cluster
    cluster_means = df_analysis.groupby('Cluster').mean()

    # Create heatmap of cluster characteristics
    plt.figure(figsize=(15, 8))
    sns.heatmap(cluster_means, cmap='coolwarm', center=0, annot=True, fmt='.2f',
                xticklabels=True, yticklabels=True)
    plt.title('Cluster Characteristics (Feature Means)')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()

    return cluster_means
# Analyze cluster characteristics
print("\nAnalyzing cluster characteristics:")
cluster_means = analyze_clusters(X_scaled, cluster_labels, data.feature_names)
# Visualize feature importance for clustering
def plot_feature_importance(kmeans, feature_names):
    """Plot feature importance based on cluster centroids"""
    # Calculate the variance of centroids for each feature
    centroid_variance = np.var(kmeans.cluster_centers_, axis=0)

    # Create DataFrame for feature importance
    feature_importance = pd.DataFrame({
        'Feature': feature_names,
        'Importance': centroid_variance
    }).sort_values('Importance', ascending=False)

    # Plot feature importance
    plt.figure(figsize=(12, 6))
    sns.barplot(x='Importance', y='Feature', data=feature_importance.head(10))
    plt.title('Top 10 Most Important Features for Clustering')
    plt.tight_layout()
    plt.show()

    return feature_importance
# Plot feature importance
print("\nAnalyzing feature importance:")
feature_importance = plot_feature_importance(kmeans, data.feature_names)
# Function to predict cluster for new samples
def predict_cluster(sample, scaler, kmeans, feature_names):
    """Predict cluster for a new sample"""
    # Ensure sample is in correct format
    if isinstance(sample, list):
        sample = np.array(sample).reshape(1, -1)

    # Scale the sample
    sample_scaled = scaler.transform(sample)

    # Predict cluster
    cluster = kmeans.predict(sample_scaled)[0]

    # Get distances to all cluster centers
    distances = kmeans.transform(sample_scaled)[0]

    print(f"\nPredicted Cluster: {cluster}")
    print("\nDistances to cluster centers:")
    for i, dist in enumerate(distances):
        print(f"Cluster {i}: {dist:.2f}")

    return cluster, distances
# Example of using the prediction function
print("\nExample prediction for a new sample:")
example_sample = X[0:1] # Using first sample as example
predicted_cluster, distances = predict_cluster(example_sample, scaler, kmeans, data.feature_names)
Output
(Console output and figures: elbow curve and silhouette score plots, PCA-reduced cluster scatter plot, cluster vs actual diagnosis crosstab, cluster characteristics heatmap, feature importance bar plot, and an example cluster prediction with centroid distances.)
Explanation
1. Data Preparation:
Loads breast cancer dataset
Standardizes features using StandardScaler
Creates DataFrame with feature names
2. Key Functions:
a) Elbow Method Analysis:
def plot_elbow_curve(X, max_k=10):
Determines optimal number of clusters
Plots inertia (within-cluster sum of squares)
Calculates silhouette scores
Returns optimal k based on silhouette analysis
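For reference, the silhouette coefficient of a sample i compares its mean distance a(i) to the other points in its own cluster with its mean distance b(i) to the points in the nearest other cluster:
s(i) = (b(i) - a(i)) / max(a(i), b(i))
Values near 1 indicate compact, well-separated clusters; silhouette_score reports the mean of s(i) over all samples, which is why the k that maximizes it is returned.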
b) Cluster Analysis:
def analyze_clusters(X, labels, feature_names):
Calculates mean values for each feature per cluster
Creates heatmap of cluster characteristics
Shows feature patterns in each cluster
c) Feature Importance:
def plot_feature_importance(kmeans, feature_names):
Calculates feature importance based on centroid variance
Visualizes top 10 most important features
Helps understand which features drive clustering
3. Visualization Components:
Elbow curve and silhouette score plots
PCA-reduced cluster visualization
Cluster characteristics heatmap
Feature importance bar plot
4. Model Implementation:
Uses optimal k from silhouette analysis
Performs clustering on standardized data
Reduces dimensionality with PCA for visualization
Compares clusters with actual diagnosis
5. Cluster Prediction:
def predict_cluster(sample, scaler, kmeans, feature_names):
Predicts cluster for new samples
Shows distances to all cluster centers
Provides confidence measure through distances
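K-means itself does not produce probabilities, but the centroid distances returned by predict_cluster can be turned into a rough confidence score, for example with a softmax over negative distances (a sketch; this transform is an illustrative assumption, not part of the original program):
import numpy as np

def distances_to_confidence(distances):
    """Convert centroid distances into pseudo-probabilities (smaller distance = higher score)"""
    d = np.asarray(distances, dtype=float)
    scores = np.exp(-(d - d.min()))  # shift by the minimum for numerical stability
    return scores / scores.sum()

# Example: reuse the distances returned by predict_cluster for the sample above
confidence = distances_to_confidence(distances)
for i, c in enumerate(confidence):
    print(f"Cluster {i}: {c:.2%}")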
6. Key Features:
Automatic optimal cluster selection
Dimensionality reduction for visualization
Comprehensive cluster analysis
Feature importance ranking
New sample prediction capability
7. Analysis Components:
Cluster vs actual diagnosis comparison
Cluster characteristic analysis
Feature importance visualization
Distance-based prediction confidence
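Because the true diagnosis labels are available here, the cluster-versus-diagnosis agreement can also be summarized with a single number, for example the adjusted Rand index (a short sketch reusing y and cluster_labels from the program above):
from sklearn.metrics import adjusted_rand_score

# 1.0 means the clusters reproduce the diagnosis exactly; values near 0 mean chance-level agreement
ari = adjusted_rand_score(y, cluster_labels)
print(f"Adjusted Rand index (clusters vs diagnosis): {ari:.3f}")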