MACHINE LEARNING R23 Material
  • Introduction to Machine Learning
  • Evolution of Machine Learning
  • Paradigms or Types of Machine Learning
  • Learning by Induction
  • Reinforcement Learning
  • Nearest Neighbor-Based Models in Machine Learning
  • Models Based on Decision Trees
  • Linear Discriminants
  • Clustering

MACHINE LEARNING R23

UNIT-1
Introduction to Machine Learning

Machine learning (ML) is a subset of artificial intelligence (AI) that focuses on enabling
systems to learn from data and make decisions or predictions without being explicitly
programmed. Rather than following predetermined rules, ML algorithms allow computers to
recognize patterns, make inferences, and improve their performance over time through
experience.

1. What is Machine Learning?

Machine learning refers to the field of study that gives computers the ability to learn from
data, identify patterns, and make decisions with minimal human intervention. This involves
using algorithms to analyze and model data to make predictions or decisions.

2. Types of Machine Learning

Machine learning can be categorized into three primary types:

1. Supervised Learning:
o In supervised learning, the model is trained on labeled data (data that has both
input and corresponding output). The algorithm learns by example and
generalizes patterns to make predictions on new, unseen data.
o Example: Classifying emails as "spam" or "not spam."
2. Unsupervised Learning:
o In unsupervised learning, the model is given data without labels. The goal is to
find hidden patterns or structures within the data.
o Example: Customer segmentation, where a model groups customers based on
their purchasing behavior without predefined categories.
3. Reinforcement Learning:
o In reinforcement learning, an agent learns to make decisions by performing
actions in an environment to maximize a cumulative reward. The agent
receives feedback in the form of rewards or penalties based on its actions.
o Example: Training a robot to navigate through a maze.
4. Semi-supervised Learning (Optional category):
o This is a hybrid between supervised and unsupervised learning. It uses a small
amount of labeled data and a large amount of unlabeled data.
o Example: Image recognition where only a few images are labeled, but the
model can learn from a large amount of unlabeled images.

3. Common Algorithms in Machine Learning

Some well-known machine learning algorithms include:

1. Linear Regression:
o Used for predicting continuous values, such as house prices based on features
like size, location, etc.
2. Decision Trees:
o A model that makes decisions based on answering questions (features) and
traversing a tree structure to reach a prediction.
3. Random Forest:
o An ensemble method that uses multiple decision trees to improve performance
and reduce overfitting.
4. K-Nearest Neighbors (KNN):
o A classification algorithm that makes predictions based on the majority class
of the nearest data points in the feature space.
5. Support Vector Machines (SVM):
o A powerful classifier that tries to find the optimal hyperplane separating
classes in a high-dimensional space.
6. Neural Networks:
o Inspired by the human brain, these networks consist of layers of
interconnected nodes (neurons) and are used in deep learning applications like
image and speech recognition.
7. K-means Clustering:
o A method for unsupervised learning where the algorithm groups data into k
clusters based on feature similarity.

4. Steps in Machine Learning

The typical workflow of a machine learning project involves several key stages:

1. Data Collection: Gathering the relevant data for the problem you're solving.
2. Data Preprocessing: Cleaning and preparing the data for modeling, including
handling missing values, normalizing, and encoding categorical variables.
3. Model Selection: Choosing the appropriate machine learning algorithm.
4. Training the Model: Feeding the training data into the model to allow it to learn
from the data.
5. Evaluation: Assessing the model's performance using metrics like accuracy,
precision, recall, or mean squared error (for regression).
6. Hyperparameter Tuning: Adjusting the model parameters for optimal performance.
7. Deployment: Integrating the trained model into a real-world application.
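The workflow above can be sketched end to end in a few lines (a minimal illustration, assuming scikit-learn is available; the built-in Iris dataset stands in for collected data):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1-2. Data collection and preprocessing: load data, hold out a test set, scale features
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3-4. Model selection and training: a scaled logistic-regression pipeline
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# 5. Evaluation on the held-out test set
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {accuracy:.2f}")
```

Hyperparameter tuning (e.g., a grid search over the pipeline) and deployment would then operate on this same fitted model.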

5. Applications of Machine Learning

Machine learning is already transforming various industries, including:

 Healthcare: Diagnosing diseases, personalized medicine, drug discovery.


 Finance: Fraud detection, algorithmic trading, credit scoring.
 Marketing: Customer segmentation, recommendation systems, targeted advertising.
 Transportation: Autonomous vehicles, traffic prediction.
 Retail: Inventory management, demand forecasting, chatbots.

6. Challenges in Machine Learning


Some challenges that arise in ML include:

 Data Quality: Poor or biased data can lead to inaccurate models.


 Overfitting and Underfitting: A model that fits the training data too well may not
generalize well to new data, and a model that underfits may miss important patterns.
 Interpretability: Some models, particularly deep learning models, can be "black
boxes," making it difficult to understand how they make decisions.
 Computational Resources: Large datasets and complex models require substantial
computing power.

7. Tools and Libraries

Popular libraries and frameworks for machine learning include:

 Scikit-learn: A Python library for traditional machine learning algorithms.


 TensorFlow and Keras: Libraries for building neural networks and deep learning
models.
 PyTorch: A popular deep learning library for research and production.
 XGBoost: A library used for gradient boosting algorithms in competitions.

1. Evolution of Machine Learning

The evolution of machine learning (ML) can be traced through various stages, each
representing a significant leap in our understanding and capability to make machines
"intelligent." From its early theoretical foundations to the modern-day applications powered
by deep learning, machine learning has undergone a profound transformation. Below is an
overview of the key milestones in the evolution of machine learning:

1. Early Foundations (1940s–1950s)

The roots of machine learning can be traced back to the early days of computing and artificial
intelligence (AI):

 Turing's Work (1936-1937): The British mathematician Alan Turing laid the
groundwork for theoretical computing with the concept of the Turing machine.
 Neural Networks and Perceptrons (1950s): In the 1950s, early work on neural
networks began with the creation of the Perceptron by Frank Rosenblatt.

2. Symbolic AI and Rule-Based Systems (1950s–1970s)

 Rule-based AI: During the 1950s to 1970s, AI research was dominated by symbolic
approaches.
• Early ML Algorithms: Researchers began exploring algorithms like decision trees and clustering methods, though the field was still in its infancy.

3. The AI Winter (1970s–1990s)

Despite early successes, progress in AI and machine learning slowed significantly during this
period due to limited computational resources and overly optimistic expectations:

 Challenges in Data and Computing: The limitations of computers at the time, both
in terms of memory and processing power, constrained the development of more
advanced ML algorithms. Additionally, AI and machine learning models struggled to
perform well in real-world, noisy data scenarios.
 AI Winter: This term refers to a period of reduced funding and interest in AI research
during the late 1970s to early 1990s, as results from early ML models did not live up
to expectations.

4. Revival and Statistical Learning (1990s–2000s)

The 1990s saw a resurgence in machine learning, driven by the development of statistical
methods, the increase in computational power, and the availability of larger datasets:

 Introduction of Support Vector Machines (SVMs): In the 1990s, algorithms like


SVMs were developed, offering powerful methods for classification tasks.
 Bayesian Networks and Probabilistic Models: Researchers developed new
approaches based on probabilistic reasoning.
 Neural Networks and Backpropagation: While neural networks had been explored
earlier, the backpropagation algorithm in the 1980s (further developed in the 1990s)
enabled multi-layer networks to learn more complex patterns and drove interest in
deep learning.
 Reinforcement Learning: The concept of learning by interacting with an
environment and maximizing rewards.

5. Data-Driven Approaches and Deep Learning (2010s–Present)

The 2010s saw significant breakthroughs in machine learning, particularly in the area of deep
learning:

• Big Data: The explosion of digital data, together with cheaper storage and greater computing power, supplied the large datasets that complex models require.
• Rise of Deep Learning: Deep learning, built on neural networks with many layers, became the dominant approach:
o ImageNet Breakthrough (2012): The ImageNet competition marked a
pivotal moment when deep learning models, especially convolutional neural
networks (CNNs), drastically outperformed traditional machine learning
algorithms in image classification tasks. This achievement sparked widespread
interest in deep learning.
 Natural Language Processing (NLP) and Transformer Models: In the field of
NLP, algorithms like Word2Vec and later transformers (such as BERT and GPT)
revolutionized language understanding and generation, allowing machines to achieve
human-level performance on tasks like translation, question answering, and text
generation.
 Reinforcement Learning Advancements: Reinforcement learning, notably through
deep Q-learning (DeepMind's DQN playing Atari games) and AlphaGo's victories at Go, reached new heights, solving
complex decision-making problems.

2. Paradigms or Types of Machine Learning

Machine learning (ML) can be broadly categorized into different paradigms based on how the
algorithms learn from data and the nature of the problem they aim to solve. Each paradigm
has distinct approaches, techniques, and applications. The most common paradigms are
supervised learning, unsupervised learning, reinforcement learning, and semi-
supervised learning. Below is a detailed explanation of each paradigm:

1. Supervised Learning

Definition: In supervised learning, the model is trained using labeled data. The algorithm
learns the relationship between input data (features) and their corresponding outputs (labels)
during training, and then generalizes this knowledge to make predictions on new, unseen
data.

 How it Works:
o Training Data: Consists of input-output pairs where the output (label) is
known.
o Objective: The goal is to learn a mapping function f(x) that maps inputs x to outputs y, so the model can predict the output for new, unseen inputs.
 Key Algorithms:
o Linear Regression: For predicting continuous values.
o Logistic Regression: For binary classification problems.
o Decision Trees: Used for both classification and regression tasks.
o Support Vector Machines (SVM): Classification algorithm that finds the
optimal hyperplane.
o K-Nearest Neighbors (KNN): A simple algorithm that classifies data based
on the majority label of its nearest neighbors.
o Random Forests: An ensemble method that uses multiple decision trees for
more accurate predictions.
o Neural Networks: A powerful approach used in deep learning for tasks like
image and speech recognition.
 Applications:
o Image classification (e.g., identifying cats and dogs in pictures).
o Email spam detection.
o Sentiment analysis (classifying reviews as positive or negative).

2. Unsupervised Learning

Definition: In unsupervised learning, the model is provided with data that has no labels. The
goal is to identify underlying patterns or structures in the data without predefined outputs.

 How it Works:
o Training Data: Contains only input data, and the algorithm must find hidden
patterns or structures within this data.
o Objective: The model aims to explore the data and organize it into clusters,
dimensions, or other structures.
 Key Algorithms:
o Clustering:
 K-Means Clustering: Groups data into k clusters based on similarity.
 Hierarchical Clustering: Builds a tree of clusters.
 DBSCAN: Identifies clusters of varying shapes based on density.
o Dimensionality Reduction:
 Principal Component Analysis (PCA): Reduces the number of
features while retaining variance.
 t-Distributed Stochastic Neighbor Embedding (t-SNE): A technique
for visualizing high-dimensional data.
o Association Rules:
 Apriori: Used for market basket analysis, discovering items that
frequently co-occur in transactions.
 Applications:
o Customer segmentation (grouping customers based on purchasing behavior).
o Anomaly detection (detecting fraud or network intrusions).
o Topic modeling (grouping documents into topics based on word patterns).
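As a small illustration (assuming NumPy and scikit-learn are available), K-means recovers two clearly separated groups of points without ever seeing a label:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D data: two well-separated groups of points
rng = np.random.default_rng(0)
group_a = rng.normal(loc=(0, 0), scale=0.5, size=(20, 2))
group_b = rng.normal(loc=(5, 5), scale=0.5, size=(20, 2))
X = np.vstack([group_a, group_b])

# Group the unlabeled points into k = 2 clusters by feature similarity
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Points drawn from the same group should land in the same cluster
print(kmeans.labels_)
```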

3. Reinforcement Learning
Definition: Reinforcement learning (RL) is a paradigm where an agent learns to make
decisions by interacting with an environment to maximize a cumulative reward. The agent
does not know the right actions initially but learns from feedback after each action it takes.

 How it Works:
o Agent: The decision-making entity that interacts with the environment.
o Environment: The world in which the agent operates.
o State: The current situation or configuration of the environment.
o Action: The choices the agent can make in the environment.
o Reward: The feedback the agent receives after taking an action. The objective
is to maximize the total reward over time.
o Policy: A strategy the agent uses to determine actions based on states.
o Value Function: A function that estimates the expected return (reward) for
each state.
 Key Algorithms:
o Q-Learning: A model-free algorithm that learns the value of actions in
different states.
o Deep Q Networks (DQN): Combines Q-learning with deep neural networks
for complex environments.
o Policy Gradient Methods: Directly optimize the policy, used in algorithms
like Proximal Policy Optimization (PPO).
o Actor-Critic Methods: Combines policy-based and value-based methods.
 Applications:
o Game playing (e.g., AlphaGo, chess, and video games).
o Robotics (e.g., teaching robots to walk or pick objects).
o Autonomous vehicles (e.g., self-driving cars making navigation decisions).
o Healthcare (e.g., personalized treatment recommendations).
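The agent/environment loop above can be sketched with tabular Q-learning on a hypothetical five-state corridor (pure Python; the environment and all parameter values are made up for illustration):

```python
import random

# A minimal tabular Q-learning sketch: states 0..4, actions 0 = left, 1 = right;
# reaching state 4 earns reward 1 and ends the episode.
random.seed(0)
n_states, n_actions = 5, 2
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.5, 0.9, 0.1

def step(state, action):
    """Environment dynamics: move left/right, clamped to the corridor ends."""
    nxt = max(0, min(n_states - 1, state + (1 if action == 1 else -1)))
    return nxt, (1.0 if nxt == n_states - 1 else 0.0)

for _ in range(500):                                   # episodes
    state = 0
    while state != n_states - 1:
        if random.random() < epsilon:                  # explore
            action = random.randrange(n_actions)
        else:                                          # exploit (ties broken randomly)
            best = max(Q[state])
            action = random.choice([a for a in range(n_actions) if Q[state][a] == best])
        nxt, reward = step(state, action)
        # Q-learning update: nudge Q toward reward + discounted best next value
        Q[state][action] += alpha * (reward + gamma * max(Q[nxt]) - Q[state][action])
        state = nxt

# After training, the greedy policy should choose "right" in every non-terminal state
policy = [max(range(n_actions), key=lambda a: Q[s][a]) for s in range(n_states - 1)]
print(policy)
```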

4. Semi-Supervised Learning

Definition: Semi-supervised learning is a paradigm that falls between supervised and unsupervised learning. The model is trained with a small amount of labeled data and a large amount of unlabeled data. The goal is to leverage the unlabeled data to improve the learning process.

 How it Works:
o Training Data: Consists of both labeled and unlabeled data.
o Objective: The algorithm attempts to use the small amount of labeled data to
make sense of the unlabeled data and make more accurate predictions or
classifications.
 Key Algorithms:
o Semi-Supervised Support Vector Machines (S3VM): An extension of SVM
that works with both labeled and unlabeled data.
o Self-training: An iterative approach where the model initially trains on the
labeled data, and then iteratively labels the unlabeled data to improve
performance.
o Graph-based Methods: Use relationships (edges) between labeled and
unlabeled data points to propagate labels.
 Applications:
o Image recognition: Labeled images may be scarce, but large unlabeled
datasets can help improve accuracy.
o Speech recognition: Labeled audio data may be limited, but using large
amounts of unlabeled audio can enhance training.
o Medical diagnostics: Labeling medical images can be time-consuming and
expensive, so semi-supervised techniques help utilize the available data
effectively.

5. Self-Supervised Learning (Emerging Paradigm)

Definition: Self-supervised learning is a technique where a model generates its own labels
from the data itself. It creates a pretext task (an auxiliary task) to learn useful representations
from unlabeled data, which can then be fine-tuned for specific downstream tasks.

 How it Works:
o Pretext Task: A task that is designed so the model learns useful features from
unlabeled data. For instance, predicting missing parts of an image, filling in
missing words in a sentence, or predicting the next frame in a video.
o Transfer Learning: The representations learned from the pretext task are
transferred to solve other tasks that require supervised learning.
 Key Algorithms:
o Contrastive Learning: A method where the model learns by contrasting
positive and negative samples (e.g., SimCLR, MoCo).
o Masked Language Models (MLMs): In NLP, models like BERT are pre-
trained on tasks like predicting masked words to understand language
representations.
 Applications:
o Natural Language Processing: Pretraining language models (e.g., GPT,
BERT) using self-supervised learning.
o Computer Vision: Pretraining models for tasks like object detection using
unlabeled images.
o Robotics: Learning representations from raw sensory data without explicit
supervision.

3. Learning by Rote:

• Definition: Learning by rote refers to memorization or learning through repetition, without necessarily understanding the underlying concepts or principles.
 Example: Memorizing the multiplication table, where the learner repeatedly goes
over the table until they can recall it without effort.
 Limitations: While rote learning can lead to quick recall, it doesn’t foster deeper
understanding or problem-solving skills.

4. Learning by Induction in Machine Learning

In machine learning, learning by induction refers to the process where a model generalizes
from specific examples (data points) to broader principles or patterns. Inductive learning is
central to most machine learning techniques, where algorithms make predictions or
classifications based on previously observed data. This approach contrasts with deductive
learning, where conclusions are drawn from known rules or facts.

Key Concepts of Inductive Learning:

1. Generalization:
o The primary goal of inductive learning is to generalize from a set of training
data to make predictions on unseen examples. The idea is that patterns or
relationships found in the training data will hold for new, similar data points.
o Example: If a model learns that "birds typically fly" from a training dataset of
various bird species, it might generalize that all birds can fly, although there
might be exceptions like penguins.
2. Hypothesis Formation:
o Inductive learning involves forming a hypothesis or a model that captures the
general relationships in the data. This hypothesis is derived from specific
examples.
o Example: In supervised learning, a model might form a hypothesis (or rule)
like “if an animal has feathers and a beak, it is a bird” based on training data
containing labeled examples of animals.
3. Overfitting and Underfitting:
o Overfitting occurs when the model learns the noise or specific details of the training data too well, resulting in poor performance on new, unseen data. It implies the model has become too complex and fails to generalize.
o Underfitting happens when the model is too simple to capture the underlying
patterns in the data, leading to poor performance even on training data.
4. Types of Inductive Learning Algorithms:
o Decision Trees: Algorithms like ID3 and C4.5 induce decision rules from
examples, which can be used for classification or regression tasks.
o Neural Networks: These models learn by adjusting weights based on training
data to generalize patterns for tasks like image classification or speech
recognition.
o Support Vector Machines (SVM): SVMs learn a decision boundary by
inducing patterns in the feature space to separate data into different classes.
o k-Nearest Neighbors (k-NN): This algorithm makes predictions based on the
majority class of the closest training data points, inducing general rules about
the relationships between data points in the feature space.
5. Inductive Bias:
o Inductive bias refers to the set of assumptions a learning algorithm makes to
generalize from the training data to unseen data. The nature of the inductive
bias affects how the model learns and how well it can generalize.
o Example: A decision tree algorithm assumes that the data can be split into
categories using binary decisions. This bias can work well for certain tasks but
may not perform well if the data cannot be easily split.
6. Inductive Learning Process:
o Training Phase: The model is provided with a set of examples, and it learns
from this data by creating a hypothesis (model).
o Testing Phase: The learned model is evaluated on unseen examples to check
how well it generalizes.
o Prediction: Once the model is trained and validated, it can be used to predict
the output for new inputs.
7. Example of Inductive Learning in Practice:
o Spam Email Classification:
 In a supervised inductive learning task, a model might be trained on a
set of emails labeled as "spam" or "not spam". The algorithm
generalizes patterns (e.g., words like "free" or "offer") to create a
hypothesis about which future emails are likely to be spam.
8. Advantages of Inductive Learning:
o Scalability: Inductive learning algorithms can scale to handle large datasets
and automatically infer patterns.
o Flexibility: These models can adapt to different types of data and problems,
whether the task is classification, regression, or clustering.
o Automation: Inductive learning models reduce the need for manual rule-
making by learning patterns automatically from data.
9. Challenges:
o Overfitting: As the model tries to generalize, it might learn irrelevant patterns
that don't apply to new data.
o Bias in the Data: If the training data is biased, the learned hypothesis will also
be biased and inaccurate when applied to real-world data.
o Complexity: Some inductive learning models, especially deep learning, can
become complex and computationally expensive.

 Definition: Inductive learning involves generalizing from specific examples to
broader principles. It's the process of inferring a general rule or pattern from particular instances or observations.
 Example: After observing several instances of birds flying, one might generalize the
rule that “all birds can fly.”
 Application: Inductive learning is widely used in machine learning, where algorithms
infer general patterns from training data.
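As a concrete illustration of hypothesis formation, a small decision tree induces readable if/then rules from labeled examples (a sketch assuming scikit-learn is available; the feature names are supplied for readability):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Induce a general hypothesis (a small rule tree) from specific labeled examples
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# The induced if/then rules generalize beyond the exact rows they were learned from
rules = export_text(tree, feature_names=["sepal_len", "sepal_wid", "petal_len", "petal_wid"])
print(rules)
```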

5. Reinforcement Learning:

 Definition: Reinforcement learning (RL) is a type of machine learning where an agent
learns by interacting with an environment and receiving rewards or punishments. The agent aims to maximize its cumulative reward over time.
 Key Concepts:
o Agent: The learner or decision maker.
o Environment: The external system the agent interacts with.
o Action: The decision made by the agent.
o Reward: Feedback from the environment.
o Policy: Strategy to determine the next action based on the current state.
 Example: A robot learning to navigate a maze by trial and error, receiving positive
feedback (reward) for moving closer to the goal and negative feedback (punishment)
for moving away from it.

6. Types of Data:

 Structured Data: Data that is organized in rows and columns, often found in
databases (e.g., spreadsheets, SQL databases).
 Unstructured Data: Data that lacks a predefined format (e.g., text, images, audio).
 Semi-Structured Data: Data that has some structure but is not fully organized (e.g.,
JSON, XML files).

7. Matching:

 Definition: Matching in machine learning refers to the task of comparing different
data points to identify similarities or differences.
 Examples:
o Pattern Matching: Identifying patterns in text, such as finding similar words
or phrases.
o Image Matching: Comparing features of images to find similarities or
objects.

8. Stages in Machine Learning:

Machine learning typically involves several stages:

 Data Collection: Gathering raw data that is relevant to the problem.


 Data Preprocessing: Cleaning and transforming data into a usable format.
 Feature Engineering: Selecting and creating features from raw data that will be used
by the model.
 Model Selection: Choosing the appropriate algorithm or model for the task.
 Model Training: Using training data to teach the model.
 Model Evaluation: Assessing the model’s performance on unseen data
(validation/test sets).
 Model Prediction: Using the trained model to make predictions on new data.

9. Data Acquisition:

 Definition: Data acquisition refers to the process of gathering data from various
sources, such as sensors, databases, or web scraping.
 Methods:
o Manual Collection: Collecting data by hand or using traditional methods.
o Automated Collection: Using software or scripts to gather data automatically.
o Public Datasets: Utilizing open-source datasets available for research or
development purposes.

10. Feature Engineering:


 Definition: Feature engineering is the process of selecting, modifying, or creating
new features from raw data to improve the performance of machine learning models.
 Techniques:
o Feature Selection: Choosing the most relevant features.
o Feature Transformation: Scaling, normalizing, or encoding features.
o Feature Extraction: Creating new features from existing ones, such as
extracting text sentiment or converting time into day parts.

11. Data Representation:

 Definition: Data representation refers to how data is structured and stored so that
machine learning models can interpret it effectively.
 Types:
o Numerical Representation: Using numbers to represent data, like integers or
floating-point numbers.
o Categorical Representation: Using categories or labels, often transformed
into numerical form using techniques like one-hot encoding.
o Vector Representation: Representing data in vectorized form, often used in
natural language processing or image processing.
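For example, one-hot encoding can be sketched by hand in a few lines (no library assumed; the color values are made up):

```python
# Hand-rolled one-hot encoding of a small categorical feature
colors = ["red", "green", "blue", "green"]
categories = sorted(set(colors))                # ['blue', 'green', 'red']
one_hot = [[1 if c == cat else 0 for cat in categories] for c in colors]
print(one_hot)  # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```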

12. Model Selection:

 Definition: Model selection involves choosing the right machine learning model or
algorithm based on the problem type and the characteristics of the data.
 Factors to consider:
o Type of Data: Structured, unstructured, time-series, etc.
o Task Type: Classification, regression, clustering, etc.
o Performance Metrics: Accuracy, precision, recall, F1 score, etc.
o Complexity: Simpler models may be more interpretable but less powerful,
whereas complex models may have better accuracy but be harder to interpret.

13. Model Learning:

 Definition: Model learning refers to the process of training a machine learning model
by feeding it data so that it can learn patterns or relationships in the data.
 Methods:
o Supervised Learning: Learning with labeled data (e.g., classification,
regression).
o Unsupervised Learning: Learning from unlabeled data (e.g., clustering,
dimensionality reduction).
o Semi-supervised Learning: A mix of labeled and unlabeled data.
o Reinforcement Learning: Learning through trial and error.

14. Model Evaluation:

 Definition: Model evaluation is the process of assessing the performance of a model
using various metrics and validation techniques.
 Methods:
o Cross-validation: Dividing data into subsets to train and test the model
multiple times.
o Confusion Matrix: A matrix to evaluate the performance of classification
algorithms.
o Accuracy, Precision, Recall, F1-score: Metrics used to evaluate
classification models.
o Mean Absolute Error (MAE), Mean Squared Error (MSE): Metrics used
for regression models.
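The classification metrics above can be computed on a toy set of predictions (a sketch assuming scikit-learn is available; the labels are made up, with 1 as the positive class):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Toy binary labels (1 = spam): compare true classes with a model's predictions
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))   # rows = true class, columns = predicted
print(precision_score(y_true, y_pred))    # 3 / (3 + 1) = 0.75
print(recall_score(y_true, y_pred))       # 3 / (3 + 1) = 0.75
print(f1_score(y_true, y_pred))           # 0.75
```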

15. Model Prediction:

 Definition: Model prediction refers to the process of using a trained model to make
predictions on new, unseen data.
 Types:
o Classification: Predicting the class label of an input (e.g., spam or not spam).
o Regression: Predicting a continuous value (e.g., house prices).
o Clustering: Grouping data into clusters based on similarities.

16. Search and Learning:

 Search: Involves exploring a problem space to find the best solution. It can be applied
in algorithms like search trees or optimization tasks.
o Types:
 Depth-first Search (DFS): Explores as far as possible down one
branch before backtracking.
 Breadth-first Search (BFS): Explores all nodes at the present depth
level before moving on to nodes at the next depth level.
 Learning: Machine learning algorithms improve through experience, refining their
models and predictions over time.
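DFS and BFS can be sketched on a small made-up graph (pure Python):

```python
from collections import deque

# A small directed graph as an adjacency list (hypothetical example)
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}

def bfs(start):
    """Visit nodes level by level using a FIFO queue."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order

def dfs(start):
    """Follow one branch as deep as possible using a LIFO stack."""
    seen, order, stack = set(), [], [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        stack.extend(reversed(graph[node]))  # preserve left-to-right exploration
    return order

print(bfs("A"))  # ['A', 'B', 'C', 'D']
print(dfs("A"))  # ['A', 'B', 'D', 'C']
```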

17. Data Sets:

 Definition: A data set is a collection of related data points organized for analysis,
training, or testing purposes.
 Types:
o Training Set: Used to train a model.
o Test Set: Used to evaluate the model’s performance.
o Validation Set: Used for tuning hyperparameters or for model selection.
 Examples:
o MNIST: A dataset of handwritten digits used for image recognition tasks.
o Iris: A classic dataset in machine learning for classification tasks.
UNIT-2
Nearest Neighbor-Based Models in Machine Learning

Nearest neighbor-based models, especially the k-nearest neighbors (k-NN) algorithm, are a
fundamental class of machine learning techniques that rely on the idea of proximity or
similarity between data points for making predictions. These models are used in both
classification and regression tasks and are based on the idea that similar points are likely to
have similar labels or values. Below is a detailed breakdown of the key concepts related to
nearest neighbor-based models.

Nearest Neighbor-Based Models are non-parametric machine learning techniques that
classify data points or predict values based on the similarity (or distance) to nearby data points. Here's a detailed explanation of key concepts related to these models:

1. Introduction to Nearest Neighbor Models

 Nearest Neighbor methods rely on the idea that similar instances have similar outputs.
 Predictions are based on the proximity of a query point to data points in the training set.
 Common distance metrics:
o Euclidean Distance: d(x, x') = √(Σ_{i=1}^{n} (x_i − x'_i)^2)
o Manhattan Distance: d(x, x') = Σ_{i=1}^{n} |x_i − x'_i|
o Cosine Similarity: Measures the cosine of the angle between two vectors.

2. K-Nearest Neighbors (K-NN) for Classification

 K-NN is a simple, instance-based algorithm that classifies a data point based on the majority
label of its k nearest neighbors.
• Algorithm:
1. Choose a value for k (number of neighbors).
2. Compute the distances between the query point and all points in the training set.
3. Identify the k closest points.
4. Assign the class label most common among these k points.
• Key Considerations:
o A larger k smooths the decision boundary (reduces variance but may increase bias).
o A small k may lead to overfitting.
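The four algorithm steps above can be sketched from scratch in a few lines (pure Python; the toy points and labels are made up):

```python
import math
from collections import Counter

def knn_classify(query, X_train, y_train, k=3):
    """From-scratch sketch of the K-NN classification steps above."""
    # Step 2: Euclidean distance from the query to every training point
    dists = [math.dist(query, x) for x in X_train]
    # Step 3: indices of the k closest points
    nearest = sorted(range(len(dists)), key=dists.__getitem__)[:k]
    # Step 4: majority vote among their labels
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

# Two hypothetical clusters of labeled points
X_train = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
y_train = ["A", "A", "A", "B", "B", "B"]
print(knn_classify((2, 2), X_train, y_train))  # A
print(knn_classify((8, 8), X_train, y_train))  # B
```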

3. K-NN for Regression

 Predicts the value of a query point as the average (or weighted average) of the values of its
k nearest neighbors: ŷ = (1/k) Σ_{i=1}^{k} y_i
o Weighted K-NN: Weights closer points more heavily: ŷ = (Σ_{i=1}^{k} w_i y_i) / (Σ_{i=1}^{k} w_i), where w_i = 1 / d(x, x_i)
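Both the plain and the weighted variants can be sketched directly from the formulas above (pure Python; the toy data points are made up):

```python
import math

def knn_regress(query, X_train, y_train, k=3, weighted=False):
    """Sketch of (weighted) K-NN regression as defined above."""
    # Keep the k training points closest to the query, with their target values
    nearest = sorted((math.dist(query, x), y) for x, y in zip(X_train, y_train))[:k]
    if weighted:
        # Inverse-distance weights; assumes the query is not an exact training point
        weights = [1.0 / d for d, _ in nearest]
        return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)
    return sum(y for _, y in nearest) / k

X_train = [(1,), (2,), (3,), (10,)]
y_train = [1.0, 2.0, 3.0, 10.0]
print(knn_regress((2.5,), X_train, y_train, k=3))                 # plain average -> 2.0
print(knn_regress((2.5,), X_train, y_train, k=3, weighted=True))  # closer points count more
```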

4. Distance Metrics

 The choice of distance metric can significantly impact model performance:
o Euclidean: Best for continuous, normalized data.
o Manhattan: Works well for high-dimensional or grid-like data.
o Minkowski: Generalized distance metric: d(x, x') = (Σ_{i=1}^{n} |x_i − x'_i|^p)^(1/p). Special cases: p = 2 (Euclidean), p = 1 (Manhattan).
o Hamming: Suitable for categorical data.

5. Strengths and Weaknesses of K-NN

Strengths:

 Simple and Intuitive: No training phase (lazy learning).


 Versatile: Can handle classification and regression tasks.
 Adaptable: Works with different distance metrics.

Weaknesses:

 High Computation Cost: Requires computing distances to all training points at inference.
 Memory Intensive: Stores the entire training set.
 Curse of Dimensionality: Performance deteriorates in high-dimensional spaces.
 Sensitive to Noise: Outliers can affect predictions.

6. Applications of Nearest Neighbor Models

 Pattern Recognition: Handwriting and facial recognition.


 Recommender Systems: Collaborative filtering.
 Anomaly Detection: Identifying outliers.
 Medical Diagnosis: Classifying symptoms.

7. Improving K-NN

Scaling Features:

 Normalize or standardize features to ensure fair distance calculations.

Dimensionality Reduction:

 Techniques like PCA (Principal Component Analysis) reduce noise and improve performance.
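As a concrete illustration of feature scaling, the sketch below standardizes each feature column to zero mean and unit variance; the `standardize` helper and the toy age/income values are invented for illustration:

```python
def standardize(columns):
    """Scale each feature column to zero mean and unit variance.

    `columns` is a list of lists, one inner list per feature. A feature
    measured in large units (e.g. income) would otherwise dominate the
    distance calculation in K-NN.
    """
    scaled = []
    for col in columns:
        mean = sum(col) / len(col)
        var = sum((v - mean) ** 2 for v in col) / len(col)
        std = var ** 0.5 or 1.0  # guard against constant columns
        scaled.append([(v - mean) / std for v in col])
    return scaled

ages = [20, 30, 40]              # small-range feature
incomes = [20000, 30000, 40000]  # large-range feature
scaled_ages, scaled_incomes = standardize([ages, incomes])
# After scaling, both features span the same range (roughly -1.22 .. 1.22),
# so neither dominates the distance computation.
print(scaled_ages)
print(scaled_incomes)
```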

Introduction to Proximity Measures

Proximity measures refer to techniques that quantify the similarity or closeness between two
data points. The idea is that if two data points are close to each other in the feature space,
they are more likely to belong to the same class (in classification) or have similar values (in
regression). These proximity measures are essential for nearest neighbor algorithms.

Distance Measures

Distance measures are specific types of proximity measures that calculate the physical
distance between data points in a multi-dimensional feature space. The choice of distance
measure can have a significant impact on the performance of the nearest neighbor algorithm.

1. Euclidean Distance:
o Formula: d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
o The most common and widely used distance measure.
o It calculates the straight-line distance between two points in Euclidean space.
2. Manhattan Distance (also known as L1 distance or Taxicab Distance):
o Formula: d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} |x_i - y_i|
o The sum of the absolute differences between corresponding components.
3. Minkowski Distance:
o A generalization of Euclidean and Manhattan distance.
o Formula: d(\mathbf{x}, \mathbf{y}) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}
o When p = 2, it is Euclidean distance; when p = 1, it is Manhattan distance.
4. Cosine Similarity:
o A measure of similarity between two non-zero vectors of an inner product space.
o Formula: \text{cosine similarity} = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\| \|\mathbf{y}\|}
o It measures the angle between the vectors, often used in text analysis or when
comparing documents.
5. Hamming Distance:
o Used for categorical data, particularly binary data.
o It counts the number of positions at which the corresponding elements of two
strings are different.

Non-Metric Similarity Functions

Not all similarity measures correspond to a "distance" in the strict mathematical sense. These
are non-metric similarity functions, often used in cases where the data is not naturally metric
(e.g., categorical or nominal data).

 Jaccard Similarity: Measures similarity between finite sample sets.

J(A, B) = \frac{|A \cap B|}{|A \cup B|}

o Used in binary and categorical data (e.g., comparing sets).

 Pearson Correlation Coefficient: Measures linear correlation between two variables.

\rho = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}

o Used in continuous data to measure similarity between variables.

Proximity Between Binary Patterns

For binary patterns (data with values of 0 or 1), the distance or similarity can be computed
using measures that specifically handle binary data:

 Hamming Distance: The number of differing bits between two binary patterns.
 Jaccard Similarity: Works well for comparing binary sets, focusing on the
intersection over the union of two sets.

Different Classification Algorithms Based on the Distance Measures

Several classification algorithms rely on proximity measures to classify data points:


1. k-Nearest Neighbors Classifier (k-NN):
o This is the most common classification algorithm based on distance measures.
o Algorithm:
1. Choose the number of neighbors k.
2. For a new data point, calculate the distance to all training data points.
3. Identify the k nearest neighbors.
4. Assign the class label based on the majority vote of the k neighbors.
2. Radius Nearest Neighbor (RNN):
o Similar to k-NN, but instead of selecting a fixed number of nearest neighbors, a
radius r is specified.
o Algorithm:
1. Choose a radius r.
2. For a new data point, find all training points within radius r.
3. Assign the class label based on the majority vote of all points within
the radius.
o This method is useful when data is sparse and you want to avoid having a
fixed k.
3. Weighted k-NN:
o A variation where closer neighbors have more influence on the decision.
o For example, use inverse distance weighting to weight closer neighbors
higher.

K-Nearest Neighbor Classifier

The k-Nearest Neighbors (k-NN) algorithm is one of the most straightforward and widely
used classification algorithms. It classifies new data points based on the majority class among
its k nearest neighbors. A worked example on a small one-dimensional dataset follows the
algorithm description below.

Algorithm Steps
Input:

 Training dataset: D = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}
 Query point: x_q
 Number of neighbors: k
 Distance metric: d(x_i, x_j) (e.g., Euclidean, Manhattan).

Procedure:

1. Compute Distances:
o Calculate the distance d(x_q, x_i) between the query point x_q and each data point
x_i in the training set.
2. Sort Neighbors:
o Sort the training points by their distance to x_q.
3. Select Nearest Neighbors:
o Select the k nearest neighbors (smallest distances).
4. Majority Vote:
o Determine the majority class label among the k nearest neighbors.
5. Output:
o Assign the query point x_q to the majority class.

3. Distance Metrics

1. Euclidean Distance (default for continuous features): d(x_i, x_j) = \sqrt{\sum_{k=1}^n (x_{ik} - x_{jk})^2}
2. Manhattan Distance: d(x_i, x_j) = \sum_{k=1}^n |x_{ik} - x_{jk}|
3. Cosine Similarity (for text or high-dimensional data): \text{Cosine similarity} = \frac{\vec{x}_i \cdot \vec{x}_j}{\|\vec{x}_i\| \|\vec{x}_j\|}
Lower similarity corresponds to greater distance.
4. Hamming Distance (for categorical data):
d(x_i, x_j) = \text{number of differing attributes between } x_i \text{ and } x_j

4. Choosing k

 Small k:
o Captures local patterns but may overfit (sensitive to noise).
 Large k:
o Smooths predictions but may underfit (ignores local structures).
 The optimal k is usually determined via cross-validation.

Example
Dataset:

Training data:

D = \{(1, A), (2, A), (4, B), (5, B), (6, A)\}

Query point: x_q = 3, with k = 3.

Steps:

1. Compute Distances:
o d(1, 3) = 2, d(2, 3) = 1, d(4, 3) = 1, d(5, 3) = 2, d(6, 3) = 3.
2. Sort Neighbors:
o Sorted by distance: (2, A), (4, B), (1, A), (5, B), (6, A).
3. Select the k = 3 Nearest Neighbors:
o Neighbors: (2, A), (4, B), (1, A).
4. Majority Vote:
o Class labels: A, B, A.
o Majority class: A.
5. Prediction:
o Assign x_q = 3 to class A.
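The worked example can be checked with a short transcription in Python (the variable names are ours):

```python
from collections import Counter

# Training data from the example: 1-D points with labels.
D = [(1, "A"), (2, "A"), (4, "B"), (5, "B"), (6, "A")]
x_q, k = 3, 3

# Steps 1-3: sort by distance to the query and keep the k closest.
neighbors = sorted(D, key=lambda p: abs(p[0] - x_q))[:k]
# Step 4: majority vote among the selected labels.
prediction = Counter(label for _, label in neighbors).most_common(1)[0][0]
print(neighbors)   # → [(2, 'A'), (4, 'B'), (1, 'A')]
print(prediction)  # → A
```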

Radius Distance Nearest Neighbor Algorithm

The Radius Nearest Neighbor (RNN) algorithm is an extension of the k-NN approach but
with a focus on the radius rather than the number of neighbors.

Algorithm Steps

Input:

 Training data: D = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}
 Query point: x_q
 Radius: r
 Distance metric: d(x_i, x_q) (e.g., Euclidean, Manhattan).

Procedure:

1. Compute Distances:
o Calculate the distance d(x_i, x_q) between the query point x_q and each data point
x_i in the training set.
2. Identify Neighbors:
o Select all points x_i such that d(x_i, x_q) ≤ r.
3. Prediction:
o Classification:
 Use a majority vote among the labels of the selected neighbors.
 If no neighbors are found within r, a default class can be assigned, or a
larger radius can be used.
o Regression:
 Compute the mean (or weighted mean) of the target values y_i of the
selected neighbors.

Output:

 Predicted class or value for x_q.

3. Pseudocode
Algorithm RDNN:
Input: Training data D = {(x1, y1), (x2, y2), ..., (xn, yn)}, query point
x_q, radius r, distance metric d.

1. Initialize an empty list of neighbors: Neighbors = []


2. For each (x_i, y_i) in D:
Compute distance: dist = d(x_i, x_q)
If dist ≤ r:
Add (x_i, y_i) to Neighbors
3. If Neighbors is empty:
Return default prediction (or use larger radius)
4. If classification:
Return majority label among Neighbors
Else if regression:
Return mean(target values of Neighbors)
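The pseudocode above can be rendered as a minimal Python sketch for the classification case; the function name and toy data are illustrative:

```python
from collections import Counter

def radius_neighbors_classify(train, x_q, r, default=None):
    """Classify x_q by majority vote over all training points within radius r.

    `train` holds (value, label) pairs with 1-D values; the absolute
    difference serves as the distance metric. Returns `default` when no
    point falls inside r (one could instead retry with a larger radius).
    """
    inside = [(x, y) for x, y in train if abs(x - x_q) <= r]
    if not inside:
        return default
    return Counter(y for _, y in inside).most_common(1)[0][0]

D = [(1, "A"), (2, "A"), (4, "B"), (5, "B"), (6, "A")]
print(radius_neighbors_classify(D, 2, r=1.5))   # → A (points 1 and 2 are inside)
print(radius_neighbors_classify(D, 10, r=1.5))  # → None (no neighbors in radius)
```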

KNN Regression

In k-Nearest Neighbors Regression (k-NN regression), the output is continuous rather than
categorical. The prediction is made based on the average (or weighted average) of the target
values of the k nearest neighbors.

Steps of the Algorithm

Input:

1. Training dataset D = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}:
o x_i represents the feature vector of the i-th data point.
o y_i represents the target value of the i-th data point.
2. Query point x_q.
3. Number of neighbors k (a hyperparameter).
4. A distance metric (e.g., Euclidean distance, Manhattan distance).
Procedure:

1. Compute Distances:
o Calculate the distance d(x_q, x_i) between the query point x_q and each training
point x_i.
2. Find the k Nearest Neighbors:
o Identify the k points with the smallest distances to x_q.
3. Aggregate Neighbor Target Values:
o Simple Average:
 Predict the target value as the mean of the target values of the k nearest
neighbors: \hat{y}_q = \frac{1}{k} \sum_{i=1}^k y_i
o Weighted Average:
 Assign weights inversely proportional to the distances:
\hat{y}_q = \frac{\sum_{i=1}^k w_i y_i}{\sum_{i=1}^k w_i}, \quad w_i = \frac{1}{d(x_q, x_i) + \epsilon}
Here, \epsilon is a small constant to prevent division by zero.

Output:

 Predicted target value \hat{y}_q for the query point x_q.

Pseudocode
Algorithm KNN-Regression:
Input: Training data D = {(x1, y1), ..., (xn, yn)}, query point x_q, number
of neighbors k, distance metric d.

1. Compute distances:
For each (x_i, y_i) in D:
dist[i] = d(x_i, x_q)

2. Sort training data by distances:


Sort D based on ascending values of dist.

3. Select k-nearest neighbors:


Neighbors = first k points from sorted D.

4. Aggregate target values:


If using simple average:
Predicted value = Mean(target values of Neighbors)
If using weighted average:
Predicted value = Weighted Mean(target values of Neighbors)

5. Return Predicted value.
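The same steps in executable form, supporting both the simple and the inverse-distance-weighted average (`knn_regress` and the toy data are illustrative, not from the text):

```python
def knn_regress(train, x_q, k, weighted=False, eps=1e-9):
    """Predict a continuous value for x_q from its k nearest 1-D neighbors.

    `train` is a list of (x, y) pairs; absolute difference is the distance.
    """
    # Steps 1-3: sort by distance and keep the k closest points.
    neighbors = sorted(train, key=lambda p: abs(p[0] - x_q))[:k]
    if not weighted:
        # Step 4 (simple average).
        return sum(y for _, y in neighbors) / k
    # Step 4 (weighted average): inverse-distance weights, eps avoids /0.
    weights = [1.0 / (abs(x - x_q) + eps) for x, _ in neighbors]
    return sum(w * y for w, (_, y) in zip(weights, neighbors)) / sum(weights)

D = [(1, 10.0), (2, 12.0), (4, 20.0), (5, 22.0)]
print(knn_regress(D, 3.5, k=2))                 # → 16.0 (mean of 20.0 and 12.0)
print(knn_regress(D, 3.5, k=2, weighted=True))  # → ~18.0 (the closer point dominates)
```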

Performance of Classifiers
The performance of nearest neighbor classifiers depends on several factors:

1. Choice of k: Too small a k can lead to noisy predictions (overfitting), while too
large a k can make the model too smooth (underfitting).
2. Distance Measure: The choice of distance measure can heavily affect the accuracy of
the classifier, especially when features are on different scales.
3. Dimensionality: As the number of features (dimensions) increases, the model can
struggle due to the curse of dimensionality.

Performance of Regression Algorithms

Evaluating the performance of regression algorithms is essential to understand how well the model
predicts continuous target variables. The choice of metric depends on the problem's context and the
specific goals, such as minimizing large errors, understanding variability, or achieving interpretable
results.

1. Common Regression Metrics

1.1 Mean Absolute Error (MAE)

 Measures the average magnitude of absolute errors between predicted values (\hat{y}_i)
and actual values (y_i).
 Formula: \text{MAE} = \frac{1}{n} \sum_{i=1}^n |y_i - \hat{y}_i|
 Advantages:
o Simple to interpret.
o Less sensitive to outliers than Mean Squared Error.
 Disadvantages:
o Doesn't penalize large errors as heavily as other metrics.

1.2 Mean Squared Error (MSE)

 Measures the average of the squared differences between predicted and actual values.
 Formula: \text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2
 Advantages:
o Penalizes large errors more than small ones.
 Disadvantages:
o Can be heavily influenced by outliers.
o Units are squared, making interpretation less intuitive.

1.3 Root Mean Squared Error (RMSE)

 The square root of MSE; expresses error in the same units as the target variable.
 Formula: \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2}
 Advantages:
o Penalizes large errors more than MAE.
o Easier to interpret than MSE.
 Disadvantages:
o Still sensitive to outliers.

1.4 R-Squared (R^2) or Coefficient of Determination

 Measures the proportion of variance in the target variable explained by the model.
 Formula: R^2 = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2}
Where \bar{y} is the mean of the actual values.
 Range:
o R^2 = 1: Perfect fit.
o R^2 = 0: Model predicts as well as the mean.
o R^2 < 0: Model performs worse than the mean prediction.
 Advantages:
o Intuitive measure of goodness-of-fit.
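All four metrics follow directly from their formulas; the helper below and the toy values are illustrative:

```python
def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R^2 for paired actual/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n           # mean absolute error
    mse = sum(e ** 2 for e in errors) / n           # mean squared error
    rmse = mse ** 0.5                               # same units as the target
    mean_y = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)            # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true) # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return mae, mse, rmse, r2

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]
mae, mse, rmse, r2 = regression_metrics(y_true, y_pred)
print(mae, mse, rmse, r2)  # → 0.25 0.125 ~0.354 0.975
```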


UNIT-3
Models Based on Decision Trees

1. Decision Trees for Classification

 Decision Trees are supervised learning models used for classification and regression tasks.
 Working Principle:
o They split the data into subsets based on the feature that provides the most
homogeneous subsets (purity of classes).
o Each internal node represents a test on a feature, branches represent the outcome
of the test, and leaves represent class labels.

2. Impurity Measures

Impurity measures quantify how "mixed" the data is in terms of target classes:

 Gini Index: Measures the likelihood of incorrect classification.

G = 1 - \sum_{i=1}^k p_i^2

Where p_i is the probability of class i. Lower Gini indicates purer nodes.

 Entropy: Measures the amount of information needed to classify a data point.

H = -\sum_{i=1}^k p_i \log_2(p_i)

Nodes with lower entropy are preferred.
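Both impurity measures translate directly into code; the probability lists used below are illustrative:

```python
import math

def gini(probs):
    """Gini index: G = 1 - sum(p_i^2)."""
    return 1 - sum(p ** 2 for p in probs)

def entropy(probs):
    """Entropy: H = -sum(p_i * log2(p_i)); 0 * log(0) is taken as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(gini([1.0, 0.0]))     # → 0.0 (pure node)
print(gini([0.5, 0.5]))     # → 0.5 (maximally mixed, two classes)
print(entropy([1.0, 0.0]))  # → 0.0 (pure node)
print(entropy([0.5, 0.5]))  # → 1.0 (one full bit of uncertainty)
```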

3. Properties of Decision Trees

 Advantages:
o Easy to interpret and visualize.
o Handles both numerical and categorical data.
o Non-parametric (does not assume a distribution of the data).

 Disadvantages:
o Prone to overfitting, especially with deep trees.
o Sensitive to small changes in data (instability).

4. Regression Based on Decision Trees

 In regression, decision trees predict continuous values.


 Splits minimize the mean squared error (MSE) or mean absolute error (MAE) to achieve
homogeneous subsets.

5. Bias–Variance Trade-off

 Bias: Error due to overly simplistic models (underfitting).


 Variance: Error due to model sensitivity to small changes in training data (overfitting).
 Decision trees tend to have low bias but high variance. Regularization methods like limiting
tree depth and pruning help reduce variance.

6. Random Forests for Classification and Regression

 A Random Forest is an ensemble of decision trees where:


o Each tree is trained on a bootstrapped sample of the data.
o Features for splitting are randomly selected at each node.
 Advantages:
o Combines multiple trees to reduce overfitting (low variance) while maintaining
accuracy (low bias).
o Robust to noise and works well with large datasets.
The Bayes Classifier

1. Introduction to the Bayes Classifier

 The Bayes classifier is based on Bayes' theorem and predicts the class that maximizes the
posterior probability: \text{Class} = \arg\max_{k} P(C_k \mid X)

2. Bayes’ Rule and Inference

Bayes' rule provides a way to update probabilities based on evidence:

P(C_k \mid X) = \frac{P(X \mid C_k) \, P(C_k)}{P(X)}

 P(C_k \mid X): Posterior probability (class given data).
 P(X \mid C_k): Likelihood (data given class).
 P(C_k): Prior probability (class before data).
 P(X): Evidence (overall data distribution).

3. The Bayes Classifier and Its Optimality

 The Bayes classifier is optimal in minimizing the classification error if the true probabilities
P(C_k \mid X) are known.
 In practice, these probabilities are often estimated from the data.

4. Multi-Class Classification

 Extends the Bayes classifier to handle multiple classes by computing posterior probabilities
for each class and choosing the one with the highest value.

5. Class Conditional Independence and Naive Bayes Classifier (NBC)

 Class Conditional Independence: Assumes that features X_1, X_2, \ldots, X_n are
independent given the class C: P(X \mid C) = \prod_{i=1}^n P(X_i \mid C)
 Naive Bayes Classifier:
o Based on the independence assumption, it calculates posterior probabilities
efficiently.
o Works well for text classification and spam filtering, despite its simplicity.
o Advantages: Fast, scalable, and effective with high-dimensional data.
o Disadvantages: Assumption of feature independence may not hold in practice.
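A minimal count-based Naive Bayes sketch for the spam-filtering use case mentioned above; the toy corpus and function names are ours, and add-one (Laplace) smoothing is used to avoid zero probabilities:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Estimate priors P(C) and per-class word counts from (words, label) pairs."""
    priors = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in docs:
        word_counts[label].update(words)
        vocab.update(words)
    return priors, word_counts, vocab, len(docs)

def classify_nb(model, words):
    priors, word_counts, vocab, n_docs = model
    best, best_score = None, float("-inf")
    for label, prior in priors.items():
        total = sum(word_counts[label].values())
        # log P(C) + sum_i log P(w_i | C): the independence assumption lets
        # us multiply (add in log space) the per-word likelihoods.
        score = math.log(prior / n_docs)
        for w in words:
            # Add-one smoothing keeps unseen words from zeroing the product.
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

docs = [(["win", "money", "now"], "spam"),
        (["free", "money"], "spam"),
        (["meeting", "tomorrow"], "ham"),
        (["project", "meeting", "notes"], "ham")]
model = train_nb(docs)
print(classify_nb(model, ["free", "money"]))        # → spam
print(classify_nb(model, ["meeting", "tomorrow"]))  # → ham
```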

UNIT-4
1. Introduction to Linear Discriminants

 Linear discriminants separate data points using a hyperplane.
 The decision boundary is defined as: w^T x + b = 0, where
w is the weight vector, x is the input, and b is the bias term.
 Applications: Classification tasks where data is linearly separable.

2. Linear Discriminants for Classification

 A linear discriminant divides the feature space into regions corresponding to
different classes.
 For two classes, a simple discriminant function is: y = \text{sign}(w^T x + b),
with y = +1 or y = -1 depending on which side of the hyperplane a point lies.

3. Perceptron Classifier

 A perceptron is a linear classifier that assigns labels based on:
y = \text{sign}(w^T x + b)
 It works for linearly separable data and uses a simple learning algorithm to find
the hyperplane.

4. Perceptron Learning Algorithm

 Goal: Adjust weights w and bias b to correctly classify all training data.
 Algorithm:
1. Initialize w = 0, b = 0.
2. For each misclassified point x_i:
 Update w: w = w + \eta y_i x_i
 Update b: b = b + \eta y_i
3. Repeat until no misclassifications. \eta is the learning rate.
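The update rule above, sketched on a small linearly separable 2-D dataset (all names and data are illustrative):

```python
def train_perceptron(data, eta=1.0, max_epochs=100):
    """Perceptron learning on 2-D data.

    `data` is a list of (x, y) with x a 2-tuple and y in {+1, -1}.
    For linearly separable data the loop is guaranteed to terminate.
    """
    w, b = [0.0, 0.0], 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            activation = w[0] * x[0] + w[1] * x[1] + b
            if y * activation <= 0:      # misclassified (or on the boundary)
                w[0] += eta * y * x[0]   # w <- w + eta * y_i * x_i
                w[1] += eta * y * x[1]
                b += eta * y             # b <- b + eta * y_i
                mistakes += 1
        if mistakes == 0:                # converged: no misclassifications
            break
    return w, b

data = [((2, 2), 1), ((3, 3), 1), ((-2, -1), -1), ((-3, -2), -1)]
w, b = train_perceptron(data)
# Every training point now lies strictly on the correct side of the hyperplane:
print(all(y * (w[0] * x[0] + w[1] * x[1] + b) > 0 for x, y in data))  # → True
```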
5. Support Vector Machines (SVMs)

 SVMs find the hyperplane that maximizes the margin (distance between the
hyperplane and the nearest points of each class).
 Objective: \text{Minimize } \frac{1}{2} \|w\|^2 \quad \text{subject to } y_i(w^T x_i + b) \geq 1 \quad \forall i
 Results in better generalization compared to perceptrons.

6. Linearly Non-Separable Case

 For non-separable data, SVMs introduce slack variables (\xi_i) to allow
some misclassification: y_i(w^T x_i + b) \geq 1 - \xi_i \quad \text{and } \xi_i \geq 0
 The minimization becomes: \text{Minimize } \frac{1}{2} \|w\|^2 + C \sum \xi_i,
where C controls the trade-off between margin width and misclassification.

7. Non-Linear SVM

 Non-linear data can be classified by mapping it to a higher-dimensional space
where it becomes linearly separable.
 Mapping function: \phi(x): \mathbb{R}^n \to \mathbb{R}^m, where m > n.

8. Kernel Trick

 Explicitly mapping x to \phi(x) can be computationally expensive.
 The kernel trick computes \phi(x_i)^T \phi(x_j) directly
using a kernel function K(x_i, x_j).
 Common kernels:
o Polynomial kernel: K(x, x') = (x^T x' + 1)^d
o Gaussian (RBF) kernel: K(x, x') = \exp(-\|x - x'\|^2 / 2\sigma^2)
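Both kernels are one-liners, and the kernel trick can be checked numerically: for the degree-2 polynomial kernel in two dimensions, an explicit feature map \phi gives exactly the same inner product. The map and the toy vectors below are illustrative:

```python
import math

def polynomial_kernel(x, z, d=2):
    """K(x, z) = (x . z + 1)^d — an implicit degree-d feature map."""
    return (sum(a * b for a, b in zip(x, z)) + 1) ** d

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian kernel K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def phi(v):
    """Explicit degree-2 feature map matching the polynomial kernel in 2-D."""
    r2 = math.sqrt(2)
    return (v[0] ** 2, v[1] ** 2, r2 * v[0] * v[1], r2 * v[0], r2 * v[1], 1.0)

x, z = (1.0, 2.0), (3.0, 0.0)
print(polynomial_kernel(x, z))  # → 16.0, i.e. (1*3 + 2*0 + 1)^2
print(rbf_kernel(x, x))         # → 1.0 (identical points)

# Kernel trick: the kernel value equals the explicit high-dimensional dot
# product, without ever constructing phi(x) during training.
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))
print(abs(explicit - polynomial_kernel(x, z)) < 1e-9)  # → True
```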

9. Logistic Regression

 Logistic regression models the probability of a binary outcome using the
sigmoid function: P(y = 1 \mid x) = \frac{1}{1 + \exp(-w^T x - b)}
 Decision boundary: w^T x + b = 0.
 Optimized using maximum likelihood estimation.

10. Linear Regression

 Linear regression predicts a continuous output using a linear combination of
input features: y = w^T x + b
 Objective: Minimize the mean squared error (MSE):
\text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - (w^T x_i + b))^2
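For a single feature, the MSE-minimizing weights have a well-known closed form, w = cov(x, y) / var(x) and b = mean(y) - w * mean(x); the helper and the sample data below are invented for illustration:

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: minimizes MSE in closed form."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope: covariance of x and y divided by variance of x.
    w = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - w * mx   # intercept makes the line pass through the means
    return w, b

# Noise-free data on the line y = 2x + 1 recovers w = 2, b = 1 exactly.
xs, ys = [1, 2, 3, 4], [3, 5, 7, 9]
w, b = fit_line(xs, ys)
print(w, b)  # → 2.0 1.0
```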

11. Multi-Layer Perceptrons (MLPs)

 An MLP is a type of neural network with:


o Input layer, one or more hidden layers, and an output layer.
o Each layer has neurons connected with weights and biases.
 Non-linear activation functions (e.g., ReLU, sigmoid) allow MLPs to model
complex relationships.

12. Backpropagation for Training an MLP

 Backpropagation adjusts weights to minimize the error (e.g., cross-entropy or


MSE):
o Forward Pass: Compute output and loss for given weights.
o Backward Pass:
 Compute gradients of the loss with respect to weights using the
chain rule.
 Update weights using gradient descent or its variants (e.g.,
Adam).
o Repeat for multiple epochs until convergence.

UNIT-5
1. Introduction to Clustering

 Clustering is an unsupervised machine learning technique used to group similar data points
into clusters based on their characteristics.
 Objective: Maximize similarity within clusters and minimize similarity between clusters.
 Applications: Market segmentation, image segmentation, document categorization, anomaly
detection.
2. Partitioning of Data

 Partitioning involves dividing a dataset into clusters based on similarity or distance metrics.
 Methods: Distance-based (e.g., Euclidean), density-based, or probabilistic.

3. Matrix Factorization

 Matrix factorization decomposes a matrix (e.g., data matrix) into two or more lower-
dimensional matrices to reveal hidden patterns or structure.
 Applications: Dimensionality reduction, recommendation systems, latent topic modeling.

4. Clustering of Patterns

 Identifies groups (clusters) in data where patterns within the same cluster are similar, and
patterns in different clusters are dissimilar.
 Uses distance metrics like Euclidean, Manhattan, or cosine similarity.

5. Divisive Clustering

 A top-down hierarchical clustering approach:


o Starts with all data in one cluster.
o Recursively splits clusters until each data point is in its own cluster or a stopping
criterion is met.
 Computationally intensive but provides a global view.

6. Agglomerative Clustering

 A bottom-up hierarchical clustering approach:


o Starts with each data point as a separate cluster.
o Merges the two closest clusters iteratively based on a linkage criterion:
 Single Linkage: Closest pair of points.
 Complete Linkage: Farthest pair of points.
 Average Linkage: Mean distance between points.
o Stops when all data points form a single cluster or a criterion is met.

7. Partitional Clustering

 Divides the dataset into non-overlapping clusters, with each data point belonging to exactly
one cluster.
 Example algorithms: K-Means, K-Medoids.
 Focus is on optimizing a predefined objective function (e.g., minimizing intra-cluster
distance).
8. K-Means Clustering

 A partitional clustering algorithm that minimizes intra-cluster variance:


o Initialize k cluster centroids randomly.
o Assign each data point to the nearest centroid.
o Update centroids as the mean of points in each cluster.
o Repeat until centroids stabilize.
 Limitations:
o Sensitive to initialization.
o Assumes spherical clusters of equal size.
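A bare-bones 1-D K-Means sketch showing the assign/update loop described above; fixed initial centroids are used instead of random ones so the run is reproducible, and all names and data are illustrative:

```python
def kmeans_1d(points, centroids, iters=10):
    """Plain K-Means on 1-D data starting from the given centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            j = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[j].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new = [sum(c) / len(c) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:   # centroids stabilized: converged
            break
        centroids = new
    return centroids, clusters

points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])
print(centroids)  # → [1.5, 10.5]
print(clusters)   # → [[1.0, 1.5, 2.0], [10.0, 10.5, 11.0]]
```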

9. Soft Partitioning and Soft Clustering

 Soft Partitioning allows data points to belong to multiple clusters with membership
probabilities.
 Soft Clustering methods, such as Fuzzy C-Means, assign degrees of membership rather than
hard assignments.

10. Fuzzy C-Means Clustering

 Extends K-Means by allowing data points to belong to multiple clusters:


o Minimize the weighted sum of squared distances:
J_m = \sum_{i=1}^n \sum_{j=1}^k u_{ij}^m \| x_i - c_j \|^2
Where:
 u_{ij} is the membership of x_i in cluster j.
 c_j is the centroid of cluster j.
 m controls the fuzziness (m > 1).
o Update centroids and memberships iteratively.

11. Rough Clustering

 Combines concepts of rough sets and clustering:


o Each cluster is represented by a lower approximation (definite members) and an
upper approximation (possible members).
 Useful when uncertainty exists in assigning points to clusters.

12. Rough K-Means Clustering Algorithm

 Extends K-Means by using rough approximations for clusters:


o Data points are assigned to either the lower approximation or the boundary.
o Reduces ambiguity in clustering while maintaining flexibility.
13. Expectation-Maximization (EM)-Based Clustering

 A probabilistic clustering algorithm used for Gaussian Mixture Models (GMMs):


o E-Step: Estimate membership probabilities based on current parameters.
o M-Step: Update parameters (e.g., means, variances) using membership
probabilities.
o Iterates until convergence.
 Advantages:
o Can model non-spherical clusters.
o Provides soft clustering.

14. Spectral Clustering

 Uses graph theory and eigenvalues for clustering:


o Constructs a similarity graph from the data.
o Computes the Laplacian matrix of the graph.
o Eigenvectors of the Laplacian are used to embed the data in a lower-dimensional
space.
o K-Means is applied to the embedded space.
 Advantages:
o Handles complex cluster shapes.
o Works well for non-convex clusters.
