
Introduction to Machine Learning

Dr. Sugata Ghosal


CSIS Off-Campus Faculty
BITS Pilani
Course Overview

• Topics covered
• Textbooks and reference books
• Evaluation components: Midsem 30%, Assignment: 30%, Comprehensive 40%
• Teaching Assistant: Dr. Madhusudhan B, Omshree Baskaran
Introduction

• What is Machine Learning


• Why Machine Learning
• Connection with Other technical areas
What is Machine Learning (ML)?

• The science (and art) of programming computers so they can learn from data

• More general definition


• Field of study that gives computers the ability to learn without being explicitly programmed

• Engineering-oriented definition
• Algorithms that improve their performance P at some task T with experience E

• A well-defined learning task is given by <P, T, E>

Traditional programming: Data + Program → Computer → Output
Machine Learning: Data + Output → Computer → Program
Defining the Learning Tasks
Improve on task T, with respect to performance metric P, based on experience E
• Example 1
• T: Playing checkers
• P: Percentage of games won against an arbitrary opponent
• E: Playing practice games against itself
• Example 2
• T: Recognizing hand-written words
• P: Percentage of words correctly classified
• E: Database of human-labeled images of handwritten words
• Example 3
• T: Driving on four-lane highways using vision sensors
• P: Average distance traveled before a human-judged error
• E: A sequence of images and steering commands recorded while observing a human driver.
• Example 4
• T: Categorize email messages as spam or legitimate.
• P: Percentage of email messages correctly classified.
• E: Database of emails, some with human-given labels
Traditional Approach to Spam Filtering
Spam typically uses words or phrases such as “4U,” “credit card,” “free,” and “amazing”

• Solution
• Write a detection algorithm for patterns that appear frequently in spam
• Test and update the detection rules until they are good enough

• Challenge
• The detection algorithm is likely to become a long list of complex rules
• Such a list is hard to maintain
Machine Learning Approach
Automatically learns which words and phrases are good predictors of spam by detecting patterns of words that are unusually frequent in spam compared to “ham” (legitimate email)

• The program is much shorter, easier to maintain, and most likely more accurate.
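A minimal sketch of the idea above, assuming whitespace tokenization and a simple smoothed frequency ratio; the function name `spam_indicators`, the `min_ratio` cut-off, and the example messages are illustrative assumptions, not part of the course material:

```python
from collections import Counter

def spam_indicators(spam_msgs, ham_msgs, min_ratio=2.0):
    """Return words whose smoothed relative frequency in spam is at least
    min_ratio times their smoothed relative frequency in ham."""
    spam_counts = Counter(w for m in spam_msgs for w in m.lower().split())
    ham_counts = Counter(w for m in ham_msgs for w in m.lower().split())
    spam_total = sum(spam_counts.values())
    ham_total = sum(ham_counts.values())
    vocab_size = len(set(spam_counts) | set(ham_counts))
    indicators = {}
    for word, count in spam_counts.items():
        # Add-one smoothing so words never seen in ham do not divide by zero.
        p_spam = (count + 1) / (spam_total + vocab_size)
        p_ham = (ham_counts[word] + 1) / (ham_total + vocab_size)
        ratio = p_spam / p_ham
        if ratio >= min_ratio:
            indicators[word] = ratio
    return indicators

# Made-up messages for illustration only.
spam = ["free credit card 4U", "amazing free offer 4U"]
ham = ["meeting at noon", "credit report attached for the meeting"]
words = spam_indicators(spam, ham)
```

Here "free" and "4u" come out as indicators, while "credit" does not because it is about equally common in both classes. A practical filter would use a proper classifier such as Naïve Bayes rather than a fixed ratio.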
More ML Usage Scenarios
Tasks that are best solved by using a learning algorithm
• Recognizing Patterns in images, text
• Facial identities or facial expressions
• Handwritten or spoken words
• Medical images

• Recognizing Anomalies in structured dataset


• Unusual credit card transactions
• Unusual patterns of sensor readings in a nuclear power plant

• Prediction from time series data


• Future stock prices or currency exchange rates

• Generating new Patterns


• Generating images or motion sequences
When to Use Machine Learning
ML Usage Scenarios
• Problems for which existing solutions require a lot of hand-tuning or long lists of rules
• Complex problems for which there is no good solution at all using a traditional approach
• Human rationale cannot be codified
• Changing environments
• Get insights about complex problems and large amounts of data
Where does ML fit in?
Applications of Machine Learning

Topics

• Types of Applications
• Application Domains
• State of the Art Applications
Two Classes of Application

• Assign/predict an object/event to an element of a given set
• If the set is discrete (e.g., integers or category labels), it is a classification problem
• If the set is the reals, it is a regression problem
• Predict a sequence of steps to achieve a goal
Application Domains

• Internet
• Computational biology
• Finance
• E-commerce
• Space exploration
• Robotics
• Information extraction
• Social networks
• Software engineering
• System management
• Creative Arts
Example: Classification
Assign object/event to one of a given finite set of categories
• Medical Diagnosis
• Credit card applications or transactions
• Fraud detection in e-commerce
• Worm detection in network packets
• Spam filtering in email
• Recommended articles in a newspaper
• Recommended books, movies, music, or jokes
• Financial investments
• DNA sequences
• Spoken words
• Handwritten letters
• Astronomical images
Example: Planning, Control, Problem Solving
Performing actions in an environment in order to achieve a goal
• Playing checkers, chess, or backgammon
• Driving a car or a jeep
• Flying a plane, helicopter, or rocket
• Controlling a character in a video game
• Controlling a mobile robot
Breakthrough in Automatic Speech Recognition

ML is used to predict phone states from the sound spectrogram

Deep learning has state-of-the-art results


# Hidden Layers       1     2     4     8     10    12
Word Error Rate (%)   16.0  12.8  11.4  10.9  11.0  11.1

Baseline (Gaussian Mixture Model) performance = 15.4%


[Reference: Zeiler et al., “On rectified linear units for speech recognition,” ICASSP 2013]
Impact of Deep Learning in Speech Technology
Visual Question Answering
Answering questions about images
Autonomous Car Technology

[Figure: Stanley autonomous car (labels: Sebastian, Stanley); components shown: path planning, laser terrain mapping, learning from human drivers, adaptive vision]
Contemporary ML Based Solutions

• Optical Character Recognition


• Product Image Classification on a production line
• Detecting tumors in brain scans
• Information extraction from document images
• Categorization of news articles
• Flagging offensive comments in discussion forums
• Document summarization
• Chatbots / Voice activated App
• Detecting Credit Card Fraud
• Customer Segmentation
• Recommendation System for products, movies, news
• Forecasting company revenue
Types of Machine Learning

Topics

• Supervised Vs. Unsupervised Vs. Semi-Supervised Vs. Reinforcement


• Batch Vs. Online Learning
• Instance Based Vs. Model Based Learning
Types of Learning
Based on level of supervision
• Supervised (inductive) learning
• Given: training data, desired outputs (labels)
• Unsupervised learning
• Given: training data only (without desired outputs)
• Semi-supervised learning
• Given: training data and a few desired outputs
• Reinforcement learning
• Given: rewards from sequence of actions
Supervised Learning: Regression

• Given (x1, y1), (x2, y2), ..., (xn, yn)


• Learn a function f (x) to predict y given x
– y is real-valued
September Arctic Sea Ice Extent

[Figure: September Arctic sea ice extent (1,000,000 sq km, 0 to 9) versus year (1970 to 2020)]
Supervised Learning: Classification

• Given (x1, y1), (x2, y2), ..., (xn, yn)


• Learn a function f (x) to predict y given x
– y is categorical

[Figure: cancer (benign / malignant): y = 1 (malignant) and y = 0 (benign) plotted against tumor size x]


Supervised Learning: Classification

• Learnt classifier: if x > T, predict malignant; else predict benign
• The threshold x = T splits the tumor-size axis into a "predict benign" region (x ≤ T) and a "predict malignant" region (x > T)

[Figure: 1 (malignant) / 0 (benign) versus tumor size, with the decision threshold at x = T]
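One way such a threshold T could be learnt from labelled data is sketched below: try the midpoints between consecutive tumor sizes and keep the one with the fewest training errors. The function names and the data are made up for illustration; this is not necessarily the method used in the course.

```python
def fit_threshold(sizes, labels):
    """Pick the threshold T that minimizes training errors of the rule
    'predict 1 (malignant) if x > T, else 0 (benign)'."""
    xs = sorted(set(sizes))
    # Candidate thresholds: midpoints between consecutive distinct sizes.
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]

    def errors(t):
        return sum((x > t) != bool(y) for x, y in zip(sizes, labels))

    return min(candidates, key=errors)

def predict(x, t):
    """Apply the learnt rule to a new tumor size x."""
    return 1 if x > t else 0

sizes = [1.0, 1.5, 2.0, 3.5, 4.0, 5.0]   # made-up tumor sizes
labels = [0, 0, 0, 1, 1, 1]              # 0 = benign, 1 = malignant
T = fit_threshold(sizes, labels)
```

With this toy data the classes are perfectly separable, so the selected threshold makes no training errors. The sketch assumes at least two distinct sizes.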
Increasing Feature Dimension

• x can be multi-dimensional
– Each dimension corresponds to an attribute

• Example attributes: tumor size, age, clump thickness, uniformity of cell size, uniformity of cell shape
[Figure: age versus tumor size]
Example: Supervised Learning Techniques

• Linear Regression
• Logistic Regression
• Naïve Bayes Classifiers
• Support Vector Machines (SVMs)
• Decision Trees and Random Forests
• Neural networks
Unsupervised Learning

• Given x1, x2, ..., xn (without labels)


• Output hidden structure behind the x’s
• e.g., clustering
Example: Unsupervised Learning Techniques

• Clustering
• k-Means
• Hierarchical Cluster Analysis
• Expectation Maximization
• Visualization and dimensionality reduction
• Principal Component Analysis (PCA)
• Kernel PCA
• Locally-Linear Embedding (LLE)
• t-distributed Stochastic Neighbor Embedding (t-SNE)
• Association rule learning
• Apriori
• Eclat
Data Visualization
Visualize 2/3D representation of complex unlabelled training data
• Preserve as much structure as possible
• e.g., trying to keep separate clusters in the input space from overlapping in the visualization
• Understand how the data is organized
• Identify unsuspected patterns.

[Figure: 2D visualization with clusters labelled frog, cat, bird, dog, truck, automobile, deer, horse, ship, airplane]
Applications: Unsupervised Learning
• Genomics application: group individuals by genetic similarity
[Figure: matrix of genes versus individuals]
• Organize computing clusters
• Social network analysis
• Market segmentation
• Astronomical data analysis


Semisupervised Learning
Partially labelled data – some labelled data and a lot of unlabelled data
• Combines unsupervised and supervised learning algorithms
• Example: photo hosting services, e.g., Google Photos
Reinforcement Learning
A learning agent
• observes the state of the environment, then selects and performs actions
• gets positive or negative rewards in return
• learns the best strategy, aka a policy, to get the most reward over time
• a policy is a mapping from states to actions

• Examples:
– Game playing, e.g., AlphaGo
– Robot in a maze
– Balance a pole on your hand
Types of Learning
Based on how training data is used
• Batch learning
• Uses all available data at a time during training
• Mini Batch learning
• Uses a subset of the available data at a time during training
• Online (incremental) learning
• Uses single training data instance at a time during training
Types of Learning
Based on how the system generalizes to new data
• Instance Based Learning
• compare new data points to known data points
• Model Based learning
• detect patterns in the training data and build a predictive model

[Figure panels: instance based; model based]


Challenges of Machine Learning

Topics

• Training Data
• Insufficient
• Non representative
• Poor Quality
• Irrelevant attributes
• Model Selection
• Overfitting
• Underfitting
• Testing and Validation
• Hyperparameters
Insufficient Training Data
Consider the trade-off between algorithm development and training data capture
Non-representative Training Data
Training data must be representative of the new cases we want to generalize to
• A small sample size leads to sampling noise
• Example: missing data overemphasizes the role of wealth in happiness
• If the sampling process is flawed, even a large sample can suffer from sampling bias
Data Quality
Cleaning is often needed to improve data quality
• Some instances have missing features
• e.g., 5% of customers did not specify their age
• Ignore those instances altogether, or ignore the feature
• Fill in the missing values
• Train multiple ML models, e.g., one with the feature and one without it
• Some instances can be erroneous, noisy, or outliers
• Human or machine generated
• Identify, discard, or fix manually, as appropriate
Irrelevant Features
Feature engineering needed for coming up with a good set of features
• Feature selection
• more useful features to train on among existing features.
• Feature extraction
• combine existing features to produce a more useful one.
• Create new features by gathering new data
Model Selection
Overfitting or Underfitting
• Overfitting: the model performs well on the training set but poorly on new data
• e.g., a high-degree polynomial life satisfaction model that strongly overfits the training data
• A small training set or sampling noise can lead the model to follow the noise rather than the underlying pattern in the dataset
• Solution: regularization
[Figure: high-order polynomial fit; linear fits with a large slope versus a smaller slope]
• Underfitting: the model is too simple to learn the underlying structure in the data
• Select a more powerful model, with more parameters
• Feed better features to the learning algorithm
• Reduce regularization
Testing and Validation
Performance of ML algorithms is statistical / predictive
• Good ML algorithms need to work well on test data
• But test data is often not accessible to the provider of the algorithm
• Common assumption is training data is representative of test data
• Randomly chosen subset of the training data is held out as validation set
• aka dev set
• Once ML model is trained, its performance is evaluated on validation data
• Expectation is ML model working well on validation set will work well on unknown test data
• Typically 20-30% of the data is randomly held out as validation data
Cross Validation
K-fold cross-validation is often performed
• To reduce the bias introduced by the choice of a single validation set
• Often K is chosen as 10
• aka 10-fold cross-validation
• 10-fold cross-validation involves
• randomly splitting the training data into 10 disjoint folds
• training 10 models, each on 9 folds, with the remaining fold held out as the validation set
• evaluating each model on its held-out validation fold
• averaging the performance over the 10 validation folds
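The K-fold procedure can be sketched in plain Python as follows. The helper names (`k_fold_indices`, `cross_validate`) and the toy "predict the training mean" model are assumptions made for illustration; libraries such as scikit-learn provide their own equivalents.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(xs, ys, k, fit, score):
    """Average validation score over k folds.
    fit(train_x, train_y) -> model; score(model, val_x, val_y) -> float."""
    folds = k_fold_indices(len(xs), k)
    scores = []
    for i, val_idx in enumerate(folds):
        # Train on every fold except fold i, validate on fold i.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = fit([xs[j] for j in train_idx], [ys[j] for j in train_idx])
        scores.append(score(model, [xs[j] for j in val_idx],
                            [ys[j] for j in val_idx]))
    return sum(scores) / k

def fit_mean(train_x, train_y):
    """Toy model: always predict the mean of the training targets."""
    return sum(train_y) / len(train_y)

def mean_abs_error(model, val_x, val_y):
    return sum(abs(model - y) for y in val_y) / len(val_y)

xs = list(range(20))
ys = [5.0] * 20
avg_error = cross_validate(xs, ys, k=10, fit=fit_mean, score=mean_abs_error)
```

Each instance is used for validation exactly once, which is what distinguishes K-fold cross-validation from drawing K independent random validation sets.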
Choice of Hyperparameters
Modern ML models often have many settings that are fixed before training rather than learned from the data
• Known as hyperparameters
• Model performance depends on the choice of hyperparameters
• Each hyperparameter can assume a number of values
• Real numbers or categories
• The number of hyperparameter combinations grows exponentially with the number of hyperparameters
• The best model corresponds to the best cross-validation performance over the set of hyperparameter combinations
• Expensive to perform
• Some empirical frameworks are available for hyperparameter optimization
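An exhaustive search over hyperparameter combinations (grid search) can be sketched as below. The `grid_search` helper, the hyperparameter names `degree` and `alpha`, and the stand-in scoring function are all hypothetical; in practice `evaluate` would run cross-validation for each combination.

```python
from itertools import product

def grid_search(grid, evaluate):
    """Try every combination in grid (dict: name -> list of values) and
    return (best_params, best_score); evaluate returns a score to maximize."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        s = evaluate(params)
        if s > best_score:
            best_params, best_score = params, s
    return best_params, best_score

def fake_cv_score(p):
    """Hypothetical stand-in for a cross-validation score."""
    return -(p["degree"] - 2) ** 2 - abs(p["alpha"] - 0.1)

grid = {"degree": [1, 2, 3], "alpha": [0.01, 0.1, 1.0]}
n_combinations = len(list(product(*grid.values())))
best, score = grid_search(grid, fake_cv_score)
```

With two hyperparameters of three values each there are 3 × 3 = 9 combinations; every additional hyperparameter multiplies this count, which is why the search becomes expensive.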
End to End Machine Learning

Topics
Frame the Machine Learning Problem
• Business Context Understanding
• Model Selection
• Performance Metric Selection
• Check the assumptions
Key Elements of a Machine Learning Project

• Framing a machine learning problem


• Data types and representation
• Data Pre processing
• Data Visualization and Analysis
• Feature Engineering
• Model Building and testing
An Example Problem Statement
Start with the Big Picture

• Build a model of housing prices using the census data.


• Data attributes <population, median income, median housing price, ….> and so on for each district

• Districts are the smallest geographical unit (population ~600 – 3000)

• The model to predict the median housing price in any district, given all the other metrics.

• Goodness of the model is determined by how close the model output is w.r.t. actual price for
unseen district data
Framing the Problem
Understand the Business Objective and Context

• What is the expected usage and benefit?


• impacts the choice of algorithms, goodness measure, and effort in lifecycle management of the model

• What is the baseline method and its performance?


Choice of Model
Choose between Different techniques

• Supervised or unsupervised? Prediction or classification? Online vs. batch? Instance-based or model-based?
• Analyze the dataset
• Each instance comes with the expected output, i.e., the district’s median housing price.
• supervised
• Goal is to predict a real-valued price based on multiple variables like population, income, etc.
• regression
• Output is based on input data at rest, not rapidly changing data.
• Dataset small enough to fit in memory
• batch
• So, it’s a supervised multivariate batch regression problem
Choice of Performance Metrics
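A typical choice for regression: Root Mean Square Error (RMSE)

RMSE(X, h) = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^2}

• h is the model, X is the training dataset, m is the number of instances, x^{(i)} is the i-th instance, and y^{(i)} is the actual price for the i-th instance.
Mean Absolute Error (MAE)

MAE(X, h) = \frac{1}{m} \sum_{i=1}^{m} \left| h(x^{(i)}) - y^{(i)} \right|

• MAE is preferred when the data contains a large number of outliers. It corresponds to the L1 norm, aka Manhattan distance/norm.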
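The two metrics can be computed directly from paired predictions and actual values. The sketch below uses made-up prices in which one prediction is far off, to show that RMSE reacts more strongly to a large error than MAE does.

```python
import math

def rmse(predictions, targets):
    """Root mean square error over m paired predictions and targets."""
    m = len(targets)
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(predictions, targets)) / m)

def mae(predictions, targets):
    """Mean absolute error over m paired predictions and targets."""
    m = len(targets)
    return sum(abs(p - y) for p, y in zip(predictions, targets)) / m

# Made-up prices: the last prediction is off by 100 and acts like an outlier.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 300.0]
rmse_value = rmse(predicted, actual)
mae_value = mae(predicted, actual)
```

Because RMSE squares each error before averaging, the single large error dominates it, whereas MAE weights all errors linearly.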
General Form
Data Types and Representation

Topics
Data Types and Representations
• What is Data?
• Types and properties of attributes of data
• Categorization of attributes
• Characteristics of Data
• Types of Data
What is Data?
Collection of data objects and attributes
• An attribute is a property or characteristic of an object
• Examples: eye color of a person, temperature, etc.
• aka variable, field, characteristic, dimension, or feature
• A collection of attributes describe an object
• aka record, point, case, sample, entity, or instance
• Attribute values are numbers or symbols assigned to an attribute for a particular object
• Distinction between attributes and attribute values
• Same attribute can be mapped to different attribute values
• Example: height can be measured in feet or meters
• Different attributes can be mapped to the same set of values
• Example: Attribute values for ID and age are integers
• But properties of attribute can be different than the properties of the values used to represent the attribute

Example (columns are attributes, rows are objects):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Discrete and Continuous Attributes

• Discrete Attribute
• Has only a finite or countably infinite set of values
• Examples: zip codes, counts, or the set of words in a collection of documents
• Often represented as integer variables.
• Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
• Has real numbers as attribute values
• Examples: temperature, height, or weight.
• Practically, real values can only be measured and represented using a finite number of digits.
• Continuous attributes are typically represented as floating-point variables.
Types and properties of Attributes
Types
• Nominal
• ID numbers, eye color, zip codes
• Ordinal
• rankings (e.g., taste of potato chips on a scale from 1-10), grades, height {tall, medium, short}
• Interval
• calendar dates, temperatures in Celsius or Fahrenheit.
• Ratio
• temperature in Kelvin, length, counts, elapsed time (e.g., time to run a race)
Properties
• Distinctness: =, ≠
• Order: <, >
• Differences are meaningful: +, -
• Ratios are meaningful: *, /
Difference Between Ratio and Interval

• Is it physically meaningful to say that a temperature of 10° is twice that of 5° on
• the Celsius scale?
• the Fahrenheit scale?
• the Kelvin scale?
• Consider measuring the height above average
• If Bill’s height is three inches above average and Bob’s height is six inches above average, then would
we say that Bob is twice as tall as Bill?
• Is this situation analogous to that of temperature?
Categorization of Attributes
Attribute Type | Description | Examples | Operations
Nominal (categorical, qualitative) | Nominal attribute values only distinguish. (=, ≠) | zip codes, employee ID numbers, eye color, sex: {male, female} | mode, entropy, contingency correlation, χ² test
Ordinal (categorical, qualitative) | Ordinal attribute values also order objects. (<, >) | hardness of minerals, {good, better, best}, grades, street numbers | median, percentiles, rank correlation, run tests, sign tests
Interval (numeric, quantitative) | For interval attributes, differences between values are meaningful. (+, -) | calendar dates, temperature in Celsius or Fahrenheit | mean, standard deviation, Pearson's correlation, t and F tests
Ratio (numeric, quantitative) | For ratio variables, both differences and ratios are meaningful. (*, /) | temperature in Kelvin, monetary quantities, counts, age, mass, length, current | geometric mean, harmonic mean, percent variation
Transformation of Attributes

Attribute Type | Transformation | Comments
Nominal (categorical, qualitative) | Any permutation of values | If all employee ID numbers were reassigned, would it make any difference?
Ordinal (categorical, qualitative) | An order preserving change of values, i.e., new_value = f(old_value) where f is a monotonic function | An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.
Interval (numeric, quantitative) | new_value = a * old_value + b where a and b are constants | Thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree).
Ratio (numeric, quantitative) | new_value = a * old_value | Length can be measured in meters or feet.
Asymmetric Attributes

• Only presence (a non-zero attribute value) is regarded as important


• Words present in documents
• Items present in customer transactions
Key Messages for Attribute Types

• The types of operations you choose should be “meaningful” for the type of data you have
• Distinctness, order, meaningful intervals, and meaningful ratios are only four (among many possible) properties of data

• The data type you see – often numbers or strings – may not capture all the properties or may suggest properties that are not present

• Analysis may depend on these other properties of the data


• Many statistical analyses depend only on the distribution

• In the end, what is meaningful can be specific to domain


Important Characteristics of Data

• Dimensionality (number of attributes)


• High dimensional data brings a number of challenges
• Sparsity
• Only presence counts
• Resolution
• Patterns depend on the scale
• Size
• Type of analysis may depend on size of data
Types of data sets

• Record
• Data Matrix
• Document Data
• Transaction Data
• Graph
• World Wide Web
• Molecular Structures
• Ordered
• Spatial Data
• Temporal Data
• Sequential Data
• Genetic Sequence Data
Record Data

• Data that consists of a collection of records, each of which consists of a fixed set of attributes

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Data Matrix

• If data objects have the same fixed set of numeric attributes, then the data objects can be
thought of as points in a multi-dimensional space, where each dimension represents a distinct
attribute
• Such a data set can be represented by an m by n matrix, where there are m rows, one for each
object, and n columns, one for each attribute

Projection of x Load  Projection of y load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
Document Data

• Each document becomes a ‘term’ vector


• Each term is a component (attribute) of the vector
• The value of each component is the number of times the corresponding term occurs in the document.

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0
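Building term vectors from raw text can be sketched as follows, assuming simple lowercasing and whitespace tokenization; the helper name `term_vectors` and the two short documents are made up for illustration.

```python
from collections import Counter

def term_vectors(documents):
    """Return (vocabulary, vectors): one term-count vector per document,
    with components ordered by the sorted vocabulary."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Counter returns 0 for terms that do not occur in this document.
        vectors.append([counts[t] for t in vocabulary])
    return vocabulary, vectors

docs = ["team play play score", "coach team timeout"]
vocab, vecs = term_vectors(docs)
```

Each position in a vector corresponds to one vocabulary term, so all documents share the same dimensionality regardless of their length.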
Transaction Data

• A special type of data, where


• Each transaction involves a set of items.
• For example, consider a grocery store. The set of products purchased by a customer during one
shopping trip constitute a transaction, while the individual products that were purchased are the items.
• Can represent transaction data as record data

TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
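The transactions in the table above can be represented as record data by giving each item its own binary attribute. The variable names in this sketch are illustrative.

```python
# The five transactions from the table, keyed by TID.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# One column per distinct item, in alphabetical order.
items = sorted(set().union(*transactions.values()))

# Each record has a 1 where the item is present in the basket and 0 otherwise.
records = {tid: [int(item in basket) for item in items]
           for tid, basket in transactions.items()}
```

The resulting attributes are asymmetric: only the 1 entries (presence of an item) are regarded as important.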
Graph Data

[Figure: example graphs, including a benzene molecule (C6H6)]


Ordered Data
Sequence of transactions

[Figure: a sequence of transactions; each element of the sequence is a set of items/events]
Ordered Data
Genomic sequence data

GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Ordered Data

SpatioTemporal Data
Average Monthly Temperature of land and ocean
Data Preprocessing

Topics
Data Preprocessing
• Data Quality
• Noise, Outlier, Missing attribute, Duplicate record
• Data Transformation
• Aggregation
• Sampling
• Discretization and Binarization
• Attribute Transformation
• Dimensionality Reduction
• Feature subset selection
• Feature creation
Data Quality

• Poor data quality negatively affects many data processing efforts


• For example, a classification model for detecting people who are loan risks is built using poor
data
• Some credit-worthy candidates are denied loans
• More loans are given to individuals that default
Data Quality …

• What kinds of data quality problems?


• How can we detect problems with the data?
• What can we do about these problems?

• Examples of data quality problems:


• Noise and outliers
• Wrong data
• Fake data
• Missing values
• Duplicate data
Noise

• For objects, noise is an extraneous object


• For attributes, noise refers to modification of original values
• Examples: distortion of a person’s voice when talking on a poor phone and “snow” on television screen
• The figures below show two sine waves of the same magnitude and different frequencies, the waves
combined, and the two sine waves with random noise
• The magnitude and shape of the original signal is distorted

[Figure: three panels of magnitude versus time (0 to 0.5 seconds): two sine waves; observed signal (sum of the two sine waves); observed signal with noise]
Outliers

• Outliers are data objects with characteristics that are considerably different than most of the
other data objects in the data set
• Case 1: Outliers are noise that interferes
with data analysis

• Case 2: Outliers are the goal of our analysis


• Credit card fraud
• Intrusion detection
Missing Values

• Reasons for missing values


• Information is not collected
(e.g., people decline to give their age and weight)
• Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)
• Handling missing values
• Eliminate data objects or variables
• Estimate missing values
• Example: time series of temperature
• Example: census results
• Ignore the missing value during analysis
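Two of the handling strategies above, eliminating objects and estimating missing values, can be sketched as follows. Missing values are assumed to be encoded as `None`, mean imputation is just one possible estimator, and the records are made up for illustration.

```python
def drop_missing(rows):
    """Eliminate data objects that have any missing (None) attribute."""
    return [r for r in rows if None not in r.values()]

def impute_mean(rows, key):
    """Estimate missing values of one attribute by the mean of the known ones."""
    known = [r[key] for r in rows if r[key] is not None]
    mean = sum(known) / len(known)
    return [{**r, key: mean if r[key] is None else r[key]} for r in rows]

# Made-up records; the second person declined to give their age.
people = [
    {"age": 30, "weight": 70.0},
    {"age": None, "weight": 82.5},
    {"age": 50, "weight": 64.0},
]
complete_only = drop_missing(people)
imputed = impute_mean(people, "age")
```

Dropping rows loses the weight information for the second person, while imputation keeps the row at the cost of introducing an estimated value.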
Duplicate Data

• Data set may include data objects that are duplicates, or almost duplicates of one another
• Major issue when merging data from heterogeneous sources

• Examples:
• Same person with multiple email addresses

• Data cleaning
• Process of dealing with duplicate data issues
Data Preprocessing

• Aggregation
• Sampling
• Discretization and Binarization
• Attribute Transformation
• Dimensionality Reduction
• Feature subset selection
• Feature creation
Aggregation
Combining two or more attributes (or objects) into a single attribute (or object)
• Purpose
• Data reduction
• Change of scale
• More “stable” data

[Figure panels: standard deviation of average monthly precipitation; standard deviation of average yearly precipitation]
Sampling
Primary technique used for data reduction
• It is often used for both the preliminary investigation of the data and the final data analysis.
• Statisticians often sample because obtaining the entire set of data of interest is too expensive or
time consuming.
• Sampling is typically used in machine learning because processing the entire set of data of
interest may be too expensive or time consuming.
• Key Principle
• Using a sample will work almost as well as using the entire data set, if the sample is representative
• A sample is representative if it has approximately the same properties as the original dataset

[Figure panels: 8000 points; 2000 points; 500 points]


Sampling …

• Simple Random Sampling
• There is an equal probability of selecting any particular item
• Sampling without replacement
• As each item is selected, it is removed from the population
• Sampling with replacement
• Objects are not removed from the population as they are selected for the sample.
• In sampling with replacement, the same object can be picked up more than once
• Stratified sampling
• Split the data into several partitions; then draw random samples from each partition
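The sampling schemes above can be sketched with Python's standard `random` module. The population, the parity-based strata, and the helper `stratified_sample` are assumptions made for illustration.

```python
import random
from collections import defaultdict

rng = random.Random(42)
population = list(range(100))

# Sampling without replacement: each item can be selected at most once.
without_replacement = rng.sample(population, 10)

# Sampling with replacement: the same item can be picked more than once.
with_replacement = rng.choices(population, k=10)

def stratified_sample(items, key, per_stratum, rng):
    """Partition items by key(item), then draw a random sample
    (without replacement) from each partition."""
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for group in strata.values():
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample

# Two strata (even and odd numbers), five items drawn from each.
strat = stratified_sample(population, key=lambda v: v % 2, per_stratum=5, rng=rng)
```

Stratified sampling guarantees that every partition is represented, which simple random sampling does not when some groups are rare.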
Discretization
Process of converting a continuous attribute into an ordinal attribute
• A potentially infinite number of values are mapped into a small number of categories
• Discretization is used in both unsupervised and supervised settings

[Figure: data consisting of four groups of points and two outliers]

Unsupervised Discretization

[Figure panels: equal interval width approach; equal frequency approach; K-means approach]
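Two of the unsupervised approaches, equal interval width and equal frequency, can be sketched as follows. The function names and the data are made up for illustration, and the equal-width version assumes the values are not all identical.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k bins of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding about the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

data = [1, 2, 3, 4, 5, 6, 7, 8, 100]   # made-up values with one outlier
width_bins = equal_width_bins(data, 3)
freq_bins = equal_frequency_bins(data, 3)
```

The outlier stretches the equal-width bins so that almost all values fall into the first bin, whereas equal-frequency binning still spreads the values evenly across bins.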


Discretization in Supervised Settings

• Many classification algorithms work best if both the independent and dependent variables have
only a few values
• Example: Iris Plant data set.
• http://www.ics.uci.edu/~mlearn/MLRepository.html
• Three flower types (classes):
• Setosa
• Versicolour
• Virginica
• Four (non-class) attributes
• Sepal width and length
• Petal width and length
Supervised Discretization Example …
How can we tell what the best discretization is?
• Supervised discretization: Use class labels to find breaks
[Figure: histogram of counts (0 to 50) versus petal length (0 to 8)]

• Petal width low or petal length low implies Setosa.


• Petal width medium or petal length medium implies Versicolour.
• Petal width high or petal length high implies Virginica.
Binarization
Maps a continuous or categorical attribute into one or more binary variables
• Often convert a continuous attribute to a categorical attribute and then convert a categorical
attribute to a set of binary attributes
• Examples: eye color and height measured as {low, medium, high}
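A minimal sketch of binarization for the two examples above: a categorical attribute (eye color) is mapped to one binary variable per category, and a continuous attribute (height) is first discretized and then binarized. The cut points of 160 cm and 180 cm and the function names are arbitrary assumptions.

```python
def one_hot(values):
    """Map each categorical value to a list of 0/1 indicator variables,
    one per category (categories in sorted order)."""
    categories = sorted(set(values))
    return categories, [[int(v == c) for c in categories] for v in values]

def height_category(cm):
    """Discretize a continuous height; the cut points are arbitrary assumptions."""
    if cm < 160:
        return "low"
    if cm < 180:
        return "medium"
    return "high"

eye_categories, eye_bits = one_hot(["brown", "blue", "green", "brown"])
height_categories, height_bits = one_hot(
    [height_category(h) for h in [150, 170, 190]]
)
```

Each object ends up with exactly one 1 among its indicator variables for a given original attribute.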
Attribute Transformation
Maps the entire set of values of a given attribute to a new set of replacement values
• Each original value can be identified with one of the new values
• Simple functions: x^k, log(x), e^x, |x|
• Normalization
• Refers to various techniques to adjust to differences among attributes in terms of frequency of occurrence,
mean, variance, range
• Take out unwanted, common signal, e.g., seasonality
• In statistics, standardization refers to subtracting off the means and dividing by the standard deviation
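Standardization as described above can be sketched with the standard library. This version divides by the population standard deviation; using the sample standard deviation instead is an equally common choice. The input values are made up.

```python
import statistics

def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Made-up values with mean 5 and population standard deviation 2.
z = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

After the transformation the values have mean 0 and standard deviation 1, so attributes measured on different scales become comparable.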
Curse of Dimensionality

• When dimensionality increases, data becomes increasingly sparse in the space that it occupies
• Definitions of density and distance between points, which are critical for clustering and outlier
detection, become less meaningful
Dimensionality Reduction
Purpose
• Avoid curse of dimensionality
• Reduce amount of time and memory required by ML algorithms
• Allow data to be more easily visualized
• May help to eliminate irrelevant features or reduce noise
• Popular Techniques
• Principal Components Analysis (PCA)
• Singular Value Decomposition
• Find a projection that captures the largest amount of variation in data

[Figure: 2D data on axes x1 and x2 with projection direction e]
Dimensionality Reduction: PCA
Increasing the number of components improves the quality of reconstruction
Feature Subset Selection
Another way to reduce dimensionality of data
• Redundant features
• Duplicate much or all of the information contained in one or more other attributes
• Example: purchase price of a product and the amount of sales tax paid
• Irrelevant features
• Contain no information that is useful for the ML task at hand
• Example: students' ID is often irrelevant to the task of predicting students' GPA
• Many techniques developed, especially for classification
Feature Creation
Create new attributes that can capture the important information in a data set
• More efficiently than the original attributes
• Three general methodologies
• Feature extraction
• Example: extracting edges from images
• Feature construction
• Example: dividing mass by volume to get density
• Mapping data to new space
• Example: Fourier and wavelet analysis
Data Analysis

Topics
Data Analysis
• Summary Statistics
• Categorical
• Numerical
• Distance Measures
• Euclidean
• Minkowski
• Mahalanobis
• Cosine
• Correlation
What is data exploration?
A preliminary exploration of the data to better understand its characteristics
• Key motivations of data exploration include
• Helping to select the right tool for preprocessing or analysis
• Making use of humans’ abilities to recognize patterns
• People can recognize patterns not captured by data analysis tools
• Related to the area of Exploratory Data Analysis (EDA)
• The focus was on visualization
• Clustering and anomaly detection were viewed as exploratory techniques
• In machine learning, clustering and anomaly detection are major areas of interest, and not thought of as
just exploratory
Summary Statistics of Data
Numbers that summarize properties of the data
• Summarized properties include frequency, location and spread
• Examples: location – mean, spread - standard deviation
• Most summary statistics can be calculated in a single pass through the data
Frequency and Mode
Typically used with categorical data
• Frequency of an attribute value is the percentage of time the value occurs in the data set
• The mode of an attribute is the most frequent attribute value
• The notions of frequency and mode are typically used with categorical data
Percentiles
Typically used for continuous data
• For continuous data, the notion of a percentile is more useful.
• Given an ordinal or continuous attribute x and a number p between 0 and 100, the pth percentile
is a value x_p of x such that p% of the observed values of x are less than x_p.
• For example, the 50th percentile is the value x_50% such that 50% of all values of x are less than x_50%.
Measures of Location: Mean and Median

• The mean is the most common measure of the location of a set of points.
• However, the mean is very sensitive to outliers.
• Thus, the median or a trimmed mean is also commonly used.

• Assuming sorted {xi}
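A minimal sketch of these location measures, using Python's standard `statistics` module. The sample values and the `trimmed_mean` helper are illustrative assumptions, not part of the slides. For sorted values, the median is the middle value when the count is odd and the average of the two middle values when it is even.

```python
import statistics

def trimmed_mean(values, trim_fraction):
    """Mean after dropping trim_fraction of the values from each end (hypothetical helper)."""
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)  # number of values dropped per side
    kept = ordered[k:len(ordered) - k] if k else ordered
    return statistics.mean(kept)

# Hypothetical sample with one outlier (90)
data = [1, 2, 3, 4, 5, 90]

mean_value = statistics.mean(data)        # pulled upward by the outlier
median_value = statistics.median(data)    # average of the two middle values
trimmed_value = trimmed_mean(data, 0.2)   # drops 1 and 90 before averaging

print(mean_value, median_value, trimmed_value)
```

The single outlier moves the mean far from the bulk of the data, while the median and the trimmed mean stay close to it.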


Measures of Spread: Range and Variance

• Range is the difference between the max and min


• The variance or standard deviation sx is the most common measure of the spread of a set of
points.

• Because of outliers, other measures are often used. Average Absolute Distance is given by
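A small sketch of these spread measures. It assumes the slide's Average Absolute Distance is the mean absolute deviation about the mean, AAD(x) = (1/m) Σ |x_i − mean(x)|; the sample data are hypothetical.

```python
import statistics

# Hypothetical sample with one outlier (90)
data = [1, 2, 3, 4, 5, 90]

value_range = max(data) - min(data)
variance = statistics.variance(data)   # sample variance (divides by m - 1)
std_dev = statistics.stdev(data)       # square root of the sample variance

# Assumed definition: average absolute distance from the mean
mean_value = statistics.mean(data)
aad = sum(abs(x - mean_value) for x in data) / len(data)

print(value_range, variance, std_dev, aad)
```

Because deviations are not squared, the average absolute distance is less dominated by the outlier than the variance is.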
Similarity and Dissimilarity Measures

• Similarity measure
• Numerical measure of how alike two data objects are.
• Is higher when objects are more alike.
• Often falls in the range [0,1]
• Dissimilarity measure
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies
• Proximity refers to a similarity or dissimilarity
Similarity/Dissimilarity for Simple Attributes

The following table shows the similarity and dissimilarity between two objects, x and
y, with respect to a single, simple attribute.
Euclidean Distance

• Euclidean Distance

d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}

where n is the number of dimensions (attributes) and x_k and y_k are, respectively, the kth attributes
(components) of data objects x and y.

• Standardization is necessary, if scales differ.


Euclidean Distance

Figure: the four points p1 to p4 plotted in the x-y plane

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

      p1      p2      p3      p4
p1    0       2.828   3.162   5.099
p2    2.828   0       1.414   3.162
p3    3.162   1.414   0       2
p4    5.099   3.162   2       0
Distance Matrix
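The distance matrix for the four points p1 to p4 can be reproduced with a few lines of Python. This is a sketch; the function and variable names are illustrative.

```python
import math

# The four points from the slide
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def euclidean(x, y):
    """Square root of the summed squared differences over all attributes."""
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

names = list(points)
matrix = [[round(euclidean(points[a], points[b]), 3) for b in names] for a in names]

for name, row in zip(names, matrix):
    print(name, row)
```

If the attributes had very different scales, each attribute would first be standardized (for example, by subtracting its mean and dividing by its standard deviation) so that no single attribute dominates the distance.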
Minkowski Distance

• Minkowski Distance is a generalization of Euclidean Distance

d(x, y) = \left( \sum_{k=1}^{n} |x_k - y_k|^r \right)^{1/r}

where r is a parameter, n is the number of dimensions (attributes) and x_k and y_k are,
respectively, the kth attributes (components) of data objects x and y.
Minkowski Distance: Examples

• r = 1. City block (Manhattan, taxicab, L1 norm) distance.


• A common example of this for binary vectors is the Hamming distance, which is just the
number of bits that are different between two binary vectors

• r = 2. Euclidean distance

• r → ∞. “supremum” (L_max norm, L_∞ norm) distance.


• This is the maximum difference between any component of the vectors

• Do not confuse r with n, i.e., all these distances are defined for all numbers of dimensions.
Minkowski Distance

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

L1    p1   p2   p3   p4
p1    0    4    4    6
p2    4    0    2    4
p3    4    2    0    2
p4    6    4    2    0

L2    p1      p2      p3      p4
p1    0       2.828   3.162   5.099
p2    2.828   0       1.414   3.162
p3    3.162   1.414   0       2
p4    5.099   3.162   2       0

L∞    p1   p2   p3   p4
p1    0    2    3    5
p2    2    0    1    3
p3    3    1    0    2
p4    5    3    2    0

Distance Matrix
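A short sketch of the Minkowski distance for r = 1, r = 2 and r → ∞, checked against entries of the matrices for p1 to p4. The function name and the use of `math.inf` to select the supremum distance are illustrative choices.

```python
import math

def minkowski(x, y, r):
    """Minkowski distance with parameter r; r = math.inf gives the supremum distance."""
    diffs = [abs(xk - yk) for xk, yk in zip(x, y)]
    if math.isinf(r):
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1 / r)

p1, p2, p4 = (0, 2), (2, 0), (5, 1)

l1 = minkowski(p1, p2, 1)            # city block distance
l2 = minkowski(p1, p2, 2)            # Euclidean distance
linf = minkowski(p1, p2, math.inf)   # largest single-attribute difference

print(l1, round(l2, 3), linf)
print(minkowski(p1, p4, 1), minkowski(p1, p4, math.inf))
```

Note that r only changes how the per-attribute differences are combined; the same function works for any number of dimensions n.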
Correlation measures the linear relationship between objects
Visually Evaluating Correlation

Scatter plots showing the similarity from –1 to 1.
Drawback of Correlation

• x = (-3, -2, -1, 0, 1, 2, 3)
• y = (9, 4, 1, 0, 1, 4, 9)
• y_i = x_i^2
• mean(x) = 0, mean(y) = 4
• std(x) = 2.16, std(y) = 3.74
• corr = [(-3)(5) + (-2)(0) + (-1)(-3) + (0)(-4) + (1)(-3) + (2)(0) + (3)(5)] / (6 * 2.16 * 3.74) = 0

Figure: plot of Y against X showing the parabola y = x^2
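The calculation can be checked numerically. The sketch below uses the sample standard deviation (dividing by n − 1 = 6), which matches the values 2.16 and 3.74 on the slide.

```python
import statistics

x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]   # y depends on x exactly, but not linearly

mean_x, mean_y = statistics.mean(x), statistics.mean(y)
std_x, std_y = statistics.stdev(x), statistics.stdev(y)

# Sample covariance, then Pearson correlation
covariance = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (len(x) - 1)
corr = covariance / (std_x * std_y)

print(round(std_x, 2), round(std_y, 2), corr)
```

The correlation is zero even though y is fully determined by x, which is why correlation should be read as a measure of linear relationship only.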
Comparison of Proximity Measures

• Domain of application
• Similarity measures tend to be specific to the type of attribute and data
• Record data, images, graphs, sequences, 3D-protein structure, etc. tend to have different measures
• However, one can talk about various properties that you would like a proximity measure to have
• Symmetry is a common one
• Tolerance to noise and outliers is another
• Ability to find more types of patterns?
• Many others possible
• The measure must be applicable to the data and produce results that agree with domain
knowledge
Visualization
Conversion of data into a visual or tabular format
• Visualization of data is one of the most powerful and
appealing techniques for data exploration.
• Humans have a well developed ability to analyze large
amounts of information that is presented visually
• Can detect general patterns and trends
• Can detect outliers and unusual patterns
• Data objects, their attributes, and the relationships
among data objects are translated into graphical
elements such as points, lines, shapes, and colors.
• Example:
• Objects are often represented as points
• Their attribute values can be represented as the
position of the points or the characteristics of the
points, e.g., color, size, and shape
Sea Surface Temperature (SST) for July 1982
Visualization Techniques: Histograms

• Histogram
• Usually shows the distribution of values of a single variable
• Divide the values into bins and show a bar plot of the number of objects in each
bin.
• The height of each bar indicates the number of objects
• Shape of histogram depends on the number of bins
• Example: Petal Width (10 and 20 bins, respectively)
Two-Dimensional Histograms

• Show the joint distribution of the values of two attributes


• Example: petal width and petal length
Visualization Techniques: Box Plots
Display distribution of data
Figure: box plot marking an outlier and the 90th, 75th, 50th, 25th and 10th percentiles
Visualization Techniques: Scatter Plots
Attributes values determine the position
• Arrays of scatter plots can compactly summarize the relationships of several pairs
of attributes
Visualization Techniques: Contour Plots
• Useful when a continuous attribute is measured on a spatial grid
• They partition the plane into regions of similar values
• The contour lines that form the boundaries of these regions connect points with equal values
• Common example: contour maps of elevation, temperature, rainfall, air pressure, etc.

Figure: contour plot of sea surface temperature (Celsius)
Visualization Techniques: Matrix Plots
Plot data matrix
• Often useful when objects are sorted according to class
• Plots of similarity or distance matrices can also be useful for visualizing the relationships

Visualization of the Iris Data Matrix

Visualization of the Iris Correlation Matrix
Other Visualization Techniques

• Star Plots
• Similar approach to parallel coordinates, but axes radiate from a central point
• The line connecting the values of an object is a polygon
• Chernoff Faces
• Approach created by Herman Chernoff
• This approach associates each attribute with a characteristic of a face
• The values of each attribute determine the appearance of the corresponding facial characteristic
• Each object becomes a separate face
• Relies on humans' ability to distinguish faces
Star Plots for Iris Data

Figure panels: Setosa, Versicolour, Virginica
Chernoff Faces for Iris Data

Figure panels: Setosa, Versicolour, Virginica
Software Tools

Topics
Software Tools for ML
• Notebooks and Colab
• Python and important libraries
• A few important functions
• Missing data
• Attribute Visualization
• Data Cleaning
• Attribute Combination
• Handling categorical attributes
• Test dataset Creation
Prerequisites
• Jupyter Notebook
• Open-source web application to create and share live code, equations, visualizations and narrative text.
• Used for data cleaning and transformation, numerical simulation, statistical modeling, data visualization,
machine learning, and much more.
• Easy to tinker with code and execute it in steps
• Run in local browser: https://jupyter.org/try
• Local installation: https://jupyter.org/install.html
• Repository of Jupyter Notebooks: https://github.com/ageron/handson-ml
• Colab
• Google’s flavor of Jupyter notebooks tailored for machine learning and data analysis
• Runs entirely in the cloud.
• free access to hardware accelerators like GPUs and TPUs (with some restrictions).
• http://colab.research.google.com
• Popular Datasets
• UC Irvine Machine Learning Repository: http://archive.ics.uci.edu/ml/
• Kaggle datasets: https://www.kaggle.com/datasets
• Amazon’s AWS datasets: http://aws.amazon.com/fr/datasets/
Local Installation
Preferred Mode
• Install a virtual environment via Anaconda
• Free Anaconda Python distribution: https://www.anaconda.com/products/individual
• Download Python 3.* version
Prerequisites
Python and several scientific libraries
• Python
• high-level, dynamically typed multiparadigm programming language.
• almost like pseudocode, very readable
• Platform: Linux/Windows/MacOS
• Documentation: https://www.python.org/doc/
• Interactive Python tutorial: http://learnpython.org/
• Stand alone installation: (preferably 3.7 or higher) https://www.python.org/downloads/
• scikit-learn
• Open source library that supports supervised and unsupervised learning.
• Various model fitting, data preprocessing, model selection and evaluation tools, and other utilities.
• Platform: Linux/Windows/MacOS
• User guide: https://scikit-learn.org/stable/user_guide.html
• Stand alone installation: https://scikit-learn.org/stable/install.html
Prerequisites
Important libraries
• Numpy
• Scientific computing library with Python
• Provides high-performance multidimensional array (e.g., vector, matrix) and basic tools to compute with
and manipulate these arrays, linear algebra, random number generators
• https://numpy.org/doc/stable/reference/
• https://numpy.org/install/
• SciPy
• Builds on numpy, provides functions that operate on numpy arrays
• Useful for different types of scientific and engineering applications, integration, image processing, …
• https://docs.scipy.org/doc/scipy/reference/tutorial/index.html
• Standalone installation: https://scipy.org/install.html
• Matplotlib
• Plotting library
• https://matplotlib.org/
• https://matplotlib.org/tutorials/index.html
• https://matplotlib.org/users/installing.html
Backup
Important ML Steps and Functions
https://github.com/ageron/handson-ml/blob/master/02_end_to_end_machine_learning_project.ipynb

• 20,640 data instances in the dataset – fairly small


• 207 districts are missing total_bedrooms attribute
• except ocean_proximity, rest of the attributes numerical
• type of ocean_proximity is object, so it must be a text attribute - a repeated categorical attribute!
Attribute Visualization

Figure: histograms of the attributes (age, latitude, # of households, longitude, price, income)


Observations from Histogram

• Median income attribute not expressed in US dollars (USD).


• Scaled and capped at 15 for higher median incomes, and at 0.5 for lower median incomes.
• The housing median age and the median house value are also capped.
• Machine Learning algorithms may learn that prices never go beyond that limit.
• If precise predictions are needed even beyond $500,000, either
• Collect proper labels for the districts whose labels were capped, or
• Remove those districts from the training set and also from the test set
• Attributes have very different scales.
• Finally, many histograms are tail heavy.
• Transformations may be needed to obtain more bell-shaped distributions
Visualizing Training Data for Insights
Correlations in Training Data

• House prices are strongly correlated with income


• Small negative correlation between the latitude and the median house value
• i.e., prices have a slight tendency to go down when you go north
• Price cap clearly visible as a horizontal line at $500K.
• Additional horizontal lines around $450K, $350K, perhaps one around $280K, ...
Experimenting with Attribute Combinations

• Total number of rooms in a district not very useful


• What may be important is the number of rooms per household.
• Similarly, the total number of bedrooms compared to the number of rooms may be more meaningful

• rooms_per_household is more correlated with price than total number of bedrooms or rooms.
Data Cleaning
• Some of the instances don’t have total_bedrooms values

• Imputer utility in scikit-learn

• Since the median applies only to numerical data, create a copy of the data without ocean_proximity

• The imputer stores the median of each attribute in its statistics_ instance variable.

• Use this “trained” imputer to replace any missing values by the learned medians
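The fit-then-transform idea behind median imputation can be sketched in plain Python. This is not scikit-learn code; in current scikit-learn releases the corresponding utility is `SimpleImputer(strategy="median")`. The sample rows, the `total_rooms` column and the helper names below are assumptions for illustration.

```python
import statistics

# Hypothetical numerical rows; None marks a missing total_bedrooms value
rows = [
    {"total_rooms": 880, "total_bedrooms": 129},
    {"total_rooms": 7099, "total_bedrooms": None},
    {"total_rooms": 1467, "total_bedrooms": 190},
    {"total_rooms": 1274, "total_bedrooms": 235},
]

def fit_medians(rows):
    """Learn the median of each attribute, ignoring missing values."""
    return {
        col: statistics.median(r[col] for r in rows if r[col] is not None)
        for col in rows[0]
    }

def transform(rows, medians):
    """Return new rows with every missing value replaced by the learned median."""
    return [
        {col: (medians[col] if value is None else value) for col, value in r.items()}
        for r in rows
    ]

medians = fit_medians(rows)        # plays the role of the imputer's stored statistics
filled = transform(rows, medians)

print(medians)
print(filled[1])
```

Learning the medians on the training data and reusing them later means the same replacement values can be applied to the test set and to new data.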
Handling Textual and Categorical Data
Convert the text attributes to numbers
• Scikit-learn provides a transformer

• ML algorithms assume closer values are more similar. But ‘<1H OCEAN’ and ‘NEAR OCEAN’ are far away in
values!
• Solution! Use 1-hot encoding
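A minimal sketch of one-hot encoding in plain Python, to show what the encoded data look like. The helper name is hypothetical, and "INLAND" is an assumed additional category value used only for illustration.

```python
def one_hot_encode(values):
    """Map each category to a binary vector with a single 1 (hypothetical helper)."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        encoded.append(row)
    return categories, encoded

# Sample ocean_proximity values; "INLAND" is an assumed category
ocean_proximity = ["<1H OCEAN", "NEAR OCEAN", "INLAND", "<1H OCEAN"]
categories, encoded = one_hot_encode(ocean_proximity)

print(categories)
for value, row in zip(ocean_proximity, encoded):
    print(value, row)
```

Every pair of distinct categories ends up equally far apart, so the encoding no longer implies that some categories are closer to each other than others.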
Test Set Creation

• Test set size is 20% of the dataset
• Fixing the random seed ensures reproducibility of results
• Uses uniform sampling of data.
• Not appropriate for heavy tailed distribution
• Income is very important to predict house prices.
• Important that test set is representative of the various categories of incomes in the dataset.
• Most income values are around $20–$50K, but some >> $60K.

Histogram of income categories
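A stratified split keeps the proportion of each income category the same in the training and test sets. The sketch below is a plain-Python illustration with hypothetical data (100 districts, two income categories); the function name, the 20% ratio and the seed value are assumptions.

```python
import random
from collections import defaultdict

def stratified_split(items, strata, test_ratio=0.2, seed=42):
    """Split items so each stratum contributes test_ratio of its members to the test set."""
    rng = random.Random(seed)   # fixed seed makes the split reproducible
    by_stratum = defaultdict(list)
    for item, stratum in zip(items, strata):
        by_stratum[stratum].append(item)
    train, test = [], []
    for members in by_stratum.values():
        rng.shuffle(members)
        n_test = round(len(members) * test_ratio)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

# Hypothetical data: districts 0-79 are income category 1, districts 80-99 are category 2
districts = list(range(100))
income_cat = [1] * 80 + [2] * 20

train, test = stratified_split(districts, income_cat)
print(len(train), len(test), sum(1 for d in test if d >= 80))
```

With purely uniform sampling, the rare category could be over- or under-represented in a small test set; stratifying removes that source of sampling bias.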


Model Selection and Training

In This Session

• Model Selection and Training


• For regression problem
• For classification problem

• Multi-class classification
In this segment
Model Selection and Training
• Select based on training data
• If prediction label/output is available, use regression or classification model
• Regression if real valued output
• Classification if output is discrete (binary/integer)
• else, unsupervised model is used.
• For the house price prediction problem, use regression model since median house prices are
available along with training data (predictors)
• Example: Linear Regression

• Better model is necessary for improving the prediction accuracy


Regression Model Selection and Training
• Decision Tree Based Regression produces low error on training data

• Cross-validation error is not satisfactory


• Better accuracy can be obtained using Random Forest based regressor
Classification Model
MNIST Dataset Classification
• A set of 70,000 small images of handwritten digits
• Each image 28x28 pixel with intensity 0 (black) – 255 (white)
• Input data represented as 70000 x 784 matrix
• Each image is labeled with the digit it represents.
Classification Model Training
Detect a ‘5’
• Segment the dataset into 60,000 training images and 10,000 test images

• Shuffle the training dataset

• Train a classification model for detecting ‘5’. Target output for a training data instance
corresponding to an image of ‘5’ is +1, else target output is 0

• Perform cross validation as in the regression problem and try out multiple classification models to
achieve acceptable performance.
Multiclass Classification

• Multiclass classifiers (aka multinomial classifiers) can distinguish between more than two classes.
• Some algorithms (such as Random Forest classifiers or naive Bayes classifiers) are capable of
handling multiple classes directly.
• Many (such as Support Vector Machine classifiers or Linear classifiers) are strictly binary
• One-versus-all (OvA) or One-versus-rest strategy using multiple binary classifiers.
• e.g., for MNIST classification, train 10 binary classifiers, one for each digit (a 0-detector, a 1-detector, a
2-detector, and so on).
• get the decision score from each classifier for that image and select the class whose classifier outputs
the highest score.
• One-versus-one (OvO) strategy
• train a binary classifier for every pair of digits: one to distinguish 0s and 1s, another to distinguish 0s and
2s, another for 1s and 2s, and so on.
• If there are N classes, you need to train N × (N – 1) / 2 classifiers.
• Run an image through all 45 classifiers and see which class wins the most duels.
• Main advantage of OvO is each classifier only needs to be trained on the part of the training set
for the two classes that it must distinguish
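The two strategies differ only in how binary outputs are combined. The sketch below counts the OvO classifiers for the 10 digit classes and shows the two decision rules; the duel winners and decision scores are made-up values, not outputs of trained classifiers.

```python
from collections import Counter
from itertools import combinations

classes = list(range(10))                  # digits 0-9
pairs = list(combinations(classes, 2))     # one OvO classifier per pair of digits
n_classifiers = len(pairs)                 # N * (N - 1) / 2

def ovo_predict(duel_winners):
    """One-versus-one: the class that wins the most pairwise duels."""
    return Counter(duel_winners).most_common(1)[0][0]

def ova_predict(scores):
    """One-versus-all: the class whose detector outputs the highest score."""
    return max(scores, key=scores.get)

# Hypothetical outcomes for one image: digit 5 wins every duel it takes part in
winners = [5 if 5 in pair else min(pair) for pair in pairs]
# Hypothetical decision scores from the ten detectors
scores = {digit: -1.0 for digit in classes}
scores[5] = 2.3

print(n_classifiers, ovo_predict(winners), ova_predict(scores))
```

OvO trains many more classifiers than OvA, but each one sees only the training instances of its two classes.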
Model Evaluation

Model Evaluation

• Metrics for Performance Evaluation


• How to evaluate the performance of a model?

• Methods for Performance Evaluation


Metrics for Performance Evaluation
Focus on the predictive capability of a model
• Confusion Matrix

                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL   Class=Yes    a           b
CLASS    Class=No     c           d

a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)
Metrics for Performance Evaluation…

                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL   Class=Yes    a (TP)      b (FN)
CLASS    Class=No     c (FP)      d (TN)

• Most widely-used metric:

\text{Accuracy} = \frac{a + d}{a + b + c + d} = \frac{TP + TN}{TP + TN + FP + FN}
Limitation of Accuracy

• Consider a 2-class problem


• Number of Class 0 examples = 9990
• Number of Class 1 examples = 10

• If model predicts everything to be class 0, accuracy is 9990/10000 = 99.9 %


• Accuracy is misleading because model does not detect any class 1 example
Measures for Imbalanced Classes

\text{Precision } (p) = \frac{a}{a + c}

\text{Recall } (r) = \frac{a}{a + b}

\text{F-measure } (F) = \frac{2rp}{r + p} = \frac{2a}{2a + b + c}

\text{Weighted Accuracy} = \frac{w_1 a + w_4 d}{w_1 a + w_2 b + w_3 c + w_4 d}
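These measures can be computed directly from the four confusion-matrix counts. The sketch below applies them to the imbalanced example (9990 class 0 examples, 10 class 1 examples, class 1 taken as positive) and to a second, hypothetical model; returning 0 for an undefined ratio is a convention chosen here, not something the slides specify.

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall and F-measure from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # convention: 0 when undefined
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    return accuracy, precision, recall, f_measure

# Model that predicts everything to be class 0
always_zero = classification_metrics(tp=0, fn=10, fp=0, tn=9990)

# Hypothetical model that detects 8 of the 10 class 1 examples with 20 false alarms
detector = classification_metrics(tp=8, fn=2, fp=20, tn=9970)

print(always_zero)
print(detector)
```

The first model has the higher accuracy but zero recall, while the second has slightly lower accuracy and finds most of the rare class.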
Methods for Performance Evaluation
How to obtain a reliable estimate of performance?
• Performance of a model may depend on other factors besides the learning algorithm:
• Class distribution
• Cost of misclassification
• Size of training and test sets
Learning Curve

• Learning curve shows how accuracy changes with varying sample size
• Requires a sampling schedule for creating learning curve:
• Arithmetic sampling (Langley, et al)
• Geometric sampling (Provost et al)
• Effect of small sample size:
• Bias in the estimate
• Variance of estimate
Methods of Estimation

• Holdout
• Reserve 2/3 for training and 1/3 for testing
• Random subsampling
• Repeated holdout
• Cross validation
• Partition data into k disjoint subsets
• k-fold: train on k-1 partitions, test on the remaining one
• Leave-one-out: k=n
• Stratified sampling
• Bootstrap
• Sampling with replacement
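The partitioning behind k-fold cross validation and the bootstrap can be sketched with index lists. The function names and the seed are illustrative assumptions; model training and scoring are left out.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists for k disjoint folds of n instances."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k disjoint subsets
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def bootstrap_sample(n, seed=0):
    """Draw n indices with replacement; some instances repeat, others are left out."""
    rng = random.Random(seed)
    return [rng.randrange(n) for _ in range(n)]

splits = list(k_fold_indices(10, 5))            # 5-fold: train on 8, test on 2
leave_one_out = list(k_fold_indices(4, 4))      # k = n: each test fold has one instance
sample = bootstrap_sample(10)

for train, test in splits:
    print(sorted(train), sorted(test))
```

Each instance is used for testing exactly once across the k folds, so the averaged test score uses all the data without ever testing on an instance the model was trained on.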
ROC (Receiver Operating Characteristic)

• Developed in 1950s for signal detection theory to analyze noisy signals


• Characterize the trade-off between positive hits and false alarms
• ROC curve plots TP rate (on the y-axis) against FP rate (on the x-axis)
• Performance of each classifier represented as a point on the ROC curve
• changing the threshold of algorithm, sample distribution or cost matrix changes the location of the point
ROC Curve

• 1-dimensional data set containing 2 classes (positive and negative)


• Any point located at x > t is classified as positive

At threshold t:
TP=0.5, FN=0.5, FP=0.12, TN=0.88
ROC Curve

(TP,FP):
• (0,0): declare everything to be negative class
• (1,1): declare everything to be positive class
• (1,0): ideal

• Diagonal line:
• Random guessing
• Below diagonal line:
• prediction is opposite of the true class
Using ROC for Model Comparison

• No model consistently outperforms the other
• M1 is better for small FPR
• M2 is better for large FPR

• Area Under the ROC curve
• Ideal: Area = 1
• Random guess: Area = 0.5
How to Construct an ROC curve

• Use classifier that produces posterior probability for each test instance P(+|A)
• Sort the instances according to P(+|A) in decreasing order
• Apply threshold at each unique value of P(+|A)
• Count the number of TP, FP, TN, FN at each threshold
• TP rate, TPR = TP/(TP+FN)
• FP rate, FPR = FP/(FP + TN)

Instance   P(+|A)   True Class
1          0.95     +
2          0.93     +
3          0.87     -
4          0.85     -
5          0.85     -
6          0.85     +
7          0.76     -
8          0.53     +
9          0.43     -
10         0.25     +
How to construct an ROC curve

ROC Curve:

Class            +     -     +     -     -     -     +     -     +     +
Threshold >=   0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
TP               5     4     4     3     3     3     3     2     2     1     0
FP               5     5     4     4     3     2     1     1     0     0     0
TN               0     0     1     1     2     3     4     4     5     5     5
FN               0     1     1     2     2     2     2     3     3     4     5
TPR              1    0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2    0
FPR              1     1    0.8   0.8   0.6   0.4   0.2   0.2    0     0     0
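The construction can be sketched in Python using the ten scored instances. This version applies one threshold per unique score, so the three instances tied at 0.85 produce a single point, whereas the table lists them as separate columns. The trapezoidal area calculation is an added illustration.

```python
# Posterior probabilities P(+|A) and true classes of the ten test instances
scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = ["+", "+", "-", "-", "-", "+", "-", "+", "-", "+"]

def roc_points(scores, labels):
    """Return (FPR, TPR) points, one per unique threshold, from strictest to loosest."""
    n_pos = labels.count("+")
    n_neg = labels.count("-")
    points = [(0.0, 0.0)]   # threshold above every score: nothing predicted positive
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l == "+")
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and l == "-")
        points.append((fp / n_neg, tp / n_pos))
    return points

def area_under_curve(points):
    """Trapezoidal area under the piecewise-linear ROC curve."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

points = roc_points(scores, labels)
auc = area_under_curve(points)

for fpr, tpr in points:
    print(fpr, tpr)
print(round(auc, 2))
```

An area only slightly above 0.5 indicates that this classifier ranks positives ahead of negatives little better than random guessing would.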


Hyperparameter optimization

Dr. Sugata Ghosal


sugata.ghosal@pilani.bits-pilani.ac.in
Hyperparameters
Machine Learning systems use many parameters internally

• Gradient Descent
• e.g., Learning rate, how long to run
• Mini-batch
• Batch size
• Regularization constant
• Many Others
• will be discussed in upcoming sessions
Hyperparameter Optimization

• Also called metaparameter optimization


• Also called tuning

• How to find best values of hyperparameters?


Tuning By Hand

• Just fiddle with the parameters until you get the results you want

• Probably the most common type of hyperparameter optimization

• Upsides: the results are generally pretty good…

• Downsides: lots of effort, and no theoretical guarantees


Grid Search

• Define some grid of parameters you want to try


• Try all the parameter values in the grid
• By running the whole system for each setting of parameters
• Then choose the setting with the best result
• Essentially a brute force method

• Downsides:
• As the number of parameters increases, the cost of grid search increases
exponentially!
• Need some way to choose the grid properly
• Sometimes this can be as hard as the original hyperparameter
optimization
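Grid search can be sketched with `itertools.product`. The `validation_error` function is a hypothetical stand-in for training the whole system and measuring its validation error; the parameter names and grid values are assumptions.

```python
from itertools import product

def validation_error(learning_rate, batch_size):
    """Hypothetical stand-in for 'run the whole system and measure the error'."""
    return (learning_rate - 0.1) ** 2 + ((batch_size - 64) / 100) ** 2

# Assumed grid: 4 x 4 = 16 settings; each extra parameter multiplies this count
grid = {
    "learning_rate": [0.001, 0.01, 0.1, 1.0],
    "batch_size": [16, 32, 64, 128],
}

def grid_search(grid, objective):
    """Try every combination in the grid and keep the one with the lowest error."""
    names = list(grid)
    best_params, best_error = None, float("inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        error = objective(**params)
        if error < best_error:
            best_params, best_error = params, error
    return best_params, best_error

best_params, best_error = grid_search(grid, validation_error)
n_settings = len(list(product(*grid.values())))

print(n_settings, best_params, best_error)
```

With d parameters and m values each, the loop runs m^d times, which is the exponential cost noted above.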
Making Grid Search Fast

• Early stopping to the rescue


• Can run all the grid points for one epoch, then discard the half that performed worse, then run for another
epoch, discard half, and continue.

• Can take advantage of parallelism


• Run all the different parameter settings independently on different servers in a cluster.
• An embarrassingly parallel task.
• Downside: doesn’t reduce the energy cost.
One Variant: Random Search

• This is just grid search, but with randomly chosen points instead of points on a grid.
• In scikit-learn: RandomizedSearchCV

• This mitigates the curse of dimensionality


• Don’t need to increase the number of grid points exponentially as the number of dimensions
increases.

• Problem: with random search, not necessarily going to get anywhere near the optimal parameters in
a finite sample.
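A plain-Python sketch of random search over two hyperparameters. The objective is a hypothetical stand-in for a full training run, and the sampling ranges (log-uniform learning rate, four batch sizes) are assumptions for illustration.

```python
import random

def validation_error(learning_rate, batch_size):
    """Hypothetical stand-in for training a model and measuring validation error."""
    return (learning_rate - 0.1) ** 2 + ((batch_size - 64) / 100) ** 2

def random_search(objective, n_trials, seed=0):
    """Evaluate n_trials randomly drawn settings and keep the best one."""
    rng = random.Random(seed)
    best_params, best_error = None, float("inf")
    for _ in range(n_trials):
        params = {
            "learning_rate": 10 ** rng.uniform(-3, 0),      # log-uniform in [0.001, 1]
            "batch_size": rng.choice([16, 32, 64, 128]),
        }
        error = objective(**params)
        if error < best_error:
            best_params, best_error = params, error
    return best_params, best_error

few_params, few_error = random_search(validation_error, n_trials=5)
many_params, many_error = random_search(validation_error, n_trials=50)

print(few_params, few_error)
print(many_params, many_error)
```

The number of trials is chosen independently of the number of hyperparameters, which is the practical advantage over a grid. The best setting found still depends on the random draws.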
Cross-Validation

• Set aside part of the available data as a validation dataset that we don’t use for training.

• Then use that set to evaluate the hyperparameters.

• Typically, multiple rounds of cross-validation are performed using different partitions


• Can get a very good sense of how good the hyperparameters are
• But at a significant computational cost!
Machine Learning Pipeline

What is MLOps?
Apply DevOps principles to ML systems
• An engineering culture and practice that aims at unifying ML system development (Dev) and ML
system operation (Ops).

• Automation and monitoring at all steps of ML system construction, including integration, testing,
releasing, deployment and infrastructure management.

• Data scientists can implement and train an ML model with predictive performance on an offline
validation (holdout) dataset, given relevant training data for their use case.

• However, the real challenge is building an integrated ML system and to continuously operate it in
production.
Ecosystem of ML System Components
A small fraction of a real-world ML system is composed of the ML code
DevOps Vs. MLOps

• DevOps for developing and operating large-scale software systems provides benefits such as
• shortening the development cycles
• increasing deployment velocity, and
• dependable releases.
• Two key concepts
• Continuous Integration (CI)
• Continuous Delivery (CD)
• An ML system is a software system, so similar practices apply to reliably build and operate at
scale.
• However, ML systems differ from other software systems
• Team skills: focus on exploratory data analysis, model development, and experimentation.
• Development: ML is experimental in nature.
• The challenge is tracking what worked and what did not, maintaining reproducibility, and maximizing code
reusability.
• Testing: Additional testing needed for data validation, trained model quality evaluation, and model
validation.
DevOps Vs. MLOps

• Deployment: a multi-step pipeline to automatically retrain and deploy model.


• adds complexity
• Automation needed before deployment by data scientists to train and validate new models.
• Production: ML models can have reduced performance due to constantly evolving data
profiles.
• Need to track summary statistics of data and
• monitor the online performance of model to send notifications or roll back for suboptimal values
• ML and other software systems are similar in CI of source control, unit / integration testing, and
CD of the software module / package.
• However, in ML,
• CI is also about testing and validating data, data schemas, and models.
• CD is a system (an ML training pipeline) that automatically deploys another service (model prediction
service).
• Continuous training (CT) is a new property, unique to ML systems, that is concerned with
automatically retraining the model in production and serving the models.
Manual ML Steps
• Manual, script-driven, and interactive process.
• Disconnection between ML and operations, possibly leading to training-serving skew
• Infrequent release iterations. No CI, CD, active performance monitoring
• Deploy trained Model as a prediction service
• Deployment process is concerned only with deploying the trained model as a prediction service,
e.g., a microservice with a REST API
Data and Model Validation

• Data validation: Required prior to model training to decide whether to retrain the model or stop
the execution of the pipeline based on following
• Data values skews: significant changes in the statistical properties of data, triggering retraining
• Data schema skews: downstream pipeline steps, including data processing and model training, receives
data that doesn't comply with the expected schema.
• stop the pipeline to release a fix or an update to the pipeline to handle these changes in the schema.
• Schema skews include receiving unexpected features or with unexpected values, not receiving all the expected
features

• Model validation: Required after retraining the model with the new data. Evaluate and validate
the model before promoting to production. This offline model validation step consists of
• Producing evaluation metric using the trained model on test data to assess the model quality.
• Comparing the evaluation metrics of production model, baseline model, or other business-requirement
models.
• Ensuring the consistency of model performance on various data segments
• Test model for deployment, including infrastructure compatibility and API consistency
• Undergo online model validation—in a canary deployment or an A/B testing setup
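The data validation checks can be sketched as two small functions: one for schema skews and one for a simple data value skew. The schema, feature names, tolerance and the mean-shift rule are all assumptions for illustration; production pipelines typically use dedicated validation tooling.

```python
import statistics

# Assumed expected schema: feature name -> expected Python type
EXPECTED_SCHEMA = {"median_income": float, "total_rooms": int, "ocean_proximity": str}

def schema_skews(record, schema=EXPECTED_SCHEMA):
    """List missing features, unexpected features and values of an unexpected type."""
    problems = []
    for name, expected_type in schema.items():
        if name not in record:
            problems.append(f"missing feature: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"unexpected type for {name}: {type(record[name]).__name__}")
    for name in record:
        if name not in schema:
            problems.append(f"unexpected feature: {name}")
    return problems

def mean_shift(train_values, new_values, tolerance=0.25):
    """Flag a data value skew when the mean moves by more than tolerance (relative)."""
    old_mean = statistics.mean(train_values)
    new_mean = statistics.mean(new_values)
    if old_mean == 0:
        return new_mean != 0
    return abs(new_mean - old_mean) / abs(old_mean) > tolerance

good = {"median_income": 3.2, "total_rooms": 880, "ocean_proximity": "INLAND"}
bad = {"median_income": 3.2, "ocean_proximity": "INLAND", "zip": "x"}

print(schema_skews(good))
print(schema_skews(bad))
print(mean_shift([1, 2, 3], [4, 5, 6]))
```

In a pipeline, a non-empty list of schema problems would stop the run, while a detected value skew would trigger retraining.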
Frameworks
Cloud vendors are providing MLOps frameworks
• https://cloud.google.com/solutions/machine-learning/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning
• Kubeflow and Cloud Build

• Amazon AWS MLOps

• Microsoft Azure MLOps


Backup
MLOps Level 1

• Perform continuous training (CT) by automating the ML pipeline

• Achieves continuous delivery of model prediction service.

• Automated data and model validation steps to the pipeline

• Needs pipeline triggers and metadata management.
MLOps Level 2: CI/CD pipeline automation
Stages of CI/CD Automation Pipeline

1) Development and experimentation: iteratively try new ML algorithms and modeling. The
output is the source code of the ML pipeline steps that are then pushed to a source
repository.
2) Pipeline continuous integration: build source code and run various tests. The outputs of
this stage are pipeline components (packages, executables, and artifacts).
3) Pipeline continuous delivery: deploy artifacts produced by the CI stage to the target
environment.
4) Automated training: automatically executed in production based on a schedule or trigger.
The output is a trained model pushed to the model registry.
5) Model continuous delivery: serve the trained model as a prediction service for the
predictions.
6) Monitoring: collect statistics on the model performance based on live data. The output is a
trigger to execute the pipeline or to execute a new experiment cycle.
Stages of the CI/CD automated ML pipeline
Continuous Integration

• Pipeline and its components are built, tested, and packaged when
• new code is committed or
• pushed to the source code repository.

• Besides building packages, container images, and executables, CI process can include
• Unit testing feature engineering logic.
• Unit testing the different methods implemented in your model.
• For example, you have a function that accepts a categorical data column, and you test that it encodes the
column as a one-hot feature.
• Testing for training convergence
• Testing for NaN values due to dividing by zero or manipulating small or large values.
• Testing that each component in the pipeline produces the expected artifacts.
• Testing integration between pipeline components.
Continuous Delivery

• Continuously delivers new pipeline implementations to the target environment
• which in turn delivers prediction services of the newly trained model.
• For rapid and reliable continuous delivery of pipelines and models, consider
• Verifying the compatibility of the model with the target infrastructure
• e.g., required packages are installed in the serving environment
• Availability of memory, compute, and accelerator resources.
• Testing the prediction service by calling the service API for the updated model
• Testing prediction service performance, such as throughput, latency.
• Validating the data either for retraining or batch prediction.
• Verifying that models meet the predictive performance targets prior to deployment.
• Automated deployment to a test environment, triggered by new code to the development branch.
• Semi-automated deployment to a pre-production environment, triggered by code merging
• Manual deployment to production from pre-production.
Thank You!
Applied Machine Learning

Linear Regression

BITS Pilani Anita Ramachandran


Pilani | Dubai | Goa | Hyderabad
2023

Example – Polynomial Curve Fitting

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Polynomial Curve Fitting



Polynomial Curve Fitting

• The values of the coefficients will be determined by fitting the polynomial to the training data
Error Function



Different Hypotheses

Bias-Variance
• As the order of the polynomial increases, small changes in the dataset
cause a greater change in the fitted polynomials; thus variance increases.
• But a complex model on the average allows a better fit to the underlying
function; thus bias decreases
• To decrease bias, the model should be flexible, at the risk of having high
variance. If the variance is kept low, we may not be able to make a good
fit to data and have high bias.
• The optimal model is the one that has the best trade-off between the bias
and the variance.
• If there is bias, this indicates that our model class does not contain the
solution; this is underfitting.
• If there is variance, the model class is too general and also
learns the noise; this is overfitting.



Modelling: Bias-Variance Trade-off
• Errors due to bias: due to underfitting
• Errors due to variance: due to overfitting
• Increasing bias decreases variance
• Increasing variance decreases bias
• With a small training set when the training
instances differ a little bit, we expect the
simpler model to change less than a
complex model: A simple model is thus
said to have less variance
• On the other hand, a too simple model
assumes more, is more rigid, and may fail
if indeed the underlying class is not that
simple: A simpler model has more bias
• Finding the optimal model corresponds to
minimizing both the bias and the variance
Modelling: Underfitting/overfitting



Regularization
• Used to control the over-fitting phenomenon
• Adds a penalty term to the error function in order to discourage the coefficients from
reaching large values

• Modified error function:


where $\|\mathbf{w}\|^2 = w_0^2 + w_1^2 + w_2^2 + \dots + w_M^2$ and λ governs the relative importance of the regularization term compared with the sum-of-squares error term

• In other words, augmented error function E’ = error on data + λ . model complexity



Regularization – Example 1



Regularization – Example 2

[Figure: four fits of increasing complexity, labelled Line, Parabola, M=3 and M=5]
Regularization
M=5, with different levels of regularization


Linear Regression

What is Regression
• Prediction of numerical variables. E.g., predicting value of real estate,
demand forecast in retail, weather forecast etc
• Dependent variable (Y) – the one whose value is to be predicted
• Y is presumed to be functionally related to one (say X) or more
independent variables or predictors
• Regression – finding a relationship/association between Y and X such
that Y = f(X)
• Approaches
• Ordinary Least Squares function fitting given {<x1,y1>…<xn,yn>}
• Gradient Descent
• Bayesian: Choose some parameterized form for P(Y|X; θ) (θ is the vector of
parameters). Derive learning algorithm as MCLE or MAP estimate for θ

Linear Regression

Algorithm:
• Choose a random line
• Epochs = 1000
• Learning rate = 0.01
• Loop (repeat for the number of epochs):
  • Pick a random point
  • Move the line towards the point:
    • Add learning rate x vertical distance x horizontal distance to the slope
    • Add learning rate x vertical distance to the y-intercept

[Figure: four regions around the line, marking whether the slope and the intercept increase (Inc) or decrease (Dec) depending on where the picked point lies]
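The loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "vertical distance" is taken as the signed gap between the point and the line, the "horizontal distance" as the point's x-coordinate, and the five data points (lying on y = 2x + 1) as well as the name `square_trick_fit` are hypothetical.

```python
import random

def square_trick_fit(xs, ys, learning_rate=0.01, epochs=1000, seed=0):
    """Move a random starting line towards one randomly picked point per epoch."""
    rng = random.Random(seed)
    slope = rng.random()       # "choose a random line"
    intercept = rng.random()
    for _ in range(epochs):
        i = rng.randrange(len(xs))               # pick a random point
        x, y = xs[i], ys[i]
        vertical = y - (slope * x + intercept)   # signed vertical distance (assumption)
        slope += learning_rate * vertical * x    # x plays the role of the horizontal distance
        intercept += learning_rate * vertical
    return slope, intercept

# Hypothetical noise-free data on the line y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
slope, intercept = square_trick_fit(xs, ys, epochs=5000)
print(slope, intercept)
```

With a small learning rate each step nudges the line only slightly, which is why many epochs are needed before the slope and intercept settle.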
Linear Regression

Measuring how good or bad a line is:
• Absolute error
• Square error
• How to minimize errors
Linear Regression - OLS
Sum of Square Error

Ordinary Least Squares method to find the best line y = mx + b, with slope (m) and intercept (b)

y = mx + b → y = 4.80x + 9.15
Linear Regression

Linear Regression with Single Feature
• Features: X to predict y
• n: number of input features

Size   Price
2104   460
1416   232
1534   315
852    178
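As a rough illustration of ordinary least squares with a single feature, the sketch below fits a line to the four Size/Price rows above using the standard closed-form expressions for slope and intercept. The function name `ols_fit` is hypothetical, and four points are of course far too few for a meaningful model.

```python
def ols_fit(xs, ys):
    """Closed-form least-squares slope m and intercept b for y = m*x + b."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    m = sxy / sxx
    b = y_mean - m * x_mean
    return m, b

sizes = [2104, 1416, 1534, 852]
prices = [460, 232, 315, 178]
m, b = ols_fit(sizes, prices)

# Sum of squared errors of the fitted line on the same four rows
sse = sum((p - (m * s + b)) ** 2 for s, p in zip(sizes, prices))
print(m, b, sse)
```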
• For a fixed value of θ, h(x) is a function of x
• J(θ) is a function of the parameters θ
• For different values of θ, we get different J(θ)
• For different hypotheses, we get different costs

Intuition Behind Cost Function

Each ellipse in the contour plot shows a set of points (θ0, θ1) that take on the same value for J(θ0, θ1)
Gradient Descent

Intuitions:
• What happens if the learning rate is small vs. if it is big?
• What happens if θ is already at a local minimum? What will another step of gradient descent do?
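One way to explore these two questions is a toy experiment. The sketch below runs gradient descent on the hypothetical cost J(θ) = θ², whose gradient is 2θ; the cost, the learning rates and the function name are assumptions chosen only to make the behaviour easy to see.

```python
def gradient_descent_1d(theta, learning_rate, steps):
    """Minimise J(theta) = theta**2, whose gradient is 2 * theta."""
    history = [theta]
    for _ in range(steps):
        theta = theta - learning_rate * 2 * theta
        history.append(theta)
    return history

small = gradient_descent_1d(theta=1.0, learning_rate=0.01, steps=20)   # slow progress
large = gradient_descent_1d(theta=1.0, learning_rate=1.1, steps=20)    # overshoots and diverges
at_minimum = gradient_descent_1d(theta=0.0, learning_rate=0.1, steps=5)  # gradient is zero

print(small[-1], large[-1], at_minimum[-1])
```

In this toy setting a small rate moves θ only slightly per step, a rate that is too large makes |θ| grow from step to step, and a start at the minimum never moves because the gradient there is zero.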


Gradient Descent



Gradient Descent for Linear Regression

Linear Regression

Linear Regression with Multiple Features

• Features: X1, X2, X3, …, Xn to predict y
• n: number of input features
• $x^{(i)}$: input features of the ith training example
• $x^{(i)}_j$: value of feature j in the ith training example

Size   Rooms   Floors   Age   Price
2104   5       1        45    460
1416   3       2        40    232
1534   3       2        30    315
852    2       1        36    178

• Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$
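A minimal sketch of batch gradient descent for a linear hypothesis with several features follows. It assumes the usual squared-error cost averaged over m examples; the tiny data set (generated from y = 1 + 2·x1 + 3·x2), the learning rate and the function names are hypothetical. Real features such as Size and Age would normally be rescaled first.

```python
def predict(theta, x):
    """h_theta(x) = theta_0 + theta_1*x_1 + ... + theta_n*x_n."""
    return theta[0] + sum(t * xj for t, xj in zip(theta[1:], x))

def gradient_descent(X, y, alpha=0.1, iterations=2000):
    m, n = len(X), len(X[0])
    theta = [0.0] * (n + 1)                      # theta[0] is the intercept
    for _ in range(iterations):
        errors = [predict(theta, xi) - yi for xi, yi in zip(X, y)]
        grad = [sum(errors) / m]                 # derivative w.r.t. theta_0
        for j in range(n):
            grad.append(sum(e * xi[j] for e, xi in zip(errors, X)) / m)
        # Simultaneous update of all parameters
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

# Hypothetical noise-free data: y = 1 + 2*x1 + 3*x2
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [1 + 2 * a + 3 * b for a, b in X]
theta = gradient_descent(X, y)
print(theta)
```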
Regularization


Regularization is a technique used for tuning the function by adding an additional penalty term in the error function

For M=9:
Question: How does a large value of λ result in simpler hypotheses?
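A small numeric experiment can hint at the answer. The sketch below assumes a model with a single weight w and no intercept, and a penalised cost of the form 0.5·Σ(w·x − y)² + 0.5·λ·w²; under that assumption the minimiser has the closed form w = Σxy / (Σx² + λ). The data and the function name are hypothetical.

```python
def ridge_slope(xs, ys, lam):
    """Minimiser of 0.5*sum((w*x - y)**2) + 0.5*lam*w**2 for a single weight w."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

# Hypothetical data on the line y = 2x
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

slopes = [ridge_slope(xs, ys, lam) for lam in (0.0, 1.0, 10.0, 100.0)]
print(slopes)
```

As λ grows, the denominator grows and the weight is pulled towards zero, so the fitted function becomes flatter and simpler.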

Self study (for the reader)
Summary

• What is regression
• Types of regression
• Approaches for model fitting
• Ordinary Least Squares approach
• Bayesian linear regression
• Linear Regression
• Adjusting the slope and intercept of a straight line hypothesis
• Cost function
• Parameter vs cost plot - bowl shaped
• Contour plots
• Minimizing cost
• SSE
• Gradient descent
• Multi variate linear regression
• Gradient descent for multivariate linear regression
• Generalized linear models - Polynomial, Gaussian, Sigmoidal basis functions
• Regularized Linear Regression
• Gradient descent + regularization
• Self study - vectorization (closed form solution)
Additional References
• Andrew Ng’s videos on Linear Regression (Coursera)
• Luis Serrano video on Linear Regression: https://www.youtube.com/watch?v=wYPUhge9w5c
• https://medium.com/quick-code/maximum-likelihood-estimation-for-regression-65f9c99f815d
• https://towardsdatascience.com/linear-regression-simplified-ordinary-least-square-vs-gradient-descent-48145de2cf76
• Lab capsules on Linear Regression
• https://towardsdatascience.com/introduction-to-bayesian-linear-regression-e66e60791ea7
• http://bjlkeng.github.io/posts/a-probabilistic-view-of-regression/
• https://towardsdatascience.com/my-journey-into-machine-learning-class-3-c8139736f550
• https://towardsdatascience.com/understanding-the-mathematics-behind-gradient-descent-dde5dc9be06e
Classification Models I
Introduction

Dr. Sugata Ghosal


CSIS Off-Campus Faculty
BITS Pilani
Topics

• What is Classification?
• Linear Classifier
• Generative and Discriminative Classifiers
Classification
Definition
• Given a collection of records (training set)
• Each record is characterized by a tuple (x, y), where x is the attribute (feature) set and y is the class label
• x aka attribute, predictor, independent variable, input
• y aka class, response, dependent variable, output

• Task
• Learn a model or function that maps each attribute set x into one of the predefined class labels y

Task                          Attribute set, x                                           Class label, y
Categorizing email messages   Features extracted from email message header and content   spam or non-spam
Identifying tumor cells       Features extracted from x-rays or MRI scans                malignant or benign cells
Cataloging galaxies           Features extracted from telescope images                   Elliptical, spiral, or irregular-shaped galaxies
General Approach for Building Classification Model

Training Set
Tid   Attrib1   Attrib2   Attrib3   Class
1     Yes       Large     125K      No
2     No        Medium    100K      No
3     No        Small     70K       No
4     Yes       Medium    120K      No
5     No        Large     95K       Yes
6     No        Medium    60K       No
7     Yes       Large     220K      No
8     No        Small     85K       Yes
9     No        Medium    75K       No
10    No        Small     90K       Yes

Induction: the learning algorithm learns a model from the training set.

Test Set
Tid   Attrib1   Attrib2   Attrib3   Class
11    No        Small     55K       ?
12    Yes       Medium    80K       ?
13    Yes       Large     110K      ?
14    No        Small     95K       ?
15    No        Large     67K       ?

Deduction: the learned model is applied to the test set.
Types of Classifiers
Linear Classifier
• Classes are separated by a linear decision surface (e.g., a straight line in a 2-dimensional feature/attribute space)
• If, for a given record, the linear combination of its features $x_i$ satisfies $\sum_i w_i x_i \ge 0$, it belongs to one class (say, y = 1); else it belongs to the other class (say, y = 0 or -1)
• The weights $w_i$ are learned during the training (induction) phase of the classifier.
• The learnt $w_i$ are applied to a test record during the deduction / inferencing phase.
• In nonlinear classification, classes are separated by a non-linear surface

[Figure: decision boundary separating the y = 1 and y = 0 regions in the (x1, x2) plane]
Generative Vs. Discriminative Models
Generative Model
• Class-conditional probability distribution of the attribute/feature set and prior probability of classes are learnt during the training phase
• Given these learnt probabilities, during the inferencing phase, the probabilities of a test record belonging to the different classes are calculated and compared.
• Can result in a linear or nonlinear decision surface

Discriminative Model
• Given a training set, a function f is learnt that directly maps an attribute/feature vector x to the output class (y = 1 or 0/-1)
• A linear function f results in a linear decision surface

[Figure: generative and discriminative panels in the (x1, x2) plane; the discriminative panel shows a decision boundary between y = 1 and y = 0]
Classification Model I
Naïve Bayes Classifier

Dr. Sugata Ghosal


CSIS Off-Campus Faculty
BITS Pilani
Topics

• What is Bayes Classifier?


• Bayes Theorem and its application in classification
• Conditional Independence and Naïve Bayes Classifier
• Applications of Naïve Bayes Classification
• Text classification
• Image classification

8
Bayes Classifier
A generative framework for solving classification problems
• Conditional Probability:

$$P(Y \mid X) = \frac{P(X, Y)}{P(X)}, \qquad P(X \mid Y) = \frac{P(X, Y)}{P(Y)}$$

• Bayes theorem:

$$P(Y \mid X) = \frac{P(X \mid Y)\, P(Y)}{P(X)}$$
Bayes Theorem - Example
Bayesian Learning - Example
• Probability function (e.g., P(f = true))
• Joint probability (multiple features) (e.g., P(m = true, h = true))
• Conditional probability (e.g., P(m = true | h = true))
• Joint probability distribution - multi-dimensional matrix in which each cell lists the probability of a combination of feature values being assigned

ID   Headache   Fever   Vomiting   Meningitis
1    True       True    False      False
2    False      True    False      False
3    True       False   True       False
4    True       False   True       False
5    False      True    False      True
6    True       False   True       False
7    True       False   True       False
8    True       False   True       True
9    False      True    False      False
10   True       False   True       True

Generalized Bayes Theorem

P(t=l | q[1], q[2], ..., q[m]) = P(q[1], q[2], ..., q[m] | t=l) P(t=l) / P(q[1], q[2], ..., q[m])

P(t=l): Prior probability of the target feature taking on the value l
P(q[1], q[2], ..., q[m]): The joint probability of the descriptive features of a query instance taking a specific set of values
P(q[1], q[2], ..., q[m] | t=l): The conditional probability of the descriptive features of a query instance taking a specific set of values, given that the target feature takes the level l
Bayesian Learning - Example

Using the same ten-patient table (ID, Headache, Fever, Vomiting, Meningitis):

• Example query instance: h = true, f = false, v = true
• P(m | h, -f, v) = P(h, -f, v | m) x P(m) / P(h, -f, v)
• P(m) = 0.3
• P(h, -f, v) = 0.6
• P(h, -f, v | m) = P(h | m) x P(-f | h, m) x P(v | -f, h, m) [Chain rule]
• P(h, -f, v | m) = 2/3 x 2/2 x 2/2 = 0.6666
• Hence P(m | h, -f, v) = 0.6666 x 0.3 / 0.6 = 0.33

Similarly, P(-m | h, -f, v) = 0.6666

=> It is twice as probable that the patient does not have meningitis as it is that the patient does. This is counterintuitive -> Paradox of the false positive
=> Reason: Likelihood is weighted by the prior. In this case, likelihood is high, but prior is low. So even when the evidence is taken into account, the posterior probability of having meningitis remains low
Bayesian Learning - Example

Using the same ten-patient table (ID, Headache, Fever, Vomiting, Meningitis):

• Another example query instance: h = true, f = true, v = false
• P(m | h, f, -v) = P(h, f, -v | m) x P(m) / P(h, f, -v)
  • = P(h | m) x P(f | h, m) x P(-v | f, h, m) x P(m) / P(h, f, -v)
  • = 0.6666 x 0 x 0 x 0.3 / 0.1 = 0
• And P(-m | h, f, -v) = P(h | -m) x P(f | h, -m) x P(-v | f, h, -m) x P(-m) / P(h, f, -v)
  • = 0.7143 x 0.2 x 1 x 0.7 / 0.1 = 1.0

It is a certainty that the patient does not have meningitis!

=> Reason: We progress along the sequence of conditional probabilities specified by the chain rule. In doing so, the set of events that fulfil the conditions for each conditional probability in the sequence gets smaller and smaller as more conditions get added.
=> Data fragmentation
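The two meningitis queries can be checked by counting rows directly. The sketch below is only an illustration: it applies Bayes' theorem with likelihoods estimated from the full joint counts in the ten-patient table, and the function and variable names are hypothetical.

```python
from fractions import Fraction

# Rows of the table: (headache, fever, vomiting, meningitis)
DATA = [
    (True, True, False, False),
    (False, True, False, False),
    (True, False, True, False),
    (True, False, True, False),
    (False, True, False, True),
    (True, False, True, False),
    (True, False, True, False),
    (True, False, True, True),
    (False, True, False, False),
    (True, False, True, True),
]

def posterior(query, target):
    """P(meningitis = target | (h, f, v) = query) via Bayes' theorem on raw counts."""
    class_rows = [r for r in DATA if r[3] == target]
    prior = Fraction(len(class_rows), len(DATA))
    likelihood = Fraction(sum(r[:3] == query for r in class_rows), len(class_rows))
    evidence = Fraction(sum(r[:3] == query for r in DATA), len(DATA))
    return likelihood * prior / evidence

print(posterior((True, False, True), True))    # first query, meningitis
print(posterior((True, True, False), False))   # second query, no meningitis
```

Because only one row matches the second query, the estimate collapses to certainty, which is the data fragmentation problem in miniature.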
Using Bayes Theorem for Classification
Consider each attribute and class label as random variables

• Given a record with attributes (X1, X2, …, Xd)
• Goal is to predict class Y (“Evade”) given (X1, X2, X3) = (<Refund>, <Marital Status>, <Taxable Income>)
• Specifically, we want to find the value of Y that maximizes P(Y | X1, X2, …, Xd)
• Can we estimate P(Y | X1, X2, …, Xd) directly from data?

Tid   Refund   Marital Status   Taxable Income   Evade
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes
Using Bayes Theorem for Classification
Approach
• Compute posterior probability P(Y | X1, X2, …, Xd) using the Bayes theorem

$$P(Y \mid X_1 X_2 \dots X_d) = \frac{P(X_1 X_2 \dots X_d \mid Y)\, P(Y)}{P(X_1 X_2 \dots X_d)}$$

• Maximum a-posteriori: Choose Y that maximizes P(Y | X1, X2, …, Xd)
• Equivalent to choosing the value of Y that maximizes P(X1, X2, …, Xd | Y) P(Y), because the denominator does not depend on Y
• How to estimate P(X1, X2, …, Xd | Y)?


Example Data
Given a Test Record X = (Refund = No, Divorced, Income = 120K)
• Can we estimate P(Evade = Yes | X) and P(Evade = No | X) from the ten tax records (Tid 1 to 10)?
Naïve Bayes Classifier
Assume conditional independence

• Among attributes Xi when class is given:
  • P(X1, X2, …, Xd | Yj) = P(X1 | Yj) P(X2 | Yj) … P(Xd | Yj)
• Now we can estimate P(Xi | Yj) for all Xi and Yj combinations from the training data
• New point is classified to Yj if $P(Y_j) \prod_{i=1}^{d} P(X_i \mid Y_j)$ is maximal.


Naïve Bayes on Example Data
Given a Test Record X = (Refund = No, Divorced, Income = 120K)

P(X | Yes) = P(Refund = No | Yes) x P(Divorced | Yes) x P(Income = 120K | Yes)

P(X | No) = P(Refund = No | No) x P(Divorced | No) x P(Income = 120K | No)
Estimate Probabilities from Data
Discrete Attributes
• P(y) = fraction of instances of class y
  • e.g., P(No) = 7/10, P(Yes) = 3/10
• For categorical attributes, P(Xi = c | y) = nc / n
  • nc is the number of instances having attribute value Xi = c and belonging to class y
  • n is the number of instances belonging to class y
  • e.g., P(Status = Married | y = No) = 4/7, P(Refund = Yes | y = Yes) = 0
Estimate Probabilities from Data
Continuous Attributes

• Discretization
• Partition the range into bins
• Replace continuous value with bin value
• Attribute changed from continuous to ordinal
• Probability density estimation
• Assume attribute follows a normal distribution
• Use data to estimate parameters of distribution
(e.g., mean and standard deviation)
• Once probability distribution is known, use it to estimate the conditional probability
P(Xi|Y)
Estimate Probabilities from Data
Normal Distribution
• One for each (Xi, Yj) pair:

$$P(X_i \mid Y_j) = \frac{1}{\sqrt{2\pi\sigma_{ij}^2}} \exp\left(-\frac{(X_i - \mu_{ij})^2}{2\sigma_{ij}^2}\right)$$

• For (Income, Class = No), using the seven tax records with Evade = No:
  • sample mean = 110
  • sample variance = 2975

$$P(\text{Income} = 120 \mid \text{No}) = \frac{1}{\sqrt{2\pi}\,(54.54)} \exp\left(-\frac{(120 - 110)^2}{2(2975)}\right) = 0.0072$$
Example of Naïve Bayes Classifier
Given a Test Record X = (Refund = No, Divorced, Income = 120K)

Estimated probabilities:
P(Refund = Yes | No) = 3/7
P(Refund = No | No) = 4/7
P(Refund = Yes | Yes) = 0
P(Refund = No | Yes) = 1
P(Marital Status = Single | No) = 2/7
P(Marital Status = Divorced | No) = 1/7
P(Marital Status = Married | No) = 4/7
P(Marital Status = Single | Yes) = 2/3
P(Marital Status = Divorced | Yes) = 1/3
P(Marital Status = Married | Yes) = 0
For Taxable Income:
  If class = No: sample mean = 110, sample variance = 2975
  If class = Yes: sample mean = 90, sample variance = 25

• P(X | No) = P(Refund = No | No) × P(Divorced | No) × P(Income = 120K | No) = 4/7 × 1/7 × 0.0072 = 0.0006
• P(X | Yes) = P(Refund = No | Yes) × P(Divorced | Yes) × P(Income = 120K | Yes) = 1 × 1/3 × 1.2 × 10^-9 = 4 × 10^-10
• P(X | No) P(No) > P(X | Yes) P(Yes)
• Therefore, P(No | X) > P(Yes | X) => Class = No
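The worked example can be reproduced from the raw records. The sketch below is one possible reading of the procedure: categorical attributes use simple fractions, income uses a normal density with the sample mean and sample variance, and incomes are written in thousands. All names are hypothetical.

```python
import math
from statistics import mean, variance

# (refund, marital_status, income_in_thousands, evade)
RECORDS = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]

def gaussian(x, mu, var):
    """Normal density with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def score(refund, status, income, label):
    """P(label) * P(refund | label) * P(status | label) * P(income | label)."""
    rows = [r for r in RECORDS if r[3] == label]
    n = len(rows)
    p_refund = sum(r[0] == refund for r in rows) / n
    p_status = sum(r[1] == status for r in rows) / n
    incomes = [r[2] for r in rows]
    p_income = gaussian(income, mean(incomes), variance(incomes))  # sample variance
    return (n / len(RECORDS)) * p_refund * p_status * p_income

scores = {label: score("No", "Divorced", 120, label) for label in ("No", "Yes")}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```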
Issues with Naïve Bayes Classifier
Consider the table with Tid = 7 deleted

Tid   Refund   Marital Status   Taxable Income   Evade
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes

P(Refund = Yes | No) = 2/6
P(Refund = No | No) = 4/6
P(Refund = Yes | Yes) = 0
P(Refund = No | Yes) = 1
P(Marital Status = Single | No) = 2/6
P(Marital Status = Divorced | No) = 0
P(Marital Status = Married | No) = 4/6
P(Marital Status = Single | Yes) = 2/3
P(Marital Status = Divorced | Yes) = 1/3
P(Marital Status = Married | Yes) = 0/3
For Taxable Income:
  If class = No: sample mean = 91, sample variance = 685
  If class = Yes: sample mean = 90, sample variance = 25

Given X = (Refund = Yes, Divorced, 120K)
P(X | No) = 2/6 x 0 x 0.0083 = 0
P(X | Yes) = 0 x 1/3 x 1.2 x 10^-9 = 0

Naïve Bayes will not be able to classify X as Yes or No!
Issues with Naïve Bayes Classifier
If one of the conditional probabilities is zero, then the entire expression becomes zero
• Need to use other estimates of conditional probabilities than simple fractions
• Probability estimation:

Original: $P(X_i = c \mid y) = \frac{n_c}{n}$

Laplace estimate: $P(X_i = c \mid y) = \frac{n_c + 1}{n + v}$

n: number of training instances belonging to class y
n_c: number of instances with Xi = c and Y = y
v: total number of attribute values that Xi can take
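To see the effect of the Laplace estimate, the sketch below applies both formulas to two zero-count cases from the table with Tid = 7 deleted. The function names are hypothetical; the counts are read off that table (Marital Status has 3 possible values, Refund has 2).

```python
from fractions import Fraction

def simple_estimate(n_c, n):
    """Original estimate: n_c / n."""
    return Fraction(n_c, n)

def laplace_estimate(n_c, n, v):
    """Laplace estimate: (n_c + 1) / (n + v)."""
    return Fraction(n_c + 1, n + v)

# P(Marital Status = Divorced | No): 0 of the 6 "No" records are Divorced
divorced_given_no = laplace_estimate(n_c=0, n=6, v=3)

# P(Refund = Yes | Yes): 0 of the 3 "Yes" records have Refund = Yes
refund_given_yes = laplace_estimate(n_c=0, n=3, v=2)

print(simple_estimate(0, 6), divorced_given_no, refund_given_yes)
```

Neither smoothed probability is zero, so the product over attributes no longer collapses to zero for both classes.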
Classification Models I
Naïve Bayes Applications

Dr. Sugata Ghosal


CSIS Off-Campus Faculty
BITS Pilani
Naïve Bayes Classifier Applications

Baseline: Bag of Words Approach

Word       Count
aardvark   0
about      2
all        2
Africa     1
apple      0
anxious    0
...
gas        1
...
oil        1
...
Zaire      0
Text Classification: A Simple Example

Text                              Tag
“A great game”                    Sports
“The election was over”           Not sports
“Very clean match”                Sports
“A clean but forgettable game”    Sports
“It was a close election”         Not sports

• Which tag does the sentence “A very close game” belong to? i.e., compute P(sports | A very close game)
• Feature Engineering: Bag of words, i.e., use word frequencies without considering order
• Using Bayes Theorem:
  P(sports | A very close game) = P(A very close game | sports) P(sports) / P(A very close game)
• We assume that every word in a sentence is independent of the other ones
• “close” doesn’t appear in sentences of sports tag, so P(close | sports) = 0, which makes the product 0
Laplace smoothing

• Laplace smoothing: we add 1 (or, in general, a constant k) to every count so it’s never zero.
• To balance this, we add the number of possible words to the divisor, so the resulting fraction will never be greater than 1
• In our case, the 14 possible words are
{a,great,very,over,it,but,game,election,clean,close,the,was,forgettable,match }
Apply Laplace Smoothing

Word    P(word | Sports)     P(word | Not Sports)
a       (2+1) / (11+14)      (1+1) / (9+14)
very    (1+1) / (11+14)      (0+1) / (9+14)
close   (0+1) / (11+14)      (1+1) / (9+14)
game    (2+1) / (11+14)      (0+1) / (9+14)
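The smoothed table can be turned into a full prediction for “A very close game”. The sketch below is a minimal bag-of-words Naïve Bayes with add-k smoothing built from the five training sentences; lower-casing the text and the function names are assumptions made for the illustration.

```python
from collections import Counter
from fractions import Fraction

TRAIN = [
    ("a great game", "Sports"),
    ("the election was over", "Not sports"),
    ("very clean match", "Sports"),
    ("a clean but forgettable game", "Sports"),
    ("it was a close election", "Not sports"),
]

def train(examples):
    """Count words per tag, documents per tag, and collect the vocabulary."""
    word_counts, doc_counts = {}, Counter()
    for text, tag in examples:
        doc_counts[tag] += 1
        word_counts.setdefault(tag, Counter()).update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab

def score(text, tag, word_counts, doc_counts, vocab, k=1):
    """P(tag) times the product of smoothed P(word | tag) over the words in text."""
    total_words = sum(word_counts[tag].values())
    result = Fraction(doc_counts[tag], sum(doc_counts.values()))
    for word in text.split():
        result *= Fraction(word_counts[tag][word] + k, total_words + k * len(vocab))
    return result

word_counts, doc_counts, vocab = train(TRAIN)
sports = score("a very close game", "Sports", word_counts, doc_counts, vocab)
not_sports = score("a very close game", "Not sports", word_counts, doc_counts, vocab)
print(float(sports), float(not_sports))
```

The common denominator P(A very close game) is skipped because it does not change which tag has the larger score.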
Experiment with NewsGroups

• Given 1000 training documents from each group, learn to classify new documents according to which newsgroup they came from
• Naive Bayes: 89% classification accuracy

Newsgroups:
comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x
misc.forsale, rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey
alt.atheism, soc.religion.christian, talk.religion.misc, talk.politics.mideast, talk.politics.misc, talk.politics.guns
sci.space, sci.crypt, sci.electronics, sci.med
Learning Curve for Newsgroups
Accuracy vs. Training set size
• 1/3 withheld for test
Practice – Naïve Bayes

• We want to predict the outcome of a football match on the basis of past performance data of the playing teams
• Training data available; 4 parameters
• Using the Naïve Bayes classifier, we need to:
  • Classify the conditions when this team wins
  • Predict the probability of this team winning a particular match when Weather condition = Rainy, they won 1 of the last 3 matches, Humidity = normal and they won the toss in the particular match
Solution

• Construct a frequency table


• Identify the cumulative probability for “Won match = Yes” and
probability for “Won match = No” on the basis of all attributes
• Calculate probability through normalization by applying the formula
• P(yes) = P(yes) / (P(yes) + P(no))
• P(no) = P(no) / (P(yes) + P(no))
Solution

• (a1): rainy
• (a2): 2 wins in the last 3 matches
• (a3): humidity = normal
• (a4): win toss = true
Solution

• P(win match | a1 ∩ a2 ∩ a3 ∩ a4)
  • = P(a1 | win match) P(a2 | win match) P(a3 | win match) P(a4 | win match) P(win match) / (P(a1) P(a2) P(a3) P(a4))
  • Numerator = 2/9 * 4/9 * 6/9 * 3/9 * 9/14 = 0.01411
• P(!win match | a1 ∩ a2 ∩ a3 ∩ a4)
  • = P(a1 | !win match) P(a2 | !win match) P(a3 | !win match) P(a4 | !win match) P(!win match) / (P(a1) P(a2) P(a3) P(a4))
  • Numerator = 3/5 * 2/5 * 1/5 * 3/5 * 5/14 = 0.01029
• The denominator is the same for both classes, so it cancels when the two numerators are normalized:
  • P(win match) = 0.01411 / (0.01411 + 0.01029) = 0.578
  • P(!win match) = 0.01029 / (0.01411 + 0.01029) = 0.422
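The normalization step can be checked with exact fractions. The sketch below simply multiplies the conditional probabilities and priors quoted in the solution and then normalizes; the helper name `unnormalised` is hypothetical.

```python
from fractions import Fraction

def unnormalised(likelihoods, prior):
    """Prior times the product of the per-attribute conditional probabilities."""
    result = prior
    for p in likelihoods:
        result *= p
    return result

win = unnormalised(
    [Fraction(2, 9), Fraction(4, 9), Fraction(6, 9), Fraction(3, 9)], Fraction(9, 14)
)
lose = unnormalised(
    [Fraction(3, 5), Fraction(2, 5), Fraction(1, 5), Fraction(3, 5)], Fraction(5, 14)
)

# The shared denominator P(a1)P(a2)P(a3)P(a4) cancels in the normalisation
p_win = win / (win + lose)
p_lose = lose / (win + lose)
print(float(p_win), float(p_lose))
```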


Classification Models I
Logistic Regression

Dr. Sugata Ghosal


CSIS Off-Campus Faculty
BITS Pilani
Topics

• What is Logistic Regression?


• Sigmoid function
• Log loss Cost Function
• Optimization Using Gradient Descent
• Regularization
• Multi-class classification

Logistic regression

• Logistic Regression could help us predict, for example, whether the student passed or
failed. Logistic regression predictions are discrete (only specific values or categories are
allowed). We can also view probability scores underlying the model’s classifications.

• In comparison, Linear Regression could help us predict the student’s test score on a scale
of 0 - 100. Linear regression predictions are continuous (numbers in a range).

• Idea

• Naïve Bayes allows computing P(Y|X) by learning P(Y) and P(X|Y)


• Why not learn P(Y|X) directly?
Sigmoid/Logistic Function
Classification requires discrete output values
• For example, output y = 0 or 1 for a two-category classification problem
• In logistic regression, the sigmoid/logistic function $h_\theta(x)$ takes a real vector x as input and outputs a value between 0 and 1
• θ is a vector of logistic regression parameters

$$h_\theta(x) = g(\theta^\top x) = \frac{1}{1 + e^{-\theta^\top x}}$$

[Figure: plot of g(z) against z]
Logistic regression

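[Figure: plot of g(z) against z, with $z = \theta^\top x$]

Suppose we predict “y = 1” if $h_\theta(x) \ge 0.5$, i.e., $\theta^\top x \ge 0$,
and predict “y = 0” if $h_\theta(x) < 0.5$, i.e., $\theta^\top x < 0$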
Decision boundary
At decision boundary output of logistic regressor is 0.5
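The link between the sigmoid output and the decision boundary can be sketched as follows. The parameter vector is hypothetical (the slides do not give one), and each input is assumed to carry a leading 1 so that θ0 acts as the intercept.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x):
    """Return 1 when h_theta(x) >= 0.5, which is the same as theta^T x >= 0."""
    z = sum(t * v for t, v in zip(theta, x))
    return 1 if sigmoid(z) >= 0.5 else 0

# Hypothetical parameters: decision boundary x1 + x2 = 3
theta = [-3.0, 1.0, 1.0]

print(sigmoid(0.0))                      # exactly on the boundary
print(predict(theta, [1.0, 1.0, 1.0]))   # below the boundary
print(predict(theta, [1.0, 2.0, 2.0]))   # above the boundary
```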

• e.g.,

Decision boundary
Age

Tumor Size

• Predict “” if
Learning Model Parameters

• Training set:

• m examples
• n features

• How to choose parameters (feature weights)?


MSE Cost Function
MSE cost function is non convex for logistic regression due to sigmoidal function

[Figure: a “non-convex” cost curve for logistic regression with MSE, next to a “convex” cost curve]

• Gradient descent may not find the global minimum.
• So instead of Mean Squared Error, use an error/cost function called Cross-Entropy, also known as Log Loss.
Cost function for Logistic Regression
Cross Entropy or Log Loss Function
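$$\text{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$$

[Figure: cost plotted against $h_\theta(x)$ on the interval from 0 to 1, one panel for y = 1 and one for y = 0]

• J(θ) is convex
• Apply gradient descent on J(θ) w.r.t. θ to find optimal parameters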
Gradient descent

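Goal: $\min_\theta J(\theta)$
Repeat {
    $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$
}
(Simultaneously update all $\theta_j$; α is the learning rate)

$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$$

Slide credit: Andrew Ng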
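Putting the log loss and the gradient together, a minimal training loop might look like the sketch below. The one-feature data set, the learning rate and the iteration count are hypothetical, and J(θ) is taken to be the average log loss over the m training examples.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) with theta[0] as the intercept."""
    return sigmoid(theta[0] + sum(t * v for t, v in zip(theta[1:], x)))

def log_loss(theta, X, y):
    """Average cross-entropy cost J(theta)."""
    total = 0.0
    for x, label in zip(X, y):
        h = hypothesis(theta, x)
        total += -math.log(h) if label == 1 else -math.log(1.0 - h)
    return total / len(X)

def train(X, y, alpha=0.5, iterations=2000):
    m, n = len(X), len(X[0])
    theta = [0.0] * (n + 1)
    for _ in range(iterations):
        errors = [hypothesis(theta, x) - label for x, label in zip(X, y)]
        grad = [sum(errors) / m]
        grad += [sum(e * x[j] for e, x in zip(errors, X)) / m for j in range(n)]
        theta = [t - alpha * g for t, g in zip(theta, grad)]  # simultaneous update
    return theta

# Hypothetical data: one feature, label 1 for the larger values
X = [[0.5], [1.0], [1.5], [2.0], [3.0], [3.5], [4.0], [4.5]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
theta = train(X, y)
print(theta, log_loss(theta, X, y))
```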
Multi-class classification

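[Figure: binary classification vs. multiclass classification in the (x1, x2) plane]
One-vs-all (one-vs-rest)

[Figure: the three-class problem split into three binary problems (Class 1, Class 2 and Class 3, each against the rest), with classifiers $h_\theta^{(1)}(x)$, $h_\theta^{(2)}(x)$ and $h_\theta^{(3)}(x)$]

$$h_\theta^{(i)}(x) = P(y = i \mid x; \theta) \quad (i = 1, 2, 3)$$

Slide credit: Andrew Ng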
One-vs-all

• Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class i to predict the probability that y = i

• Given a new input x, pick the class i that maximizes $h_\theta^{(i)}(x)$
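The prediction step of one-vs-all can be sketched as below. The three parameter vectors are hypothetical stand-ins for classifiers that would normally be trained one class at a time, and each input is assumed to start with a 1 for the intercept.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def one_vs_all_predict(thetas, x):
    """Evaluate every per-class classifier and return the most probable class."""
    probs = {
        label: sigmoid(sum(t * v for t, v in zip(theta, x)))
        for label, theta in thetas.items()
    }
    return max(probs, key=probs.get), probs

# Hypothetical parameters for classes 1, 2 and 3
thetas = {
    1: [0.0, 2.0, -1.0],
    2: [0.0, -1.0, 2.0],
    3: [1.0, -1.5, -1.5],
}

label_a, _ = one_vs_all_predict(thetas, [1.0, 3.0, 0.0])
label_b, _ = one_vs_all_predict(thetas, [1.0, 0.0, 3.0])
label_c, _ = one_vs_all_predict(thetas, [1.0, 0.0, 0.0])
print(label_a, label_b, label_c)
```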


Logistic Regression Applications

• Credit Card Fraud : Predicting if a given credit card transaction is fraudulent or not
• Health : Predicting if a given mass of tissue is benign or malignant
• Marketing : Predicting if a given user will buy an insurance product or not
• Banking : Predicting if a customer will default on a loan.
Classification Models I
Support Vector Machines

Dr. Sugata Ghosal


CSIS Off-Campus Faculty
BITS Pilani
Topics

• Introduction
• Support Vectors
• Linear Support Vector Machine
• Maximizing Margin
• Handling non linearly separable data
• Non-linear Classification
• Kernel Functions

Support Vector Machines
Find a linear hyperplane (decision boundary) that will separate the data
• One possible solution: B1
• Another possible solution: B2
• Other possible solutions exist
• Which one is better? B1 or B2?
• How do you define better?
• Find the hyperplane that maximizes the margin
  • => B1 is better than B2

[Figure: hyperplanes B1 and B2 with their margin boundaries (b11, b12) and (b21, b22); the margin of B1 is wider]
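Support Vector Machines
Decision boundary B1: $\mathbf{w} \cdot \mathbf{x} + b = 0$
The margin boundaries b11 and b12 are the hyperplanes $\mathbf{w} \cdot \mathbf{x} + b = \pm 1$

$$f(\mathbf{x}) = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{x} + b \ge 1 \\ -1 & \text{if } \mathbf{w} \cdot \mathbf{x} + b \le -1 \end{cases} \qquad \text{Margin} = \frac{2}{\|\mathbf{w}\|}$$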
Linear SVM
Linear Model

$$f(\mathbf{x}) = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{x} + b \ge 1 \\ -1 & \text{if } \mathbf{w} \cdot \mathbf{x} + b \le -1 \end{cases}$$

• Learning the model is equivalent to determining the values of w and b
• How to find w and b from training data?
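Learning Linear SVM
• Objective is to maximize:

$$\text{Margin} = \frac{2}{\|\mathbf{w}\|}$$

• Which is equivalent to minimizing:

$$L(\mathbf{w}) = \frac{\|\mathbf{w}\|^2}{2}$$

• Subject to the following constraints:

$$y_i = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{x}_i + b \ge 1 \\ -1 & \text{if } \mathbf{w} \cdot \mathbf{x}_i + b \le -1 \end{cases}$$

or, equivalently, $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1,\ i = 1, 2, \dots, N$
• This is a constrained optimization problem
• Solve it using the Lagrange multiplier method:

$$L(\mathbf{w}, b, \lambda) = \frac{\|\mathbf{w}\|^2}{2} - \sum_{i} \lambda_i \left[ y_i(\mathbf{w}^\top \mathbf{x}_i + b) - 1 \right]$$

• Lagrange multipliers $\lambda_i$ are zero or positive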
Learning Linear SVM

• λi is zero for every training point except the support vectors; only the support vectors have non-zero λi
Example of Linear SVM

x1      x2      y    λ
0.3858  0.4687   1   65.5261   (support vector)
0.4871  0.611   -1   65.5261   (support vector)
0.9218  0.4103  -1   0
0.7382  0.8936  -1   0
0.1763  0.0579   1   0
0.4057  0.3529   1   0
0.9355  0.8132  -1   0
0.2146  0.0099   1   0
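As a rough check of this example, w and b can be recovered from the two support vectors using the standard results w = Σ λi yi xi and yi (w · xi + b) = 1 on the margin. The sketch below uses only the numbers in the table; averaging b over the support vectors is a choice made here to smooth out rounding in the tabulated values, not something stated on the slide.

```python
# Rows of the table: (x1, x2, y, lagrange_multiplier)
rows = [
    (0.3858, 0.4687, 1, 65.5261),
    (0.4871, 0.611, -1, 65.5261),
    (0.9218, 0.4103, -1, 0.0),
    (0.7382, 0.8936, -1, 0.0),
    (0.1763, 0.0579, 1, 0.0),
    (0.4057, 0.3529, 1, 0.0),
    (0.9355, 0.8132, -1, 0.0),
    (0.2146, 0.0099, 1, 0.0),
]

# w = sum_i lambda_i * y_i * x_i (only support vectors contribute)
w1 = sum(lam * y * x1 for x1, x2, y, lam in rows)
w2 = sum(lam * y * x2 for x1, x2, y, lam in rows)

# For a support vector, y * (w . x + b) = 1, so b = y - w . x (since y is +1 or -1)
support = [r for r in rows if r[3] > 0]
b = sum(y - (w1 * x1 + w2 * x2) for x1, x2, y, lam in support) / len(support)

# Classify every training point with the sign of w . x + b
predictions = [1 if w1 * x1 + w2 * x2 + b >= 0 else -1 for x1, x2, y, lam in rows]
print(round(w1, 2), round(w2, 2), round(b, 2))
```

All eight training points end up on the correct side of the resulting boundary, and the six non-support vectors play no part in computing w.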
Learning Linear SVM

• Decision boundary depends only on support vectors


• If you have a data set with the same support vectors, the decision boundary will not change

• How to classify using SVM once w and b are found? Given a test record xi:
$f(\mathbf{x}_i) = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{x}_i + b \ge 1 \\ -1 & \text{if } \mathbf{w} \cdot \mathbf{x}_i + b \le -1 \end{cases}$
Support Vector Machines
What if the problem is not linearly separable?
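• Introduce slack variables ξi
• Need to minimize:
$L(\mathbf{w}) = \dfrac{\|\mathbf{w}\|^2}{2} + C \left( \sum_{i=1}^{N} \xi_i^k \right)$
• subject to:
$y_i = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{x}_i + b \ge 1 - \xi_i \\ -1 & \text{if } \mathbf{w} \cdot \mathbf{x}_i + b \le -1 + \xi_i \end{cases}$

• If k is 1 or 2, this leads to an objective function similar to that of the linear SVM, but with different constraints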
Nonlinear Support Vector Machines
What if decision boundary is not linear?
Nonlinear Support Vector Machines
Transform data into higher dimensional space

Decision boundary:
$\mathbf{w} \cdot \Phi(\mathbf{x}) + b = 0$
Example of Nonlinear SVM

SVM with polynomial degree 2 kernel


Learning Nonlinear SVM

• Advantages of using a kernel:
   • Don’t have to know the mapping function Φ
   • Computing the dot product Φ(xi) · Φ(xj) in the original space avoids the curse of dimensionality

• Not all functions can be kernels
   • Must make sure there is a corresponding Φ in some high-dimensional space
   • Mercer’s theorem (see textbook)
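To make the kernel idea concrete, the sketch below checks numerically that a degree-2 polynomial kernel evaluated in the original 2-D space equals the dot product of an explicit 6-D feature map. The particular map `phi` is one standard choice for the kernel (1 + x · z)², written out here as an assumption for illustration; the slides do not give it.

```python
import math

def poly2_kernel(x, z):
    """Degree-2 polynomial kernel, computed in the original 2-D space."""
    return (1.0 + x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    """Explicit 6-D feature map whose dot product reproduces poly2_kernel."""
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x[0], r2 * x[1], x[0] ** 2, x[1] ** 2, r2 * x[0] * x[1]]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

x, z = [0.5, -1.0], [2.0, 3.0]
k_direct = poly2_kernel(x, z)    # three multiplications in 2-D
k_mapped = dot(phi(x), phi(z))   # same value via the 6-D feature space
print(k_direct, k_mapped)
```

The kernel never builds the 6-D vectors, which is the saving the bullets above refer to; for higher degrees or the Gaussian kernel the explicit space becomes much larger or infinite-dimensional.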
Classification Models I
Comparison and Applicability

Topics

• Pros and Cons


• Naïve Bayes Classifier
• Logistic Regression
• Support Vector Machines
Generative vs. Discriminative Classifiers

• Training classifiers involves estimating f: X -> Y, or P(Y|X)

• Generative classifiers like Naïve Bayes
   • Assume some functional form for the probabilities P(X|Y), P(Y)
   • Estimate parameters of P(X|Y), P(Y) directly from training data
   • Use Bayes rule to calculate P(Y|X = xi)

• Discriminative classifiers like logistic regression and support vector machines
   • Assume some functional form for f: X -> Y
   • Estimate parameters of the mapping function f directly from training data
Naïve Bayes – Advantages

• Algorithm is simple to implement and fast


• If conditional independence holds, it converges more quickly than other methods
• Even in cases where conditional independence doesn’t hold, its results are quite acceptable
• Needs less training data (due to conditional independence assumption)
• Highly scalable, scales linearly with the number of predictors and training points
• Can be used for both binary and multi-class classification problems
• Handles continuous and discrete data
• Not sensitive to irrelevant features
• Less prone to overfitting due to its small model size (compared to other algorithms like Random Forest)
• Handles missing values well
Naïve Bayes – When to use

• When the training data is small


• When the features are conditionally independent (mostly)
• When we have large number of features with minimal data set
• Ex: Text classification
Naïve Bayes versus Logistic Regression

• Naïve Bayes is a generative model

• Logistic Regression is a discriminative model
• As the training set size approaches infinity, logistic regression performs at least as well as, and typically better than, the generative model Naïve Bayes.
Characteristics of SVM

• The learning problem is formulated as a convex optimization problem


• Efficient algorithms are available to find the global minimum
• Many of the other methods use greedy approaches and find locally optimal solutions
• High computational complexity for building the model

• Robust to noise
• Overfitting is handled by maximizing the margin of the decision boundary
• In some sense, the best linear model for classification.

• SVM can handle irrelevant and redundant data better than many other techniques
• The user needs to provide the type of kernel function and cost function
• Difficult to handle missing values
• What about categorical variables?
   • They need to be mapped to some metric space
• Computationally expensive for large training data with many attributes


Backup
Learning Nonlinear SVM
Optimization Problem

• which leads to the same set of equations (but involving Φ(x) instead of x)
Learning Nonlinear SVM
Issues
• What type of mapping function Φ should be used?
• How to do the computation in high-dimensional space?
   • Most computations involve the dot product Φ(xi) · Φ(xj)
   • Curse of dimensionality?
Learning Nonlinear SVM

• Kernel Trick:
   • Φ(xi) · Φ(xj) = K(xi, xj)
   • K(xi, xj) is a kernel function (expressed in terms of the coordinates in the original space)
• Examples:
Support Vector Machines
Dr. Chetana Gavankar, Ph.D,
BITS Pilani IIT Bombay-Monash University Australia
Pilani Campus
Chetana.gavankar@pilani.bits-pilani.ac.in

Updated – Anita Ramachandran, 2024


Agenda
• Linear Classifiers
• Maximum Margin Classification
• Linear SVM
• Soft Margin SVM
• Nonlinear SVM
• Multi-Class Problem

BITS Pilani, Pilani Campus


Linear Classifiers
f(x,w,b) = sign(w x + b)
[Figure: points marked +1 and -1; the line w x + b = 0 separates the region w x + b > 0 from the region w x + b < 0]
How would you classify this data?




Linear Classifiers
f(x,w,b) = sign(w x + b)
[Figure: several candidate separating lines between the +1 and -1 points]
Any of these would be fine, but which is best?


Linear Classifiers
f(x,w,b) = sign(w x + b)
[Figure: a candidate separating line; one point is misclassified into the +1 class]
How would you classify this data?


Classifier Margin

f(x,w,b) = sign(w x + b)
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.


Maximum Margin
f(x,w,b) = sign(w x + b)
1. If the hyperplane is oriented such that it is close to some of the points in your training set, new data may lie on the wrong side of the hyperplane, even if the new points lie close to training examples of the correct class.
2. The solution is to maximize the margin.
The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM (called an LSVM, or Linear SVM).
Support Vectors are those datapoints that the margin pushes up against.
Support Vectors
• The geometric description of SVM is that the max-margin hyperplane is completely determined by those points that lie nearest to it.
• Points that lie on this margin are the support vectors.
• They are the points of the data set which, if removed, would alter the position of the dividing hyperplane.


Line with 2 features: R2



Linear SVM Mathematically





Solving the Optimization Problem
Primal: find w and b such that
$\Phi(\mathbf{w}) = \dfrac{1}{2}\|\mathbf{w}\|^2$ is minimized,
and for all $\{(\mathbf{x}_i, y_i)\}$: $y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1$

• Need to optimize a quadratic function subject to linear inequality constraints.
• All constraints in SVM are linear
• Quadratic optimization problems are a well-known class of mathematical programming problems, and many (rather intricate) algorithms exist for solving them.
• The solution involves constructing a dual problem where a Lagrange multiplier αi is associated with every constraint in the primal problem


Solving the Optimization Problem
• The solution involves constructing a dual problem where a Lagrange multiplier αi is associated with every constraint in the primal problem:

$L(\mathbf{w}, b, \alpha) = \dfrac{1}{2}\|\mathbf{w}\|^2 - \sum_i \alpha_i \left[ y_i(\mathbf{w}^T \mathbf{x}_i + b) - 1 \right]$

• Setting the partial derivative with respect to w to zero:
   $\mathbf{w} - \sum_i \alpha_i y_i \mathbf{x}_i = 0$, so $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$
• Setting the partial derivative with respect to b to zero:
   $-\sum_i \alpha_i y_i = 0$, so $\sum_i \alpha_i y_i = 0$
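The step from these two stationarity conditions to the dual problem can be written out as follows. This is the standard textbook derivation, assuming the Lagrangian carries the factor 1/2 on ||w||² (consistent with w − Σ αi yi xi = 0 above); it is a sketch added for completeness, not a transcription of the slide's figure.

```latex
% Substitute w = \sum_i \alpha_i y_i x_i and \sum_i \alpha_i y_i = 0 into the Lagrangian
\begin{aligned}
L &= \tfrac{1}{2}\,\mathbf{w}^T\mathbf{w}
     - \sum_i \alpha_i y_i\,\mathbf{w}^T\mathbf{x}_i
     - b\sum_i \alpha_i y_i
     + \sum_i \alpha_i \\
  &= \tfrac{1}{2}\sum_i\sum_j \alpha_i\alpha_j y_i y_j\,\mathbf{x}_i^T\mathbf{x}_j
     - \sum_i\sum_j \alpha_i\alpha_j y_i y_j\,\mathbf{x}_i^T\mathbf{x}_j
     - 0
     + \sum_i \alpha_i \\
  &= \sum_i \alpha_i
     - \tfrac{1}{2}\sum_i\sum_j \alpha_i\alpha_j y_i y_j\,\mathbf{x}_i^T\mathbf{x}_j
\end{aligned}

% Dual problem: maximize over \alpha
\max_{\alpha}\; \sum_i \alpha_i
  - \tfrac{1}{2}\sum_i\sum_j \alpha_i\alpha_j y_i y_j\,\mathbf{x}_i^T\mathbf{x}_j
\quad \text{subject to} \quad \alpha_i \ge 0,\;\; \sum_i \alpha_i y_i = 0
```

The training points enter the dual only through the dot products xiᵀxj, which is what later allows a kernel function to be substituted for the dot product.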




Dataset with noise
• Hard Margin: so far we require all data points be classified correctly
   - No training error
• What if the training set is noisy?


Soft Margin Classification
Slack variables ξi can be added to allow misclassification of difficult or noisy examples.

What should our quadratic optimization criterion be?

Minimize
$\dfrac{1}{2}\,\mathbf{w} \cdot \mathbf{w} + C \sum_{k=1}^{R} \xi_k$
where R is the number of training examples.


Slack Variable
• Think of a slack variable as giving the classifier some leniency when it comes to moving around points near the margin.
• When C is large, a given amount of slack adds a larger penalty to the SVM objective function than when C is small.
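The role of C can be illustrated by evaluating the soft-margin objective ½ w · w + C Σ ξk for a fixed hyperplane. In this sketch the slack of each point is taken as the smallest value that satisfies yk (w · xk + b) ≥ 1 − ξk with ξk ≥ 0, that is max(0, 1 − yk (w · xk + b)); the 1-D data set and the choice of w and b are hypothetical.

```python
def soft_margin_objective(w, b, X, y, C):
    """Return (objective, slacks) for 0.5 * w.w + C * sum of slacks."""
    slacks = []
    for xk, yk in zip(X, y):
        margin = yk * (sum(wj * xj for wj, xj in zip(w, xk)) + b)
        # Smallest slack that satisfies margin >= 1 - slack and slack >= 0
        slacks.append(max(0.0, 1.0 - margin))
    regularizer = 0.5 * sum(wj * wj for wj in w)
    return regularizer + C * sum(slacks), slacks

# Hypothetical 1-D toy data; the last point is a noisy +1 lying on the -1 side
X = [[2.0], [3.0], [-2.0], [-3.0], [-0.5]]
y = [1, 1, -1, -1, 1]
w, b = [1.0], 0.0

obj_small_c, slacks = soft_margin_objective(w, b, X, y, C=0.1)
obj_large_c, _ = soft_margin_objective(w, b, X, y, C=10.0)
print(slacks, obj_small_c, obj_large_c)
```

Only the noisy point needs slack here. With C = 0.1 its violation barely changes the objective, whereas with C = 10 it dominates, so an optimizer would be pushed towards a boundary that reduces that violation even at the cost of a narrower margin.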


Soft Margin



Hard Margin vs Soft Margin
• Hard Margin:
Find w and b such that
Φ(w) = wTw is minimized and for all {(xi ,yi)}
yi (wTxi + b) ≥ 1

• Soft Margin incorporating slack variables:
Find w and b such that
Φ(w) = wTw + C Σ ξi is minimized and for all {(xi ,yi)}
yi (wTxi + b) ≥ 1 - ξi and ξi ≥ 0 for all i

• Parameter C can be viewed as a way to control overfitting.

Ref: https://towardsdatascience.com/support-vector-machines-soft-margin-formulation-and-kernel-trick-4c9729dc8efe
Effect of Margin size v/s misclassification cost

• Using a high C value the classifier makes fewer margin violations but ends up with a smaller margin.
• Using a low C value the margin is much larger, but many instances end up on the street. Generalizes better.


Support Vector Machines in overlapping class distributions & Kernels
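Non-linear SVMs
• For datasets that are linearly separable with some noise, the soft margin works out great:
[Figure: 1-D data on the x axis, separable except for a few noisy points]
• But what are we going to do if the dataset is just too hard?
[Figure: 1-D data on the x axis that no single threshold can separate]
• How about… mapping the data to a higher-dimensional space:
[Figure: the same data plotted as x against x², where a line separates the classes]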


The “Kernel Trick”
• The linear classifier relies on the dot product between vectors: K(xi, xj) = xiT xj
• If every data point is mapped into high-dimensional space via some transformation Φ: x → φ(x), the dot product becomes:
K(xi, xj) = φ(xi)T φ(xj)
• A kernel function is some function that corresponds to an inner product in some expanded feature space.


SVM Kernels
• SVM algorithms use a set of mathematical functions that are defined as the kernel.
• The function of the kernel is to take data as input and transform it into the required form.
• Different SVM algorithms use different types of kernel functions, for example linear and nonlinear kernels such as polynomial and sigmoid.


Non-linear SVMs: Feature spaces
• General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable:

Φ: x → φ(x)




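Examples of Kernel Functions
• Linear: $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^T \mathbf{x}_j$

• Polynomial of power p: $K(\mathbf{x}_i, \mathbf{x}_j) = (1 + \mathbf{x}_i^T \mathbf{x}_j)^p$

• Gaussian (radial-basis function network): $K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left( -\dfrac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2} \right)$

• Sigmoid: $K(\mathbf{x}_i, \mathbf{x}_j) = \tanh(\beta_0 \mathbf{x}_i^T \mathbf{x}_j + \beta_1)$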
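The four kernels listed above translate directly into code. The sketch below is a plain-Python illustration; the function names and the default values for p, sigma, beta0 and beta1 are assumptions chosen for the example, not values given in the slides.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def linear_kernel(xi, xj):
    return dot(xi, xj)

def polynomial_kernel(xi, xj, p=2):
    return (1.0 + dot(xi, xj)) ** p

def gaussian_kernel(xi, xj, sigma=1.0):
    # Squared Euclidean distance between the two points
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def sigmoid_kernel(xi, xj, beta0=1.0, beta1=0.0):
    return math.tanh(beta0 * dot(xi, xj) + beta1)

xi, xj = [1.0, 2.0], [3.0, 0.0]
k_lin = linear_kernel(xi, xj)
k_poly = polynomial_kernel(xi, xj, p=2)
k_rbf = gaussian_kernel(xi, xj, sigma=1.0)
k_sig = sigmoid_kernel(xi, xj)
print(k_lin, k_poly, k_rbf, k_sig)
```

The Gaussian kernel depends only on the distance between the points, so it equals 1 for identical inputs and decays towards 0 as they move apart, with sigma controlling how quickly.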


Nonlinear SVM - Overview
• SVM locates a separating hyperplane in the feature space and classifies points in that space
• It does not need to represent the space explicitly; it only needs a kernel function to be defined
• The kernel function plays the role of the dot product in the feature space.


Example



Multi-Class Problem



Good Web References for SVM
• http://www.cs.utexas.edu/users/mooney/cs391L/
• https://www.coursera.org/learn/machine-learning/home/week/7
• https://data-flair.training/blogs/svm-kernel-functions/
• MIT 6.034 Artificial Intelligence, Fall 2010
• https://stats.stackexchange.com/questions/30042/neural-networks-vs-support-vector-machines-are-the-second-definitely-superior
• https://www.sciencedirect.com/science/article/abs/pii/S0893608006002796
• https://medium.com/deep-math-machine-learning-ai/chapter-3-support-vector-machine-with-math-47d6193c82be
• Radial basis kernel
Applied Machine Learning
Revision (Mid-sem)

Anita Ramachandran
Question 1
Solution 1
Question 2
Solution 2
Question 3
Solution 3
Question 4
Vijay is a certified Data Scientist and he has applied to two companies, Google and Microsoft. He feels that he has a 60% chance of receiving an offer from Google and a 50% chance of receiving an offer from Microsoft. If he receives an offer from Microsoft, he believes there is an 80% chance of receiving an offer from Google.
• If Vijay receives an offer from Microsoft, what is the probability that he will not receive an offer from Google?
• What are his chances of getting an offer from Microsoft, considering he has an offer from Google?
Solution 4

Given: P(G) = 0.6, P(M) = 0.5, P(G|M) = 0.8
Ans
• P(not G | M) = 1 - P(G|M) = 1 - 0.8 = 0.2
• P(M|G) = P(G|M) P(M) / P(G) = (0.8 × 0.5) / 0.6 ≈ 0.67
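The two answers follow from the complement rule and Bayes' rule; a short check using only the probabilities stated in the question (the variable names are illustrative):

```python
p_g = 0.60          # P(G): offer from Google
p_m = 0.50          # P(M): offer from Microsoft
p_g_given_m = 0.80  # P(G | M): Google offer given a Microsoft offer

# Complement rule: P(not G | M) = 1 - P(G | M)
p_not_g_given_m = 1.0 - p_g_given_m

# Bayes' rule: P(M | G) = P(G | M) * P(M) / P(G)
p_m_given_g = p_g_given_m * p_m / p_g

print(round(p_not_g_given_m, 2), round(p_m_given_g, 2))
```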
Question 5
People have purchased travel insurance from an insurance company to stay safe while experiencing the world. The dataset below lists the purchase transactions in 2015. List at least 6 issues with this dataset that need to be addressed during data pre-processing.

Name    Age  Date of Purchase  Profession  Date of Birth  Policy Category  Sum Insured
Ram     34   15-Jan-2015       Lawyer      Feb 24, 1981   SUPER            ₹ 1200
Laxman  33   27-Jan-2015       (blank)     Mar 27, 1982   ECO              $ 89
Rama    34   15-Jan-2015       Lawyer      Feb 24, 1981   SUPER            ₹ 1200
Mahesh  32   30-Jan-2015       Engineer    Nov 25,1982    ECO              ₹ 4877
Sita    n/a  22-Jan-2015       Designer    May 22,1990    SUPER            ₹0
Solution 5
• The 1st and 3rd rows are duplicates; one should be removed.
• The ‘n/a’ age should be replaced with the mean, and the blank profession with the mode.
• Age and Date of Birth are redundant columns (age can be derived from date of birth).
• Sum Insured amounts are in different currencies; they should be converted to the same currency.
• Date of Purchase and Date of Birth are in different formats.
• The ₹ 0 Sum Insured should be replaced with the mean value.
Question 6
• Solve the problem below and find the equation of the hyperplane using the linear Support Vector Machine method.
Positive Points: {(3, 2), (4, 3), (2, 3), (3, -1)}
Negative Points: {(1, 0), (1, -1), (0, 2), (-1, 2)}
Solution 6
• Solve the problem below and find the equation of the hyperplane using the linear Support Vector Machine method.
Positive Points: {(3, 2), (4, 3), (2, 3), (3, -1)}
Negative Points: {(1, 0), (1, -1), (0, 2), (-1, 2)}
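The worked solution is not present in the extracted text, so the following is an independent check rather than the slide's own answer. It takes a candidate hyperplane, w = (8/7, 2/7) and b = −15/7 (equivalently 4·x1 + x2 = 7.5), with candidate support vectors (2, 3), (3, −1) and (1, 0), and verifies the optimality (KKT) conditions of the hard-margin SVM with exact fractions. The candidate values and multipliers are assumptions to be verified by the code.

```python
from fractions import Fraction as F

positives = [(3, 2), (4, 3), (2, 3), (3, -1)]
negatives = [(1, 0), (1, -1), (0, 2), (-1, 2)]
data = [(x, 1) for x in positives] + [(x, -1) for x in negatives]
label = dict(data)

# Candidate solution (an assumption checked below, not taken from the slides)
w = (F(8, 7), F(2, 7))
b = F(-15, 7)
alphas = {(2, 3): F(12, 49), (3, -1): F(22, 49), (1, 0): F(34, 49)}

def functional_margin(x, y):
    return y * (w[0] * x[0] + w[1] * x[1] + b)

margins = {x: functional_margin(x, y) for x, y in data}

# KKT conditions for the hard-margin problem
primal_feasible = all(m >= 1 for m in margins.values())        # y_i (w.x_i + b) >= 1
support_on_margin = all(margins[x] == 1 for x in alphas)       # equality where alpha_i > 0
dual_feasible = all(a > 0 for a in alphas.values())            # alpha_i >= 0
w_from_alphas = tuple(
    sum(a * label[x] * x[k] for x, a in alphas.items()) for k in range(2)
)                                                              # w = sum alpha_i y_i x_i
balance = sum(a * label[x] for x, a in alphas.items())         # sum alpha_i y_i = 0

print(primal_feasible, support_on_margin, dual_feasible, w_from_alphas == w, balance == 0)
```

Because the problem is convex, a point that satisfies all of these conditions is the maximum-margin solution, which gives the hyperplane 4·x1 + x2 − 7.5 = 0 with margin width 2/||w|| ≈ 1.70.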
WEBINAR 1: ANALYSIS ON QUALITY OF WHITE WINE
Introduction

● Dataset: The "Wine Quality - White" dataset is a collection of data related to white wines.
● Purpose: This dataset is often used for quality prediction and analysis.
● Importance: Wine quality prediction is essential for winemakers to ensure the production of high-quality wines and meet consumer preferences.


Data Overview

List of attributes:
1. Fixed Acidity
2. Volatile Acidity
3. Citric Acid
4. Residual Sugar
5. Chlorides
6. Free Sulfur Dioxide
7. Total Sulfur Dioxide
8. Density
9. pH
10. Sulphates
11. Alcohol
Target Variable: The target variable in this dataset is typically "Quality," which represents the quality rating of the wine. Quality ratings are often integers between 3 and 9, where higher values indicate better quality.
5-Class problem

• 5 classes, [6, 5, 7, 8, 4], are considered.
• The dataframe is filtered to contain only these 5 classes.
• A Decision Tree Classifier is used.
• The confusion matrix for the 5 classes is obtained.
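How a 5-class confusion matrix is built from true and predicted labels can be sketched in plain Python. The label lists below are hypothetical values made up for illustration; they are not results from the webinar's decision tree.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes, in the order of `labels`."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

labels = [4, 5, 6, 7, 8]
# Hypothetical quality labels, for illustration only
y_true = [5, 6, 6, 7, 5, 8, 4, 6]
y_pred = [5, 6, 5, 7, 5, 7, 5, 6]

cm = confusion_matrix(y_true, y_pred, labels)
# Correct predictions lie on the diagonal
accuracy = sum(cm[i][i] for i in range(len(labels))) / len(y_true)
for label, row in zip(labels, cm):
    print(label, row)
print(accuracy)
```

Off-diagonal entries show which classes are confused with each other, which a single accuracy figure hides; in this toy data the rare classes 4 and 8 are never predicted correctly.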



Confusion Matrix



Histogram with KDE

KDE (kernel density estimation) is often used as a complement to histograms. While histograms bin the data into discrete intervals, KDE provides a continuous representation. If the KDE plot closely aligns with a histogram, it suggests a robust estimation.

sns.histplot(X[col], kde=True)


QQ Plot
Normality / QQ Plot - A normal probability plot, also known as a Q-Q
(quantile-quantile) plot, is a graphical method for assessing the normality
of a distribution. It compares the quantiles of the observed data against
the quantiles of a theoretical normal distribution. In a Q-Q plot, if the
points approximately lie on a straight line, it suggests that the data is
approximately normally distributed.
stats.probplot(filtered_df[col], dist="norm", plot=plt)
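What a Q-Q plot computes can be sketched with the standard library alone. The version below pairs each sorted, standardized observation with a standard normal quantile and summarizes straightness with a correlation coefficient. The plotting positions (i − 0.5)/n are one common convention assumed here and may differ from the positions `stats.probplot` uses; both samples are synthetic.

```python
from statistics import NormalDist, mean, stdev

def qq_points(sample):
    """Pair theoretical standard normal quantiles with standardized sample quantiles."""
    n = len(sample)
    mu, sigma = mean(sample), stdev(sample)
    std_normal = NormalDist()
    points = []
    for i, value in enumerate(sorted(sample), start=1):
        p = (i - 0.5) / n  # assumed plotting position
        points.append((std_normal.inv_cdf(p), (value - mu) / sigma))
    return points

def pearson_r(points):
    """Correlation of the Q-Q points; values near 1 mean they lie close to a line."""
    xs, ys = zip(*points)
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in points)
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Synthetic samples: one built from normal quantiles, one strongly skewed
near_normal = [NormalDist(10, 2).inv_cdf((i - 0.5) / 50) for i in range(1, 51)]
skewed = [x ** 3 for x in range(1, 51)]

r_normal = pearson_r(qq_points(near_normal))
r_skewed = pearson_r(qq_points(skewed))
print(r_normal, r_skewed)
```

The skewed sample bends away from the line in the upper tail, which shows up as a lower correlation than the normal-shaped sample.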



Bubble Plot
A bubble plot is a type of scatter plot where a third dimension of the data is shown
through the size of markers (bubbles) in addition to their x and y positions. It is
useful for visualizing relationships between three continuous variables. In a
bubble plot, the size of each bubble represents the value of a third variable.
sns.scatterplot(data=filtered_df, x='alcohol', y='residual sugar',
size='quality', legend=False, sizes=(20, 200))



Bubble Plot
hue='pH': Colors the bubbles based on the 'pH' column.

sns.scatterplot(data=filtered_df, x='alcohol', y='residual sugar', size='quality',
hue='pH', legend=None, sizes=(20, 200))


Decision Tree-Based Feature Selection

Why Use Decision Trees for Feature Selection?
● Decision trees provide a clear and interpretable way to assess feature importance.
● They rank features based on their contribution to splitting and information gain.
● Decision trees can handle both categorical and numerical features.


Decision Tree-Based Feature Selection

The Decision Tree Process:
● Decision trees are constructed by recursively splitting data based on features.
● At each node, the feature with the highest information gain is selected for splitting.
● The process continues until a stopping criterion is met.
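The split-selection rule described above can be sketched with an entropy-based information gain. The feature values, labels and thresholds below are hypothetical toy data, and `information_gain` is an illustrative helper for a single threshold split, not a library function.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction from splitting on value <= threshold versus value > threshold."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

# Hypothetical toy data: alcohol separates the classes, pH barely does
alcohol = [9.0, 9.5, 10.0, 12.0, 12.5, 13.0]
ph = [3.1, 3.4, 3.0, 3.3, 3.2, 3.5]
quality = ["low", "low", "low", "high", "high", "high"]

gain_alcohol = information_gain(alcohol, quality, threshold=11.0)
gain_ph = information_gain(ph, quality, threshold=3.25)
print(gain_alcohol, gain_ph)
```

A tree built on this data would split on alcohol first, since that split removes all of the label uncertainty while the pH split removes very little.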



Decision Tree-Based Feature Selection

Feature Importance Scores:
● Decision trees assign importance scores to features.
● Features used near the top of the tree are typically more important.
● Importance scores reflect the contribution of each feature to the model's predictive power.
● In scikit-learn, a tree's feature importance score is the total impurity reduction achieved by splits on that feature, weighted by the number of samples reaching each split and normalized so that the scores sum to 1.

feature_importances = dt_classifier.feature_importances_




Genetic Algorithms for Feature Selection

Why Use Genetic Algorithms for Feature Selection?
● GAs are suitable for combinatorial optimization problems like feature selection.
● They explore a wide search space efficiently and find optimal or near-optimal solutions.


Genetic Algorithms for Feature Selection
The Genetic Algorithm Process: Genetic Algorithms mimic the process of evolution:
• Initialization: Create an initial population of potential solutions (chromosomes).
• Selection: Evaluate and select the fittest chromosomes based on a fitness function.
• Crossover (Recombination): Combine pairs of selected chromosomes to create offspring.
• Mutation: Introduce small random changes to the offspring.
• Replacement: Replace the old population with the new population of offspring.
• Termination: Repeat the process for a specified number of generations or until convergence.
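The six steps above can be sketched with the standard library only. This is a simplified illustration, not the webinar's implementation: the fitness function is a hypothetical stand-in that rewards a known set of "useful" features (a real fitness would score a classifier on the selected columns), and the tournament size, probabilities and seed are arbitrary choices.

```python
import random

def fitness(individual, useful):
    """Hypothetical fitness: reward selected useful features, penalize extra ones."""
    hits = sum(1 for bit, u in zip(individual, useful) if bit and u)
    extras = sum(1 for bit, u in zip(individual, useful) if bit and not u)
    return hits - 0.5 * extras

def tournament(population, scores, k, rng):
    """Selection: return the fittest of k randomly drawn individuals."""
    contenders = rng.sample(range(len(population)), k)
    return population[max(contenders, key=lambda i: scores[i])]

def run_ga(useful, pop_size=20, generations=30, cx_prob=0.8, mut_prob=0.05, seed=0):
    rng = random.Random(seed)
    n = len(useful)
    # Initialization: random bit strings, one bit per feature
    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    best = max(population, key=lambda ind: fitness(ind, useful))
    for _ in range(generations):  # Termination: fixed number of generations
        scores = [fitness(ind, useful) for ind in population]
        offspring = []
        while len(offspring) < pop_size:
            p1 = tournament(population, scores, 3, rng)
            p2 = tournament(population, scores, 3, rng)
            # Crossover: one-point recombination of the two parents
            if rng.random() < cx_prob:
                cut = rng.randint(1, n - 1)
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Mutation: flip each bit with a small probability
            child = [1 - bit if rng.random() < mut_prob else bit for bit in child]
            offspring.append(child)
        population = offspring  # Replacement
        gen_best = max(population, key=lambda ind: fitness(ind, useful))
        if fitness(gen_best, useful) > fitness(best, useful):
            best = gen_best
    return best

useful = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]  # hypothetical ground truth
best = run_ga(useful)
print(best, fitness(best, useful))
```

Each individual is a bit mask over the features, so the returned `best` reads directly as "keep feature i if bit i is 1". Fixing the seed makes the run repeatable.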



Genetic Algorithms for Feature Selection

Applying GAs to Feature Selection:
• In feature selection, each chromosome represents a subset of features.
• The fitness function evaluates subsets based on their contribution to model performance.
• GAs iteratively evolve better subsets over generations.


Genetic Algorithms for Feature Selection

Benefits of Genetic Algorithm-Based Feature Selection:
• GAs handle a large feature space efficiently.
• They can uncover complex feature interactions.
• GAs are adaptable to different machine learning algorithms.


Genetic Algorithms for Feature Selection
• The evaluate function is designed for feature selection using a binary individual representing feature inclusion or exclusion.
• It creates a boolean mask based on the binary array. The boolean mask selects the relevant features from the original feature matrix (X). This step effectively filters out the non-selected features, creating a subset of features based on the individual's gene representation.
• It then initializes a Random Forest classifier and evaluates its performance using cross-validation with 5 folds.
• The fitness value returned is the mean accuracy of the classifier across the folds, serving as the objective for optimization algorithms seeking to find an optimal set of features for classification.
• The GA parameters are defined.


Genetic Algorithms for Feature Selection
creator.create("FitnessMax", base.Fitness, weights=(1.0,)): creates a class named FitnessMax that represents the optimization goal; the positive weight of 1.0 means the fitness is to be maximized.

creator.create("Individual", list, fitness=creator.FitnessMax): creates another class named Individual, which is essentially a list of binary values representing features. Each individual has an associated fitness attribute, and it's assigned the FitnessMax class.

toolbox = base.Toolbox(): The toolbox is a container for registering functions that will be used in the genetic algorithm. It's a convenient way to organize and reuse these functions.

toolbox.register("attr_bool", random.randint, 0, 1): registers a function named attr_bool that generates random integers (0 or 1).

toolbox.register("individual", tools.initRepeat, creator.Individual, toolbox.attr_bool, n=num_features): registers a function named individual that initializes an Individual with binary values by calling the attr_bool function num_features times.

toolbox.register("population", tools.initRepeat, list, toolbox.individual): It initializes a population of individuals using the individual function.

The registered functions (evaluate, mate, mutate, and select) define the key components of the genetic algorithm: how individuals are evaluated, how they are combined (crossover), how they can change (mutation), and how they are selected for the next generation. These functions will be used by the genetic algorithm to evolve a population of individuals toward a solution that optimizes the defined evaluation criteria.


Genetic Algorithms for Feature Selection
population = toolbox.population(n=population_size): Creates the initial population of individuals with a specified size (population_size). Each individual is a binary string representing a set of features.

hof = tools.HallOfFame(1): Creates the Hall of Fame, which is a container to store the best individual found during the evolution. Here, it's set to store the single best individual.

The genetic algorithm is then run using the eaSimple function from the algorithms module.

The genetic algorithm iteratively evolves a population of individuals, optimizing the feature selection based on the defined evaluation criteria (fitness function). The final result is the best individual found, representing the best-scoring set of features for the given classification task.
