
Unit-2

Concepts of Machine Learning


CSA202

Presentation by:
Akanksha Shangloo
Asst. Professor
Dept. of CSE #Lecture1
School of Engineering and Technology
Supervised Learning
• Supervised learning is the type of machine learning in which machines are trained using well-"labelled" training data, and on the basis of that data, the machines predict the output.
• The labelled data means some input data is already tagged with the correct output.
• Supervised learning is a process of providing input data as well as correct output data to the
machine learning model.
• The aim of a supervised learning algorithm is to find a mapping function to map the input
variable(x) with the output variable(y).
• Supervised learning is classified into two categories of algorithms (a minimal sketch of both follows below):
✓ Classification: A classification problem is when the output variable is a category, such as “Red” or “Blue”, or “disease” or “no disease”.
✓ Regression: A regression problem is when the output variable is a real value, such as “dollars” or “weight”.
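A minimal sketch of both categories (illustrative only; it assumes scikit-learn is available and uses synthetic data rather than anything from the slides):

import numpy as np
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LinearRegression

# Classification: the output variable is a category (class 0 or class 1).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)
print("classification accuracy:", clf.score(X_te, y_te))

# Regression: the output variable is a real value.
X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
reg = LinearRegression().fit(X_tr, y_tr)
print("regression R^2:", reg.score(X_te, y_te))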
How Supervised Learning Algorithms work?
• In supervised learning, models are trained using a labelled dataset, where the model learns about each type of data.
• Once the training process is completed, the model is tested on test data (a held-out portion of the data that was not used for training), and then it predicts the output.
• For instance, suppose you are given a basket filled with different kinds of fruits.
• Now the first step is to train the machine with all the different fruits one by one.
• If the shape of the object is rounded, has a depression at the top and is red in color, then it will be labeled as – Apple.
• If the shape of the object is a long curving cylinder having Green-Yellow color, then it will be labeled as –
Banana.
• Now suppose that, after training, the machine is given a new fruit from the basket, say a banana, and is asked to identify it.
• Since the machine has already learned from the previous data, it now has to use that knowledge wisely: it will first classify the fruit by its shape and color, confirm the fruit name as BANANA, and put it in the Banana category.
Errors in Machine Learning
• Reducible errors: These errors can be
reduced to improve the model
accuracy.
• Such errors can further be classified
into bias and variance.
• Irreducible errors: These errors will
always be present in the model
regardless of which algorithm has
been used.
• The cause of these errors is unknown
variables whose influence cannot be
reduced.
What is Bias?
• The bias error is the difference between the predictions made by the Machine Learning model and the correct values.
• It can be defined as an inability of machine learning algorithms such as Linear Regression to
capture the true relationship between the data points.
• A model has either:
• Low Bias: A low bias model will make fewer assumptions about the form of the target function.
(e.g. Decision Trees, k-Nearest Neighbours and Support Vector Machines)
• High Bias: A model with a high bias makes more assumptions, and the model becomes unable to
capture the important features of our dataset. A high bias model also cannot perform well on
new data. (e.g. Linear Regression, Linear Discriminant Analysis and Logistic Regression)
• With high bias, the predicted values follow an overly simple (straight-line) pattern and therefore do not fit the data in the dataset accurately (data underfitting).
• High bias gives a large error on the training as well as the testing data.
• It is recommended that an algorithm should always be low-biased to avoid the problem of underfitting.
What is Variance?
• The variability of the model's predictions for a given data point, which tells us the spread of our predictions, is called the variance of the model.
• The variance specifies how much the prediction would change if different training data were used.
• In simple words, variance tells how much a random variable differs from its expected value.
• Variance errors are either of low variance or high variance.
• Low variance: It means there is a small variation in the prediction of the target
function with changes in the training data set.(e.g. Linear Regression, Logistic
Regression, and Linear discriminant analysis)
• High variance: It shows a large variation in the prediction of the target function
with changes in the training dataset.(e.g. decision tree, SVM, and KNN)
• With high variance, the model learns too much from the training dataset (including its noise), which leads to overfitting. A model with high variance has the following problems:
• It leads to overfitting.
• It increases model complexity.
Reducing Bias/Variance
• Ways to reduce High Bias:
✓Increase the input features as the
model is underfitted.
✓Decrease the regularization term.
✓Use more complex models, such as
including some polynomial features
• Ways to Reduce High Variance:
✓Reduce the input features or
number of parameters as a model
is overfitted.
✓Do not use an overly complex model.
✓Increase the training data.
Bias-Variance Trade-Off
• If the model is very simple with fewer parameters, it may have low variance
and high bias.
• Whereas, if the model has a large number of parameters, it will have high
variance and low bias.
• A balance is required to be maintained between bias and variance errors,
and this balance between the bias error and variance error is known as the
Bias-Variance trade-off.
• Bias-Variance trade-off is a central issue in supervised learning. Ideally, we
need a model that accurately captures the regularities in training data and
simultaneously generalizes well with the unseen dataset.
• A high variance algorithm may perform well with training data, but it may
lead to overfitting to noisy data.
• Whereas a high bias algorithm generates a much simpler model that may not
even capture important regularities in the data.
Bias-Variance Trade-Off
• For an accurate prediction of the model,
algorithms need a low variance and low
bias. But this is not possible because bias
and variance are related to each other:
• If we decrease the variance, it will increase
the bias.
• If we decrease the bias, it will increase the
variance.
• Therefore, we need to find a sweet spot between bias and variance to make
an optimal model; the sketch below illustrates this with polynomial models of
increasing complexity.
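A minimal sketch of the trade-off (illustrative only; it assumes scikit-learn and uses a synthetic noisy sine dataset): a degree-1 polynomial underfits (high bias), while a degree-15 polynomial overfits (high variance), visible as a low training error but a high test error.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 60)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=60)  # noisy target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    print("degree", degree,
          "train MSE:", round(mean_squared_error(y_tr, model.predict(X_tr)), 3),
          "test MSE:", round(mean_squared_error(y_te, model.predict(X_te)), 3))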
Assumption for Linear Regression Model
• Linear regression is a powerful tool for understanding and predicting the behavior
of a variable; however, it needs to meet a few conditions in order to give accurate
and dependable results.
1.Linearity: The independent and dependent variables have a linear relationship
with one another. This implies that changes in the dependent variable follow
those in the independent variable(s) in a linear fashion.
2.Independence: The observations in the dataset are independent of each other.
This means that the value of the dependent variable for one observation does
not depend on the value of the dependent variable for another observation.
3.No multicollinearity: There is no high correlation between the independent
variables. This indicates that there is little or no correlation between the
independent variables.
4.Normality: The errors in the model are normally distributed.
5.Homoscedasticity: Across all levels of the independent variable(s), the variance
of the errors is constant. This indicates that the amount of the independent
variable(s) has no impact on the variance of the errors.
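A minimal sketch of checking two of these assumptions on a fitted model (illustrative only; it assumes statsmodels and SciPy are available, and the synthetic data and thresholds are not from the slides):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from scipy import stats

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))                          # two independent variables
y = 3 + 2 * X[:, 0] - X[:, 1] + rng.normal(size=200)   # linear relationship plus noise

Xc = sm.add_constant(X)
fit = sm.OLS(y, Xc).fit()

# No multicollinearity: VIF near 1 is good; values above roughly 5-10 are a warning sign.
print("VIFs:", [round(variance_inflation_factor(Xc, i), 2) for i in range(1, Xc.shape[1])])

# Normality of the errors: Jarque-Bera test on the residuals (a large p-value gives no evidence against normality).
print("Jarque-Bera p-value:", round(stats.jarque_bera(fit.resid)[1], 3))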
Logistic Regression
• Logistic regression predicts the output of a categorical dependent
variable.
• Therefore the outcome must be a categorical or discrete value.
• The output value of logistic regression must be between 0 and 1 and cannot go
beyond this limit, so it forms a curve like the "S" form.
• The S-form curve is called the Sigmoid function or the logistic
function used to map the predicted values to probabilities.
• Assumptions for Logistic Regression:
✓The dependent variable must be categorical in nature.
✓The independent variable should not have multi-collinearity.
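A minimal sketch of the sigmoid and of logistic regression on synthetic data (illustrative only; the scikit-learn estimator and the make_classification dataset are assumptions, not part of the slides):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))            # the S-shaped logistic function

print(sigmoid(np.array([-5.0, 0.0, 5.0])))      # every value is squeezed into (0, 1)

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression().fit(X, y)
print(clf.predict_proba(X[:3]))                 # predicted class probabilities for 3 samples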
Overfitting and Underfitting
• Overfitting is a phenomenon that occurs when a Machine Learning model fits the
training set too closely and is not able to perform well on unseen data. It happens
when our model learns the noise in the training data as well, i.e. when the model
memorizes the training data instead of learning the patterns in it.
• Underfitting, on the other hand, is the case when our model is not able to learn
even the basic patterns available in the dataset. An underfitting model is unable to
perform well even on the training data, so we cannot expect it to perform well on
the validation data. In this case we are supposed to increase the complexity of the
model or add more features to the feature set.
Regularization
• Sometimes a model learns the noise in the training data and is then not able to
predict the output correctly when it deals with unseen data; such a model is called
overfitted.
• This problem can be dealt with using a regularization technique.
• Regularization is a technique to prevent the model from overfitting by adding extra
information to it.
• It allows us to keep all the variables or features in the model while reducing the
magnitude of their coefficients, thereby maintaining accuracy as well as the
generalization of the model.
• In the regularization technique, we reduce the magnitude of the coefficients while
keeping the same number of features.
Lasso Regression
• It stands for Least Absolute Shrinkage and Selection Operator.
• It is also referred to as the regression model that uses the L1 regularization technique.
• Lasso Regression adds the “absolute value of magnitude” of the coefficients as a
penalty term to the loss function (L):

L = (1/n) * sum_{i=1..n} (y_i - y_i(hat))^2 + lambda * sum_{j=1..m} |w_j|

Where,
• m – Number of Features
• n – Number of Examples
• y_i – Actual Target Value
• y_i(hat) – Predicted Target Value
• w_j – Coefficient of the j-th feature, and lambda – strength of the penalty
• Lasso regression also helps us achieve feature selection: the weights of features that
do not serve any purpose in the model are penalized down to approximately zero.
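A minimal sketch of this feature-selection effect (illustrative only; scikit-learn's Lasso, make_regression and the chosen alpha are assumptions for the example):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# 10 features, but only 3 of them actually influence the target.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0)
lasso = Lasso(alpha=1.0).fit(X, y)
print(np.round(lasso.coef_, 2))   # most coefficients come out as (approximately) 0.0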
Ridge Regression
• Ridge regression is one of the types of linear regression in which a
small amount of bias is introduced so that we can get better long-
term predictions.
• Ridge regression is a regularization technique, which is used to reduce
the complexity of the model. It is also called as L2 regularization.
• In this technique, the cost function is altered by adding the penalty
term to it.
• The amount of bias added to the model is called the Ridge Regression penalty.
It is calculated by multiplying lambda by the squared weight of each individual
feature, i.e. the penalty term is lambda * sum_{j=1..m} w_j^2.
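A minimal sketch comparing plain linear regression with Ridge (illustrative only; the scikit-learn estimators, synthetic data and alpha value are assumptions): the L2 penalty shrinks the coefficients but, unlike Lasso, does not set them exactly to zero.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

X, y = make_regression(n_samples=50, n_features=10, noise=10.0, random_state=0)
ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)      # alpha plays the role of lambda
print("OLS   total |w|:", round(np.abs(ols.coef_).sum(), 2))
print("Ridge total |w|:", round(np.abs(ridge.coef_).sum(), 2))   # smaller total magnitude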
Ordinary Least Squares
• The ordinary least squares (OLS) algorithm is a method for estimating
the parameters of a linear regression model.
• The OLS algorithm aims to find the values of the linear regression
model’s parameters (i.e., the coefficients) that minimize the sum of
the squared residuals.
• A linear regression model establishes the relation between a dependent
variable (y) and at least one independent variable (x) as:

y = b_0 + b_1*x
• In the OLS method, we have to choose the values of b_1 and b_0 such that S,
the total sum of squares of the differences between the calculated and observed
values of y, is minimised:

S = sum_{i=1..n} (y_i - (b_0 + b_1*x_i))^2

To get the values of b_0 and b_1 which minimise S, we can take a
partial derivative of S for each coefficient and equate it to zero.
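Setting these partial derivatives to zero gives the usual closed-form estimates, shown in the minimal sketch below (illustrative only; NumPy and the synthetic data are assumptions):

import numpy as np

rng = np.random.RandomState(0)
x = rng.uniform(0, 10, 100)
y = 4.0 + 2.5 * x + rng.normal(scale=1.0, size=100)   # true b_0 = 4.0, b_1 = 2.5

# b_1 = sum((x_i - x_mean) * (y_i - y_mean)) / sum((x_i - x_mean)^2);  b_0 = y_mean - b_1 * x_mean
x_mean, y_mean = x.mean(), y.mean()
b_1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b_0 = y_mean - b_1 * x_mean
print("b_0 =", round(b_0, 3), "b_1 =", round(b_1, 3))   # close to 4.0 and 2.5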
Normalization
• Normalization is a scaling technique in Machine Learning applied
during data preparation to change the values of numeric columns in
the dataset to use a common scale.
• It is not necessary for all datasets in a model.
• It is required only when features of machine learning models have
different ranges.
• Data normalization consists of remodeling numeric columns to a
standard scale.
• Data normalization is generally considered part of producing clean
data.
Normalization techniques in Machine Learning
• Min-Max normalization: In this technique of data normalization, a linear
transformation is performed on the original data. The minimum and maximum
values of the data are fetched, and each value v is replaced according to the
formula v' = (v - min(A)) / (max(A) - min(A)).
• Normalization by decimal scaling: It normalizes by moving the decimal point of
the data values. To normalize the data by this technique, we divide each value by
a power of ten chosen from the maximum absolute value of the data:
v_i' = v_i / 10^j, where j is the smallest integer such that max(|v_i'|) < 1.
• Z-score normalization or zero-mean normalization: In this technique, values are
normalized based on the mean and standard deviation of the data A:
v' = (v - mean(A)) / std(A).
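A minimal sketch of the three techniques applied to one numeric column (illustrative only; NumPy and the sample values are assumptions):

import numpy as np

A = np.array([120.0, 250.0, 300.0, 980.0, 45.0])

# Min-Max normalization: rescale to [0, 1].
min_max = (A - A.min()) / (A.max() - A.min())

# Decimal scaling: divide by 10^j, the smallest power of ten that brings every |value| below 1.
j = int(np.floor(np.log10(np.abs(A).max()))) + 1
decimal_scaled = A / (10 ** j)

# Z-score normalization: subtract the mean, divide by the standard deviation.
z_score = (A - A.mean()) / A.std()

print(min_max, decimal_scaled, z_score, sep="\n")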
Difference between Normalization and
Standardization
• Scaling statistics: Normalization uses the minimum and maximum values of a feature for scaling; Standardization uses the mean and standard deviation.
• When it helps: Normalization is helpful when features are of different scales; Standardization is helpful when the mean of a variable is set to 0 and the standard deviation is set to 1.
• Range: Normalization scales values to a range such as [0, 1] or [-1, 1]; Standardization does not restrict values to a specific range.
• Outliers: Normalization is affected by outliers; Standardization is comparatively less affected by outliers.
• Scikit-Learn: Scikit-Learn provides the MinMaxScaler transformer for Normalization and the StandardScaler transformer for Standardization.
• Other names: Normalization is also called scaling normalization; Standardization is known as Z-score normalization.
• Distribution: Normalization is useful when the feature distribution is unknown; Standardization is useful when the feature distribution is normal.
Advantages of Data Normalization

• We can have more clustered indexes.


• Index searching is often faster.
• Data modification commands are faster.
• Fewer null values and less redundant data, making your data more
compact.
• Data modification anomalies are reduced.
• Normalization is conceptually cleaner and easier to maintain and
change as your needs change.
• Searching, sorting, and creating indexes is faster, since tables are
narrower, and more rows fit on a data page.
Disadvantages of Normalization
• When information is dispersed over many tables, it becomes necessary to join
them together, which adds extra work.
• Tables may store codes rather than actual data, since repeated data is saved as
references (lines of numbers) instead of the data itself; as a result, a lookup table
must constantly be consulted.
• Query performance gradually slows down compared to the standard
(non-normalized) structure.
• To complete the normalization process successfully, it is vital to have a thorough
understanding of the various normal forms.
• A bad plan with substantial irregularities and data inconsistencies can
result from careless use.
Mean normalization
• Mean normalization, also known as zero-centering or subtracting the mean, is a data
preprocessing technique used in various fields, including statistics, machine learning, and
signal processing.
• Its primary purpose is to center the data by subtracting the mean (average) value of a
dataset from each data point.
• The goal is to make the data have a mean of zero.
• Here's how mean normalization works:
✓Calculate the mean (average) value of the dataset you want to normalize. This mean
value is computed across all the data points.
✓Subtract this mean value from each data point in the dataset.
• Mathematically, if you have a dataset with n data points {x1, x2, ..., xn}, the mean
normalization process would be:
xi_norm = xi - mean, for i = 1 to n
• Mean normalization is often used alongside other preprocessing techniques, such as
standardization or min-max scaling, depending on the requirements of the specific
analysis or machine learning model.
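A minimal sketch of mean normalization (illustrative only; NumPy and the sample values are assumptions):

import numpy as np

x = np.array([10.0, 12.0, 15.0, 18.0, 20.0])
x_norm = x - x.mean()            # subtract the mean from every data point
print(x_norm, x_norm.mean())     # the centered data now has a mean of (approximately) 0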
Gradient Descent
• Gradient Descent is one of the most commonly used iterative optimization
algorithms in machine learning, used to train machine learning and deep
learning models.
• It helps in finding the local minimum of a function.
• Gradient Descent is used to optimize the weights and biases based on the cost
function.
• The cost function evaluates the difference between the actual and predicted
outputs.
• If we move towards a negative gradient or away from the gradient of the
function at the current point, it will give the local minimum of that
function.
• Whenever we move towards a positive gradient or towards the gradient of
the function at the current point, we will get the local maximum of that
function.
Gradient Descent
• The main objective of using a gradient
descent algorithm is to minimize the cost
function using iteration.
• Calculate the first-order derivative of the
function to compute the gradient or slope
of that function.
• Move in the direction opposite to the
gradient: step away from the current
point by alpha times the gradient,
where alpha is the Learning Rate.
Note: The learning rate is a tuning parameter in the
optimization process which helps to decide
the length of the steps.
Gradient Descent: Working
• The equation for simple linear regression is given
as: Y = mX + c.
• An arbitrary starting point is chosen, and the
performance is evaluated at that point.
• At this starting point, we will derive the first derivative or
slope and then use a tangent line to calculate the
steepness of this slope. Further, this slope will inform the
updates to the parameters (weights and bias).
• The slope is steeper at the arbitrary starting point;
as new parameters are generated, the steepness
gradually reduces, until the curve approaches its
lowest point, which is called the point of
convergence.
• The main objective of gradient descent is to minimize the
cost function or the error between expected and actual.
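A minimal sketch of gradient descent for Y = mX + c with a squared-error cost (illustrative only; NumPy, the synthetic data, the starting point and the learning rate are assumptions):

import numpy as np

rng = np.random.RandomState(0)
X = rng.uniform(0, 1, 100)
Y = 3.0 * X + 2.0 + rng.normal(scale=0.1, size=100)    # true m = 3, c = 2

m, c, alpha = 0.0, 0.0, 0.1                            # arbitrary starting point and learning rate
for _ in range(2000):
    Y_pred = m * X + c
    # Gradients of the mean squared error with respect to m and c.
    dm = (2.0 / len(X)) * np.sum((Y_pred - Y) * X)
    dc = (2.0 / len(X)) * np.sum(Y_pred - Y)
    m -= alpha * dm                                    # step against the gradient
    c -= alpha * dc
print("m =", round(m, 3), "c =", round(c, 3))          # approaches the true values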
Learning Rate
• It is defined as the step size taken to
reach the minimum or lowest point.
• This is typically a small value that is
evaluated and updated based on the
behavior of the cost function.
• If the learning rate is high, it results in
larger steps but also leads to risks of
overshooting the minimum.
• At the same time, a low learning rate
means small step sizes, which
compromises overall efficiency but
gives the advantage of more precision.
Gradient Descent: Working
• Let us suppose we write our loss function for a single
row (using the squared error) as
J(w, b) = (1/2) * (y(hat) - y)^2, where y(hat) = w*x + b.
• In the above function, x and y are our input data,
i.e. constants.
• To find the optimal values of the weight w and the bias
b, we compute the gradient of the loss function J(w, b),
i.e. we partially differentiate it with respect to w and b.
The gradient of J(w, b) with respect to w is
∂J/∂w = (y(hat) - y) * x, and with respect to b it is
∂J/∂b = (y(hat) - y).
Implementations of the Gradient Descent algorithm
# PyTorch-style training loop. It assumes that model, x, y and Mean_Squared_Error were
# defined in a preceding example (a simple nn.Linear-based model and an MSE helper).
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

# Number of epochs
num_epochs = 1000
# Learning Rate
learning_rate = 0.01

# SUBPLOT WEIGHT & BIAS VS LOSSES
fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True)

for epoch in range(num_epochs):
    # Forward pass
    y_p = model(x)
    loss = Mean_Squared_Error(y_p, y)

    # Backpropagation: find the gradients of the loss w.r.t. the parameters
    loss.backward()

    # Model parameters
    w = model.linear.weight
    b = model.linear.bias

    # Manually update the model parameters (gradient tracking is not needed for the update itself)
    with torch.no_grad():
        w = w - learning_rate * w.grad
        b = b - learning_rate * b.grad

    # Assign the updated weight & bias back to the linear layer
    model.linear.weight = nn.Parameter(w)
    model.linear.bias = nn.Parameter(b)

    if (epoch + 1) % 100 == 0:
        ax1.plot(w.detach().numpy(), loss.item(), 'r*-')
        ax2.plot(b.detach().numpy(), loss.item(), 'g+-')
        print('Epoch [{}/{}], weight:{}, bias:{} Loss: {:.4f}'.format(
            epoch + 1, num_epochs, w.detach().numpy(), b.detach().numpy(), loss.item()))

ax1.set_xlabel('weight')
ax2.set_xlabel('bias')
ax1.set_ylabel('Loss')
ax2.set_ylabel('Loss')
plt.show()
Automatic convergence testing
• Automatic convergence testing in machine learning is a technique used to
determine when a machine learning algorithm has reached a point where
it can stop iterating and consider its training process complete.
• Convergence testing is crucial in iterative optimization algorithms
commonly used in machine learning, such as gradient descent, stochastic
gradient descent, and various optimization techniques in deep learning.
• The goal is to avoid unnecessary computation while ensuring that the
algorithm has sufficiently learned from the training data.
• The specific convergence-testing method and criteria can vary widely
depending on the machine learning algorithm and the problem being
solved.
• The use of convergence testing often interacts with other techniques such
as learning rate schedules, batch size, and regularization methods to
achieve optimal training results.
Automatic convergence testing
• Define a Convergence Criterion: You need to establish a criterion that defines when your
algorithm has converged. This criterion can be based on various factors, depending on
the nature of the algorithm and the problem you're solving. Common convergence
criteria include (a minimal sketch follows after this list):
I. Loss Function Threshold: Monitor the value of a loss function (e.g., mean squared
error in regression or cross-entropy in classification) during training. If the loss
decreases below a certain threshold or stops decreasing, you may consider the
algorithm to have converged.
II. Gradient Norm: Examine the norm (magnitude) of the gradient of the loss function
with respect to the model parameters. When the gradient becomes sufficiently small,
it indicates convergence.
III. Change in Parameters: Track the change in model parameters (weights and biases)
between iterations. If the changes fall below a certain threshold, it may suggest
convergence.
IV. Validation Set Performance: Monitor the performance of the model on a validation
dataset. If the performance plateaus or deteriorates, it can be an indicator of
convergence.
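A minimal sketch of automatic convergence testing on a small gradient-descent loop (illustrative only; NumPy, the synthetic data, the tolerance and the loss-change criterion are assumptions): training stops as soon as the decrease in the loss falls below the tolerance.

import numpy as np

rng = np.random.RandomState(0)
X = rng.normal(size=200)
y = 4.0 * X + 1.0 + rng.normal(scale=0.1, size=200)

w, b, alpha = 0.0, 0.0, 0.1
tol, prev_loss = 1e-8, np.inf
for step in range(100000):
    y_pred = w * X + b
    loss = np.mean((y_pred - y) ** 2)
    # Convergence criterion: the loss has (almost) stopped decreasing.
    if abs(prev_loss - loss) < tol:
        print("converged at step", step, "with loss", round(loss, 6))
        break
    prev_loss = loss
    dw = 2 * np.mean((y_pred - y) * X)
    db = 2 * np.mean(y_pred - y)
    w, b = w - alpha * dw, b - alpha * db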
Automatic convergence testing
• Iterate and Check Convergence: During the training process, the algorithm
iteratively updates its parameters while evaluating the convergence
criterion at each iteration. If the criterion is met, the algorithm stops
training.
• Early Stopping: To prevent overfitting, machine learning practitioners often
use a technique called early stopping. This involves monitoring the
validation set performance and stopping training if it starts to degrade,
even if the primary convergence criterion hasn't been met. Early stopping
helps avoid training for too long and producing an overfit model.
• Hyperparameter Tuning: The convergence criterion may also be
considered as a hyperparameter that you can tune. Different problems or
algorithms may require different criteria for convergence.
Data Redundancy
• Data redundancy refers to the duplication of data in a computer system.
• This duplication can occur at various levels, such as at the hardware or software
level, and can be intentional or unintentional.
• The main purpose of data redundancy is to provide a backup copy of data in case
the primary copy is lost or becomes corrupted.
• This can help to ensure the availability and integrity of the data in the event of a
failure or other problem.
• An attribute is known as redundant if it can be derived from another set of attributes.
• Let us consider a set of data with 20 attributes.
• Now suppose that one of these 20 attributes can be derived from some of the other
attributes.
• Such attributes, which can be derived from other attributes, are called redundant
attributes (see the sketch below for a simple way to spot them).
• Inconsistencies in attribute or dimension naming may lead to redundancies in the
set of data.
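A minimal sketch of spotting a redundant (derivable) attribute with pairwise correlation (illustrative only; pandas, NumPy and the column names are assumptions):

import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame({"height_cm": rng.normal(170, 10, 100),
                   "weight_kg": rng.normal(70, 8, 100)})
df["height_inch"] = df["height_cm"] / 2.54      # derivable from height_cm, hence redundant

corr = df.corr().abs()
print(corr)                                     # height_cm vs height_inch correlation = 1.0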
Advantages of data redundancy
• Increased data availability and reliability, as there are multiple copies
of the data that can be used in case the primary copy is lost or
becomes unavailable.
• Improved data integrity, as multiple copies of the data can be
compared to detect and correct errors.
• Increased fault tolerance, as the system can continue to function even
if one copy of the data is lost or corrupted.
Disadvantages of data redundancy
• Increased storage requirements, as multiple copies of the data must
be maintained.
• Increased complexity of the system, as managing multiple copies of
the data can be difficult and time-consuming.
• Increased risk of data inconsistencies, as multiple copies of the data
may become out of sync if updates are not properly propagated to all
copies
• Reduced performance, as the system may have to perform additional
work to maintain and access multiple copies of the data.
