
1. Describe the concept of learning in an ML system in relation to Task (T), Performance (P), and Experience (E).

Answer:

A computer program is said to learn from experience E with respect to some class of tasks T and
performance measure P, if its performance at tasks in T, as measured by P, improves with
experience E.

For example, a computer program that learns to play chess might improve its performance as
measured by its ability to win at the class of tasks involving playing chess, through experience
obtained by playing chess against itself. In general, to have a well-defined learning problem, we
must identify the class of tasks, the measure of performance to be improved, and the source of
experience. Consider that a chess-learning problem consists of the following: task, performance
measure, and training experience, where:

Task T is playing chess

Performance measure P is the percentage of games won against opponents

Training experience E is the program playing practice chess games against itself

2. With the aid of examples and diagrams, explain the difference between Supervised and Unsupervised Machine Learning.

Answer:

1. In Supervised learning, you train the machine using data which is well “labeled.”
2. Unsupervised learning is a machine learning technique where you do not need to
supervise the model.
3. Supervised learning allows you to collect data or produce a data output from
previous experience.
4. Unsupervised machine learning helps you find all kinds of unknown patterns in data.
5. Regression and Classification are two types of supervised machine learning techniques.
6. Clustering and Association are two types of Unsupervised learning.
7. In a supervised learning model, input and output variables are given, while in an
unsupervised learning model, only input data is given.
8. The most commonly used Supervised Learning algorithms are decision trees, logistic
regression, linear regression, and support vector machines.
9. The most commonly used Unsupervised Learning algorithms are k-means clustering,
hierarchical clustering, and the Apriori algorithm.
10. Some common examples of supervised learning include spam filters, fraud detection
systems, recommendation engines, and image recognition systems.
11. Some common examples of unsupervised learning include anomaly detection (for
example, detecting bot activity), pattern recognition (grouping images, transcribing
audio), and inventory management (by conversion activity or by availability).
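
To make the contrast concrete, here is a minimal sketch in Python, assuming scikit-learn is available (the synthetic data and the choice of logistic regression and k-means are illustrative, not prescribed by the answer above):

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = make_blobs(n_samples=200, centers=3, random_state=42)  # toy 2-D data

# Supervised: the labels y are given to the model during training.
clf = LogisticRegression().fit(X, y)
print("Supervised predictions:", clf.predict(X[:5]))

# Unsupervised: the model sees only X and must discover structure itself.
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("Unsupervised cluster labels:", km.labels_[:5])
```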
3. Explain the concept of Reinforcement Learning and describe where it is applicable.

Answer:

> Reinforcement learning is a feedback-based machine learning approach where an agent learns
which actions to perform by observing the environment and the results of its actions.

> For each correct action, the agent gets positive feedback, and for each incorrect action, the agent
gets negative feedback or a penalty.

> The agent interacts with the environment and identifies the possible actions it can perform.

> The primary goal of an agent in reinforcement learning is to perform actions, guided by the
environment, that obtain the maximum positive rewards.

> In reinforcement learning, the agent learns automatically from feedback without any labelled
data, unlike supervised learning.

> Since there is no labelled data, the agent is bound to learn from its experience alone.

> Reinforcement learning is used to solve specific types of problems where decision making is
sequential and the goal is long-term, such as game playing, robotics, etc.

There are two types of reinforcement learning, positive and negative:

Positive reinforcement learning is the recurrence of behaviour due to positive rewards.

This encourages the agent to execute similar actions that yield the maximum reward.

Similarly, in negative reinforcement learning, negative rewards are used as a deterrent to weaken a
behaviour and make the agent avoid it.

Negative rewards decrease the strength and frequency of a specific behaviour.

In a maze game, there may be a danger spot that leads to a loss.

Negative rewards are simulated in reinforcement learning, say +10 for a positive reward and -10 for
some danger or negative reward.

Reinforcement learning is sometimes loosely described as falling between supervised and unsupervised learning, and it is used to model sequential decision-making processes.

The applications of reinforcement learning include:

1. Natural Language Processing
2. Robotics
3. Transportation (optimal traffic control)
4. Gaming and recommendation systems
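
As an illustration of the feedback loop described above, here is a toy sketch in Python. The two-action environment, the +10/-10 rewards, and the learning-rate and exploration values are all assumptions made for this example:

```python
import random

q = [0.0, 0.0]             # estimated value of each action
alpha, epsilon = 0.1, 0.2  # learning rate and exploration rate (assumed values)

def environment(action):
    """Return positive feedback for the 'correct' action, negative otherwise."""
    return 10 if action == 1 else -10

for step in range(100):
    # Explore occasionally, otherwise exploit the best-known action
    action = random.randrange(2) if random.random() < epsilon else q.index(max(q))
    reward = environment(action)
    q[action] += alpha * (reward - q[action])  # update the estimate from feedback

print("Learned action values:", q)  # the agent should come to prefer action 1
```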
4. “Moments are a quantitative measure of the shape of a Probability Density Function”. With the aid
of diagrams, describe the four moments.

PDF: PDF stands for Probability Density Function. It is a mathematical function that describes the
probability distribution of a continuous random variable.

1. Why probability density and not probability?
2. What does the area of this graph represent?
3. How is probability then calculated?
   a. https://en.wikipedia.org/wiki/Normal_distribution
   b. https://en.wikipedia.org/wiki/Log-normal_distribution
   c. https://en.wikipedia.org/wiki/Poisson_distribution
4. Examples of PDFs

PMF: PMF stands for Probability Mass Function. It is a mathematical function that describes the
probability distribution of a discrete random variable. The PMF of a discrete random variable assigns
a probability to each possible value of the random variable. The probabilities assigned by the PMF
must satisfy two conditions:

a. The probability assigned to each value must be non-negative (i.e., greater than or equal to zero).
b. The sum of the probabilities assigned to all possible values must equal 1.

The First Moment

The first moment in machine learning refers to the mean of a distribution. The mean is a measure of
central tendency, and it is often represented as the first moment. In mathematical terms, the first
moment (μ) of a random variable X is given by

μ = E[X]

where E[X] denotes the expected value of X. In the case of a sample mean, it is calculated as the sum
of all data points divided by the number of data points.

Case 1: When all outcomes have the same probability of occurrence

The expected value is defined as the sum of all the values the variable can take times the probability of each value occurring. Intuitively, we can understand this as the arithmetic mean.

Case 2: When all outcomes don't have the same probability of occurrence

This is the more general equation: the expected value is the sum of all the values the variable can take, each multiplied by its corresponding probability.
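
A quick numeric check of the two cases (the values and probabilities below are made up for illustration):

```python
values = [1, 2, 3, 4]

# Case 1: equally probable outcomes -> expected value equals the arithmetic mean
mean = sum(values) / len(values)                      # 2.5

# Case 2: unequal probabilities -> probability-weighted sum
probs = [0.1, 0.2, 0.3, 0.4]
expected = sum(v * p for v, p in zip(values, probs))  # 3.0

print(mean, expected)
```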

Conclusion

For equally probable events, the expected value is exactly the same as the arithmetic mean. This is
one of the most popular measures of central tendency, which we also call the average.

The Second Moment

The second central moment is “variance”.

It measures the spread of values in the distribution, i.e., how far values fall from the mean. Formally, Var(X) = E[(X − μ)²].

Variance represents how a set of data points is spread out around their mean value.

The Third Moment

The third statistical moment is “Skewness”.

It measures how asymmetric the distribution is about its mean.

We can differentiate three types of distribution with respect to its skewness:

Symmetrical Distribution

If both tails of a distribution are symmetrical, and the skewness is equal to zero, then that
distribution is symmetrical.

Positively Skewed

In these types of distributions, the right tail (with larger values) is longer. So, this also tells us about
‘outliers’ that have values higher than the mean. Sometimes, this is also referred to as:

Right-skewed

Right-tailed

Skewed to the Right

Negatively Skewed

In these types of distributions, the left tail (with small values) is longer. So, this also tells us about
‘outliers’ that have values lower than the mean. Sometimes, this is also referred to as:

Left-skewed

Left-tailed
Skewed to the Left

The Fourth Moment

The fourth statistical moment is “kurtosis”.

It measures the weight of the distribution's tails and the presence of outliers.

It focuses on the tails of the distribution and explains whether the distribution is flat or sharply
peaked. This measure informs us whether our distribution is richer in extreme values than the
normal distribution.
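
A minimal sketch computing all four moments of a sample, assuming NumPy and SciPy are available (the log-normal sample is chosen only because it is visibly right-skewed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=0.5, size=10_000)  # a right-skewed sample

print("1st moment (mean):    ", np.mean(x))
print("2nd moment (variance):", np.var(x))
print("3rd moment (skewness):", stats.skew(x))       # > 0 for right-skewed data
print("4th moment (kurtosis):", stats.kurtosis(x))   # excess kurtosis vs. normal
```

Note that `scipy.stats.kurtosis` reports excess kurtosis by default, i.e., kurtosis relative to the normal distribution's value of 3.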

5. Describe how you would use Naïve Bayes to create a simple spam classifier.
https://medium.com/analytics-vidhya/how-to-build-spam-classifier-with-naive-bayes-beginner-guide-6c40d2c0a559
Step 0: Introduction to the Naive Bayes Theorem
Step 1.1: Understanding our dataset
Step 1.2: Data Preprocessing
Step 2.1: Bag of Words (BoW)
Step 2.2: Implementing BoW from scratch (available only on the Jupyter Notebook, not in
this article)
Step 2.3: Implementing Bag of Words in scikit-learn
Step 3.1: Training and testing sets
Step 3.2: Applying Bag of Words processing to our dataset.
Step 4.1: Bayes Theorem implementation from scratch
Step 4.2: Naive Bayes implementation from scratch
Step 5: Naive Bayes implementation using scikit-learn
Step 6: Evaluating our model
Step 7: Conclusion
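A condensed sketch of the pipeline outlined above, assuming scikit-learn is available (the four toy messages are invented; a real classifier would use a dataset such as the SMS spam data referenced in the article):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = ["win a free prize now", "meeting at noon tomorrow",
            "free cash click here", "lunch with the team"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

vectorizer = CountVectorizer()             # Step 2: Bag of Words
X = vectorizer.fit_transform(messages)

model = MultinomialNB().fit(X, labels)     # Steps 4-5: Naive Bayes
test = vectorizer.transform(["free prize meeting"])
print(model.predict(test))                 # Step 6: evaluate on new text
```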
6. What considerations must be kept in mind when selecting a sample dataset?
Best Practices for Selecting a Dataset
I’d like to start with best practices for selecting datasets you may find online.

Later on, I will also dig into the best practices for creating one on your own but first have a
look at the six key steps to identify and keep in mind during your data selection process:

Understand the problem:


It’s essential to understand the problem you’re trying to solve. This includes identifying the
input and output variables, the type of problem (classification, regression, clustering, etc.),
and the performance metrics.
Define the scope:
This step will help you narrow down the datasets that are relevant to your problem. It
includes specifying the industry or domain you’re working in, the type of data you need (text,
image, audio, etc.), and any constraints or limitations on the dataset.
Look for quality:
Quality is key when selecting a dataset. Look for datasets that are reliable, accurate, and
relevant to your problem. Check for missing data, outliers, and inconsistencies that can
negatively impact the performance of your model.
Consider the size:
The size of the dataset is crucial, as it affects the accuracy and generalization of your model.
A larger dataset usually leads to a more accurate and robust model but also requires more
computational resources and longer training times.
Check for biases:
Biases in the dataset can lead to unfair or inaccurate predictions. Watch out for biases
related to the data collection process, such as sampling biases, and those related to societal
issues, such as gender, race, or socioeconomic status.
Seek diversity:
A diverse dataset can help your model learn from a wide range of examples and avoid
overfitting specific patterns in the data. Consider selecting datasets with a variety of samples
from different sources, populations, or locations.
These best practices are a good first step in identifying and selecting a proper dataset for
your specific AI problem.

Next, I’d like to dig into typical pitfalls to avoid when selecting a predefined dataset.

Pitfalls to Avoid When Selecting a Dataset


Even though most of the following hints refer to predefined datasets, most of them can, of
course, be applied to self-created ones as well.
Insufficient Data
The first and foremost issue might be insufficient data.

Generally, it can lead to poor model performance, as the model may not be able to capture
the underlying patterns in the data.

If you don’t have enough data, consider using techniques such as data augmentation or
transfer learning to enhance your dataset or model capabilities.

Another option may be to combine multiple datasets into one, if the labels allow for it.

Imbalanced Classes
Generally speaking, this problem describes the circumstance when one class has significantly
more samples than the other, leading to biased predictions or other unwanted model
behavior.
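
A small sketch for spotting the imbalance and applying one common mitigation, class reweighting (the 95/5 split and random features are assumptions for illustration; scikit-learn and NumPy assumed):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))              # hypothetical features
y = np.array([0] * 950 + [1] * 50)          # hypothetical 95/5 class split

values, counts = np.unique(y, return_counts=True)
print(dict(zip(values, counts)))            # {0: 950, 1: 50} reveals the imbalance

# One common mitigation: weight classes inversely to their frequency
clf = LogisticRegression(class_weight="balanced").fit(X, y)
```

Resampling techniques, such as oversampling the minority class, are another common option.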

7. What does r-squared measure? What is the formula used to calculate r-squared and is an r-
squared value of 0.08 good or bad? Explain.
The residual for a point in the data is the difference between the actual value and the value
predicted by our linear regression model.
R² = 1 − (Sum of Squared Residuals / Total Sum of Squares)
Total Sum of Squares:
The total variation in the target variable, i.e., the sum of squared differences between the actual
values and their mean.
R-Squared, also known as the coefficient of determination, is a regression evaluation metric
that tells you how well a regression model approximates the actual data compared to simply
predicting the mean of the actual data.

How is it calculated? It is one minus the ratio of the sum of the squared residuals (residuals
being the differences between the actual targets and the model's predictions) to the sum of
the squared differences between the actual targets and the target mean.

How do you interpret it? R-Squared values typically lie between zero and one but are often
stated as percentages. In general, the higher the R-Squared, the better the model performs
compared to a baseline model that simply predicts the mean of the target. A lower R-Squared
value indicates that your independent variables are not able to explain the variation in your
dependent variable. A related metric is adjusted R-Squared, which is often used to check
whether newly added features in your dataset are relevant or not.
Interpretation: An R-squared value of 0.08 indicates that only 8% of the variance in
the dependent variable is explained by the model. While the interpretation of "good"
or "bad" depends on the context, an R-squared of 0.08 might be considered relatively
low and suggests that the model may not be a strong predictor for the given data.
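The calculation can be verified in a few lines (toy numbers; scikit-learn assumed for the cross-check):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 7.1, 8.6])

ss_res = np.sum((y_true - y_pred) ** 2)         # sum of squared residuals
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
print(1 - ss_res / ss_tot)                      # manual R-squared
print(r2_score(y_true, y_pred))                 # same value via sklearn
```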

8. How is Covariance related to Correlation? What does a small Covariance (close to 0) and a
large Covariance (far from 0) mean?

Covariance: It measures how two variables change together. A positive covariance indicates that
as one variable increases, the other tends to increase, and vice versa for negative covariance.
However, the scale of covariance is not standardized, making it challenging to interpret.
Correlation: It is a standardized measure of the strength and direction of the linear relationship
between two variables. Correlation takes values between -1 and 1, where -1 indicates a perfect
negative linear relationship, 1 indicates a perfect positive linear relationship, and 0 indicates no
linear relationship.

Interpretation of Covariance:

Small Covariance (close to 0): Indicates weak or no linear relationship between variables.
However, it doesn't provide information about the strength or direction of the relationship.

Large Covariance (far from 0): Suggests a strong linear relationship between variables, but
without normalization, it's challenging to compare the strength of relationships between
different pairs of variables.

Key Points:

Covariance's scale is not standardized, making it difficult to compare across different pairs of
variables.

Correlation standardizes covariance, providing a more interpretable measure of the strength and
direction of the linear relationship.

A covariance close to 0 suggests weak or no linear relationship, while a large positive or negative
covariance indicates a strong linear relationship, but the scale makes it hard to compare with
other pairs of variables.

r (correlation) = Cov(x, y) / (σx · σy)
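
A quick numerical check of this relationship with NumPy (the data points are made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

cov = np.cov(x, y)[0, 1]                              # covariance (scale-dependent)
corr = cov / (np.std(x, ddof=1) * np.std(y, ddof=1))  # standardized to [-1, 1]
print(cov, corr, np.corrcoef(x, y)[0, 1])             # corr matches np.corrcoef
```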
9. Decision Tree:

Decision Trees in machine learning are used in both classification and regression tasks, though
mostly in classification tasks.

Refer to the notebook; a minimal sketch follows below.
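
In place of the notebook example, here is a minimal sketch on the Iris dataset (scikit-learn assumed; the shallow depth is chosen only to keep the printed tree readable):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree))  # the learned if/else splits, readable as a tree
```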

10. Random Forest:

A Random Forest is an ensemble learning method that combines the predictions of multiple
individual decision trees to improve overall performance and generalization. It is used for both
classification and regression tasks. In the context of "Bagging" (Bootstrap Aggregating), here's an
explanation:

Bagging Overview:

Bagging involves training multiple instances of the same learning algorithm on different random
subsets of the training data. Each model is trained independently, and their predictions are
combined, often through averaging (for regression) or voting (for classification), to make the final
prediction.

Random Forest as a Bagging Algorithm:

A Random Forest is a specific implementation of bagging using decision trees as the base
learners. It introduces randomness during both the training of individual trees and the prediction
process.

Key Features of Random Forest:

Random Subsets (Bootstrap Samples): Each tree in the Random Forest is trained on a random
subset of the training data, sampled with replacement (bootstrap sampling). This introduces
diversity among the trees.

Random Feature Selection: During the construction of each tree, a random subset of features is
considered at each split. This helps to decorrelate the trees and makes the Random Forest robust
to overfitting.

Voting or Averaging: For classification tasks, the Random Forest combines the predictions of
individual trees through majority voting, and for regression, it uses averaging.

Advantages of Random Forest:

Reduced Overfitting: The ensemble nature of Random Forests, along with the randomness
introduced, helps reduce overfitting, making them more robust on unseen data.

High Accuracy: Random Forests often provide high accuracy due to the combination of multiple
trees, capturing a more comprehensive representation of the underlying patterns in the data.

Feature Importance: Random Forests can provide insights into feature importance, helping
identify which features contribute the most to the predictive performance.

Versatility: Suitable for various types of data and tasks, and generally requires minimal
hyperparameter tuning.

In summary, a Random Forest is a bagging algorithm that leverages the power of multiple
decision trees trained on random subsets of data and features to enhance predictive
performance, reduce overfitting, and provide a more robust model for classification and
regression tasks.
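
A minimal sketch showing bagging in action (the dataset and hyperparameters are illustrative; scikit-learn assumed):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 100 trees, each trained on a bootstrap sample with random feature subsets per split
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("Accuracy:", rf.score(X_te, y_te))
print("Feature importances:", rf.feature_importances_[:5])
```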

11. K-means clustering and the k-NN algorithm: refer to the notebook with graphs; minimal sketches follow below.
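
In place of the notebook graphs, here are minimal sketches of both algorithms (synthetic data and parameter choices are assumptions; scikit-learn assumed):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=150, centers=3, random_state=7)

# k-means: unsupervised - groups points into k clusters by distance to centroids
km = KMeans(n_clusters=3, n_init=10, random_state=7).fit(X)
print("Cluster centers:\n", km.cluster_centers_)

# k-NN: supervised - classifies a point by majority vote of its k nearest neighbors
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print("Prediction:", knn.predict(X[:1]))
```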

12. Q-learning

Q-learning models operate in an iterative process that involves multiple components working
together to help train a model. The iterative process involves the agent learning by exploring the
environment and updating the model as the exploration continues. The multiple components of Q-
learning include the following:
Agents. The agent is the entity that acts and operates within an environment.

States. The state is a variable that identifies an agent's current position in an environment.

Actions. The action is the agent's operation when it is in a specific state.

Rewards. A foundational concept within reinforcement learning is the concept of providing either a
positive or a negative response for the agent's actions.

Episodes. An episode ends when the agent reaches a terminal state and can no longer take a new action.

Q-values. The Q-value is the metric used to measure an action at a particular state.

Here are the two methods to determine the Q-value:

Temporal difference. The temporal difference formula updates the Q-value by comparing the current
estimate with the reward received plus the discounted value of the next state and action.

Bellman's equation. Mathematician Richard Bellman introduced this equation in 1957 as a recursive
formula for optimal decision-making. In the Q-learning context, Bellman's equation is used to help
calculate the value of a given state and assess its relative position. The state with the highest value is
considered the optimal state.

Q-learning models work through trial-and-error experiences to learn the optimal behavior for a task.
The Q-learning process involves modeling optimal behavior by learning an optimal action-value
function, or Q-function. This function represents the optimal long-term value of taking action a in
state s and then following optimal behavior in every subsequent state.

Bellman's equation, as applied in the Q-learning update rule:

Q(s,a) = Q(s,a) + α * (r + γ * max(Q(s',a')) - Q(s,a))

The equation breaks down as follows:

Q(s, a) represents the expected reward for taking action a in state s.

The actual reward received for that action is referenced by r while s' refers to the next state.

The learning rate is α and γ is the discount factor.

The highest expected reward for all possible actions a' in state s' is represented by max(Q(s', a')).

What is a Q-table?

The Q-table includes columns and rows with lists of rewards for the best actions of each state in a
specific environment. A Q-table helps an agent understand what actions are likely to lead to positive
outcomes in different situations.

Refer to the notebook regarding Q-learning; a minimal sketch follows below.
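
Here is a tabular Q-learning sketch on a hypothetical one-dimensional corridor (states 0-4, actions left/right, reward +10 at the right end; the environment and hyperparameters are all assumptions for illustration). It applies the update rule given above:

```python
import random

n_states, n_actions = 5, 2
Q = [[0.0] * n_actions for _ in range(n_states)]  # the Q-table
alpha, gamma, epsilon = 0.1, 0.9, 0.2             # assumed hyperparameters

for episode in range(500):
    s = 0
    while s != n_states - 1:                      # episode ends at the terminal state
        # epsilon-greedy action selection: explore sometimes, else exploit
        if random.random() < epsilon:
            a = random.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda act: Q[s][act])
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 10 if s_next == n_states - 1 else 0
        # the temporal-difference / Bellman update from the text
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next

print(Q)  # right-moving actions should carry the highest values
```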


13. Principal Component Analysis (PCA):

> It’s a dimension reduction (variable reduction) technique.

> It creates new variables by taking linear combinations of our existing variables.

> The new variables will do a better job of explaining the variation in our data (if we take a certain
subset of them).

Principal Component Analysis (PCA) is a powerful technique used in data analysis, particularly for
reducing the dimensionality of datasets while preserving crucial information. It does this by
transforming the original variables into a set of new, uncorrelated variables called principal
components. Here’s a breakdown of PCA’s key aspects:

Dimensionality Reduction: PCA helps manage high-dimensional datasets by extracting essential
information and discarding less relevant features, simplifying analysis.

Data Exploration and Visualization: It plays a significant role in data exploration and visualization,
aiding in uncovering hidden patterns and insights.

Linear Transformation: PCA performs a linear transformation of data, seeking directions of maximum
variance.

Feature Selection: Principal components are ranked by the variance they explain, allowing for
effective feature selection.

Data Compression: PCA can compress data while preserving most of the original information.

Clustering and Classification: It finds applications in clustering and classification tasks by reducing
noise and highlighting underlying structure.

Advantages: PCA offers linearity, computational efficiency, and scalability for large datasets.

Limitations: It assumes data normality and linearity and may lead to information loss.

Matrix Requirements: PCA works with symmetric correlation or covariance matrices and requires
numeric, standardized data.

Eigenvalues and Eigenvectors: Eigenvalues represent variance magnitude, and eigenvectors indicate
variance direction.

Number of Components: The number of principal components chosen determines the number of
eigenvectors computed.

Note: We must use numeric variables only.

Relation to the Curse of Dimensionality:

Curse of Dimensionality: The curse of dimensionality refers to challenges and limitations that arise
when dealing with high-dimensional data. As the number of features (dimensions) increases, the
amount of data needed to effectively cover the feature space grows exponentially, leading to issues
like increased computational complexity, sparse data, and overfitting.
PCA Mitigation of Curse of Dimensionality:

PCA addresses the curse of dimensionality by identifying and retaining the most important features
(principal components) while discarding less informative ones. This reduces the dimensionality of the
data, making subsequent analysis more efficient and less prone to overfitting.
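
A minimal PCA sketch on standardized numeric data (the Iris dataset and the two-component choice are illustrative; scikit-learn assumed):

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)          # 4 numeric features
X_std = StandardScaler().fit_transform(X)  # PCA expects standardized, numeric data

pca = PCA(n_components=2)                  # keep the top 2 principal components
X_2d = pca.fit_transform(X_std)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```

The explained variance ratio shows how much of the total variance each retained component carries, which is how one judges whether the reduction preserved most of the original information.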
