
According to Arthur Samuel, "Machine learning enables a machine to automatically learn from data, improve its performance from experience, and predict things without being explicitly programmed."

In simple words, when we feed training data to a machine learning algorithm, the algorithm produces a mathematical model, and with the help of that model the machine can make predictions and take decisions without being explicitly programmed. Also, the more the machine works with the training data, the more experience it gains, and the more experience it gains, the more accurate the results it produces.

Example: In a driverless car, the training data fed to the algorithm covers how to drive the car on highways and on busy and narrow streets, with factors such as speed limits, parking, and stopping at signals. A logical and mathematical model is then created on the basis of that data, and the car drives according to this model. Also, the more data that is fed, the more accurate the output becomes.

Designing a Learning System in Machine Learning:

According to Tom Mitchell, "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."
Example: In spam e-mail detection,
• Task, T: To classify mails as Spam or Not Spam.
• Performance measure, P: The percentage of mails correctly classified as "Spam" or "Not Spam".
• Experience, E: A set of mails labelled "Spam" or "Not Spam".
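As a hedged illustration of T, P, and E together, the sketch below trains a toy spam classifier with scikit-learn: the labelled mails are the experience E, classifying them as Spam or Not Spam is the task T, and accuracy is the performance measure P. The example messages and the choice of a naive Bayes model are assumptions made purely for illustration.

```python
# Toy spam detector tying together T (task), P (performance), E (experience).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

mails  = ["win a free prize now", "meeting at 10 am", "cheap loans available", "lunch tomorrow?"]
labels = ["Spam", "Not Spam", "Spam", "Not Spam"]          # experience E: labelled mails

X = CountVectorizer().fit_transform(mails)                 # bag-of-words features
model = MultinomialNB().fit(X, labels)                     # task T: classify Spam / Not Spam

predictions = model.predict(X)
print("P (accuracy):", accuracy_score(labels, predictions))  # performance measure P
```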
The steps for designing a learning system are:

Step 1) Choosing the training experience: The first and most important task is to choose the training data or training experience that will be fed to the machine learning algorithm. It is important to note that this data or experience has a significant impact on the success or failure of the model, so the training data or experience should be chosen wisely.
Below are the attributes that impact the success or failure of the model:
• The training experience should provide direct or indirect feedback regarding choices. For example, while playing chess the training data can provide feedback such as: if this move is chosen instead of that one, the chances of success increase.
• The second important attribute is the degree to which the learner controls the sequence of training examples. For example, when training data is first fed to the machine its accuracy is very low, but as it gains experience by playing again and again against itself or an opponent, the algorithm receives feedback and controls the chess game accordingly.
• The third important attribute is how well the training experience represents the distribution of examples over which the final performance will be measured. A machine learning algorithm gains experience by going through many different cases and examples; by passing through more and more examples, its performance increases.
Step 2) Choosing the target function: The next important step is choosing the target function. This means that, according to the knowledge fed to the algorithm, the machine learning system will choose a NextMove function which describes what type of legal moves should be taken. For example, while playing chess against an opponent, when the opponent plays, the machine learning algorithm decides which of the possible legal moves to take in order to succeed.
Step 3) Choosing a representation for the target function: Once the machine knows all the possible legal moves, the next step is to choose an optimized representation for the target function, e.g. linear equations, a hierarchical graph representation, a tabular form, etc. Using this representation, the NextMove function selects, out of the available moves, the one with the highest success rate. For example, if the chess machine has 4 possible moves, it will choose the optimized move that gives it the best chance of success.
Step 4) Choosing a function approximation algorithm: An optimized move cannot be chosen from the training data alone. The learner has to work through a set of examples; through these examples it approximates which steps to choose, and the outcome provides feedback. For example, when training data for playing chess is fed to the algorithm, the algorithm may fail or succeed at first, and from each failure or success it estimates which step should be chosen for the next move and what its success rate is.
Step 5) Final design: The final design is created at the end, once the system has gone through a number of examples, failures and successes, and correct and incorrect decisions, and has learned what the next step should be. Example: Deep Blue, an intelligent ML-based computer, won a chess match against the chess grandmaster Garry Kasparov and became the first computer to beat a human chess champion.

Decision Tree

Decision Tree: A decision tree is one of the most powerful and popular tools for classification and prediction. A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label.
 

A decision tree for the concept PlayTennis. 


Construction of Decision Tree: 
A tree can be “learned” by splitting the source set into subsets based on an attribute value test.
This process is repeated on each derived subset in a recursive manner called recursive
partitioning. The recursion is completed when the subset at a node all has the same value of the
target variable, or when splitting no longer adds value to the predictions. The construction of
decision tree classifier does not require any domain knowledge or parameter setting, and
therefore is appropriate for exploratory knowledge discovery. Decision trees can handle high
dimensional data. In general decision tree classifier has good accuracy. Decision tree induction is
a typical inductive approach to learn knowledge on classification. 
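To make recursive partitioning concrete, here is a minimal ID3-style sketch (an illustrative choice, not the procedure of any particular library): at each node it picks the attribute with the highest information gain and recurses on the resulting subsets until a node is pure or no attributes remain. The dictionary-based row format and the "label" key are assumptions made for this example.

```python
# Minimal ID3-style recursive partitioning sketch (illustrative assumptions:
# each row is a dict of attribute values plus a "label" key).
from collections import Counter
import math

def entropy(rows):
    # Shannon entropy of the class labels in this subset
    counts = Counter(r["label"] for r in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def best_attribute(rows, attributes):
    # Attribute whose split yields the largest information gain
    base = entropy(rows)
    def gain(attr):
        groups = {}
        for r in rows:
            groups.setdefault(r[attr], []).append(r)
        remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
        return base - remainder
    return max(attributes, key=gain)

def build_tree(rows, attributes):
    labels = {r["label"] for r in rows}
    if len(labels) == 1:                 # pure subset: stop splitting
        return labels.pop()
    if not attributes:                   # no attributes left: majority vote
        return Counter(r["label"] for r in rows).most_common(1)[0][0]
    attr = best_attribute(rows, attributes)
    tree = {attr: {}}
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        tree[attr][value] = build_tree(subset, [a for a in attributes if a != attr])
    return tree
```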
Decision Tree Representation: 
Decision trees classify instances by sorting them down the tree from the root to some leaf node,
which provides the classification of the instance. An instance is classified by starting at the root
node of the tree, testing the attribute specified by this node, then moving down the tree branch
corresponding to the value of the attribute as shown in the above figure. This process is then
repeated for the subtree rooted at the new node. 
The decision tree in the above figure classifies a particular morning according to whether it is suitable for playing tennis, and returns the classification associated with the particular leaf (in this case Yes or No).

For example, the instance

(Outlook = Rain, Temperature = Hot, Humidity = High, Wind = Strong)

would be sorted down the leftmost branch of this decision tree and would therefore be classified as a negative instance.
In other words, we can say that a decision tree represents a disjunction of conjunctions of constraints on the attribute values of instances:

(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)
Strengths and Weaknesses of the Decision Tree Approach
The strengths of decision tree methods are:

• Decision trees are able to generate understandable rules.
• Decision trees perform classification without requiring much computation.
• Decision trees are able to handle both continuous and categorical variables.
• Decision trees provide a clear indication of which fields are most important for prediction or classification.
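For a practical counterpart, the hedged sketch below fits scikit-learn's DecisionTreeClassifier on a toy PlayTennis-style table and prints the learned rules. The handful of rows and the ordinal encoding of the categorical attributes are illustrative assumptions, not the dataset behind the figure above.

```python
# A toy PlayTennis-style example with scikit-learn (rows are assumptions).
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "Outlook":    ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Overcast"],
    "Humidity":   ["High", "Normal", "High", "High", "Normal", "Normal"],
    "Wind":       ["Weak", "Strong", "Weak", "Strong", "Weak", "Strong"],
    "PlayTennis": ["No", "Yes", "Yes", "No", "Yes", "Yes"],
})

# Encode the categorical attributes as numbers and fit an entropy-based tree
X = OrdinalEncoder().fit_transform(data[["Outlook", "Humidity", "Wind"]])
y = data["PlayTennis"]
clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

# Print the learned tree as if/else rules
print(export_text(clf, feature_names=["Outlook", "Humidity", "Wind"]))
```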

Entropy

Definition: [E]ntropy provides an absolute limit on the shortest possible average length of a lossless compression encoding of the data produced by a source, and if the entropy of the source is less than the channel capacity of the communication channel, the data generated by the source can be reliably communicated to the receiver.

This definition is difficult to understand at first, and it is not necessarily pertinent to our discussion of decision trees. Shannon (1948) used the concept of entropy in the theory of communication to determine how to send encoded information (bits) from a sender to a receiver without loss of information and with the minimum number of bits.

Please take a look at Demystifying Entropy and The Intuition Behind Shannon's Entropy for an easy-to-understand explanation.

Bits

What are bits? We usually deal with TRUE or FALSE values in if statements, and each such value takes 1 bit of data. A bit takes on a single binary value, 0 (FALSE) or 1 (TRUE). Storage capability grows with each additional bit: with x bits, the number of representable values is n = 2^x.

Lossless

This concept simply means that no information is lost in the transmission from sender to receiver.

Formula

H(X) = −Σ p(x) log2 p(x), where the sum runs over all possible messages x.

The formula above gives us the minimum average encoding size, obtained by using the minimum encoding size for each message type.

High Entropy : More uncertainty

Low Entropy : More Predictability
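A small code sketch of the formula may help: it computes H(X) = -Σ p(x) log2 p(x) for a discrete distribution and shows that a fair coin (maximum uncertainty) has higher entropy than a heavily biased one (more predictable). The example probabilities are assumptions.

```python
# Shannon entropy of a discrete distribution (example probabilities assumed).
import math

def shannon_entropy(probs):
    # H(X) = -sum over x of p(x) * log2 p(x); terms with p = 0 contribute 0
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))    # fair coin: 1.0 bit (high entropy, more uncertainty)
print(shannon_entropy([0.99, 0.01]))  # biased coin: ~0.08 bits (low entropy, more predictability)
```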


Occam's Razor

Occam's razor, also spelled Ockham's razor and also called the law of economy or law of parsimony, is a principle stated by the Scholastic philosopher William of Ockham (1285–1347/49) that pluralitas non est ponenda sine necessitate, "plurality should not be posited without necessity." The principle gives precedence to simplicity: of two competing theories, the simpler explanation of an entity is to be preferred. The principle is also expressed as "Entities are not to be multiplied beyond necessity."

The principle was, in fact, invoked before Ockham by Durandus of Saint-Pourçain, a French Dominican theologian and philosopher of dubious orthodoxy, who used it to explain that abstraction is the apprehension of
orthodoxy, who used it to explain that abstraction is the apprehension of
some real entity, such as an Aristotelian cognitive species, an active
intellect, or a disposition, all of which he spurned as unnecessary. Likewise,
in science, Nicole d’Oresme, a 14th-century French physicist, invoked the
law of economy, as did Galileo later, in defending the simplest hypothesis of
the heavens. Other later scientists stated similar simplifying laws and
principles.

Ockham, however, mentioned the principle so frequently and employed it so sharply that it was called "Occam's razor" (also spelled Ockham's razor).
He used it, for instance, to dispense with relations, which he held to be
nothing distinct from their foundation in things; with efficient causality,
which he tended to view merely as regular succession; with motion, which
is merely the reappearance of a thing in a different place; with
psychological powers distinct for each mode of sense; and with the presence
of ideas in the mind of the Creator, which are merely the creatures
themselves.

MODULE 2

Probably Approximately Correct (PAC) Learning

In computational learning theory, probably approximately correct (PAC) learning is a framework for mathematical analysis of machine learning. It was proposed in 1984 by Leslie Valiant.[1]

In this framework, the learner receives samples and must select a generalization function (called the
hypothesis) from a certain class of possible functions. The goal is that, with high probability (the
"probably" part), the selected function will have low generalization error (the "approximately
correct" part). The learner must be able to learn the concept given any arbitrary approximation ratio,
probability of success, or distribution of the samples.

The model was later extended to treat noise (misclassified samples).

An important innovation of the PAC framework is the introduction of computational complexity theory concepts to machine learning. In particular, the learner is expected to find efficient functions (time and space requirements bounded by a polynomial in the example size), and the learner itself must implement an efficient procedure (requiring an example count bounded by a polynomial in the concept size, modified by the approximation and likelihood bounds).
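As a concrete, hedged illustration, the sketch below evaluates the standard PAC sample-complexity bound for a consistent learner over a finite hypothesis class, m ≥ (1/ε)(ln|H| + ln(1/δ)); the hypothesis-class size and the ε and δ values used here are assumptions for illustration only.

```python
# Standard PAC sample-complexity bound for a finite, consistent hypothesis class:
# m >= (1/epsilon) * (ln|H| + ln(1/delta)).  The numbers below are assumptions.
import math

def pac_sample_bound(hypothesis_space_size, epsilon, delta):
    return math.ceil((1.0 / epsilon) * (math.log(hypothesis_space_size) + math.log(1.0 / delta)))

# e.g. |H| = 2**10 hypotheses, 5% error tolerance (epsilon), 95% confidence (delta = 0.05)
print(pac_sample_bound(2**10, epsilon=0.05, delta=0.05))
```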

k-Fold Cross-Validation
Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample.

The procedure has a single parameter called k that refers to the number of groups that a given data
sample is to be split into. As such, the procedure is often called k-fold cross-validation. When a
specific value for k is chosen, it may be used in place of k in the reference to the model, such as k=10
becoming 10-fold cross-validation.

Cross-validation is primarily used in applied machine learning to estimate the skill of a machine
learning model on unseen data. That is, to use a limited sample in order to estimate how the model is
expected to perform in general when used to make predictions on data not used during the training of
the model.

It is a popular method because it is simple to understand and because it generally results in a less
biased or less optimistic estimate of the model skill than other methods, such as a simple train/test
split.

The general procedure is as follows:

1. Shuffle the dataset randomly.
2. Split the dataset into k groups.
3. For each unique group:
   1. Take the group as a hold-out or test data set.
   2. Take the remaining groups as a training data set.
   3. Fit a model on the training set and evaluate it on the test set.
   4. Retain the evaluation score and discard the model.
4. Summarize the skill of the model using the sample of model evaluation scores.
Importantly, each observation in the data sample is assigned to an individual group and stays in that
group for the duration of the procedure. This means that each sample is given the opportunity to be
used in the hold out set 1 time and used to train the model k-1 times.
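The hedged sketch below walks through the four steps above using scikit-learn's KFold; the synthetic dataset and the logistic-regression model are assumptions made for illustration.

```python
# k-fold cross-validation following the procedure above (dataset/model assumed).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=200, n_features=10, random_state=1)

kfold = KFold(n_splits=10, shuffle=True, random_state=1)       # steps 1-2: shuffle and split into k groups
scores = []
for train_idx, test_idx in kfold.split(X):                     # step 3: for each unique group
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])                      # fit on the k-1 training folds
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))  # hold-out score, model discarded

print(f"mean={np.mean(scores):.3f} std={np.std(scores):.3f}")  # step 4: summarize the skill scores
```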

This approach involves randomly dividing the set of observations into k groups, or folds, of
approximately equal size. The first fold is treated as a validation set, and the method is fit on the
remaining k − 1 folds.

— Page 181, An Introduction to Statistical Learning, 2013.


It is also important that any preparation of the data prior to fitting the model occur on the CV-assigned
training dataset within the loop rather than on the broader data set. This also applies to any tuning of
hyperparameters. A failure to perform these operations within the loop may result in data leakage and
an optimistic estimate of the model skill.
Despite the best efforts of statistical methodologists, users frequently invalidate their results by
inadvertently peeking at the test data.

— Page 708, Artificial Intelligence: A Modern Approach (3rd Edition), 2009.



The results of a k-fold cross-validation run are often summarized with the mean of the model skill
scores. It is also good practice to include a measure of the variance of the skill scores, such as the
standard deviation or standard error.

Configuration of k
The k value must be chosen carefully for your data sample.

A poorly chosen value for k may result in a misrepresentative idea of the skill of the model, such as a score with a high variance (that may change a lot based on the data used to fit the model) or a high bias (such as an overestimate of the skill of the model).

Three common tactics for choosing a value for k are as follows:

• Representative: The value for k is chosen such that each train/test group of data samples is large enough to be statistically representative of the broader dataset.
• k=10: The value for k is fixed to 10, a value that has been found through experimentation to generally result in a model skill estimate with low bias and a modest variance.
• k=n: The value for k is fixed to n, where n is the size of the dataset, giving each test sample an opportunity to be used in the hold-out dataset. This approach is called leave-one-out cross-validation.
The choice of k is usually 5 or 10, but there is no formal rule. As k gets larger, the difference in size between the training set and the resampling subsets gets smaller. As this difference decreases, the bias of the technique becomes smaller.

— Page 70, Applied Predictive Modeling, 2013.


A value of k=10 is very common in the field of applied machine learning, and is recommended if you are struggling to choose a value for your dataset.

To summarize, there is a bias-variance trade-off associated with the choice of k in k-fold cross-
validation. Typically, given these considerations, one performs k-fold cross-validation using k = 5 or k
= 10, as these values have been shown empirically to yield test error rate estimates that suffer neither
from excessively high bias nor from very high variance.

— Page 184, An Introduction to Statistical Learning, 2013.


If a value for k is chosen that does not evenly split the data sample, then one group will contain a remainder of the examples. It is preferable to split the data sample into k groups with the same number of samples, so that the samples of model skill scores are all equivalent.
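As a short illustration of these configurations, the sketch below compares k=10 with leave-one-out (k=n) using scikit-learn; the iris dataset and the decision-tree estimator are arbitrary choices for the example.

```python
# Comparing k=10 and leave-one-out cross-validation (dataset/estimator assumed).
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

ten_fold = cross_val_score(model, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0))
loocv = cross_val_score(model, X, y, cv=LeaveOneOut())   # k = n: one sample held out per fold

print("10-fold mean accuracy:", ten_fold.mean())
print("LOOCV mean accuracy:  ", loocv.mean())
```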

Learning Curves in Machine Learning


Generally, a learning curve is a plot that shows time or experience on the x-axis and learning or
improvement on the y-axis.

Learning curves (LCs) are deemed effective tools for monitoring the performance of workers exposed
to a new task. LCs provide a mathematical representation of the learning process that takes place as
task repetition occurs.

— Learning curve models and applications: Literature review and research directions, 2011.
For example, if you were learning a musical instrument, your skill on the instrument could be
evaluated and assigned a numerical score each week for one year. A plot of the scores over the 52
weeks is a learning curve and would show how your learning of the instrument has changed over
time.

• Learning Curve: Line plot of learning (y-axis) over experience (x-axis).


Learning curves are widely used in machine learning for algorithms that learn (optimize their internal
parameters) incrementally over time, such as deep learning neural networks.

The metric used to evaluate learning could be maximizing, meaning that better scores (larger
numbers) indicate more learning. An example would be classification accuracy.

It is more common to use a score that is minimizing, such as loss or error whereby better scores
(smaller numbers) indicate more learning and a value of 0.0 indicates that the training dataset was
learned perfectly and no mistakes were made.

During the training of a machine learning model, the current state of the model at each step of the
training algorithm can be evaluated. It can be evaluated on the training dataset to give an idea of how
well the model is “learning.” It can also be evaluated on a hold-out validation dataset that is not part of
the training dataset. Evaluation on the validation dataset gives an idea of how well the model is
“generalizing.”
• Train Learning Curve: Learning curve calculated from the training dataset that gives an idea of how well the model is learning.
• Validation Learning Curve: Learning curve calculated from a hold-out validation dataset that gives an idea of how well the model is generalizing.
It is common to create dual learning curves for a machine learning model during training on both the
training and validation datasets.

In some cases, it is also common to create learning curves for multiple metrics, such as in the case of
classification predictive modeling problems, where the model may be optimized according to cross-
entropy loss and model performance is evaluated using classification accuracy. In this case, two plots
are created, one for the learning curves of each metric, and each plot can show two learning curves,
one for each of the train and validation datasets.

• Optimization Learning Curves: Learning curves calculated on the metric by which the parameters of the model are being optimized, e.g. loss.
• Performance Learning Curves: Learning curves calculated on the metric by which the model will be evaluated and selected, e.g. accuracy.
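A minimal sketch of plotting dual learning curves from recorded per-epoch scores is shown below; the loss values are made-up numbers chosen only to illustrate a training curve that keeps improving while the validation curve flattens and then worsens.

```python
# Dual learning curves: train vs. validation loss over epochs (values assumed).
import matplotlib.pyplot as plt

epochs = range(1, 11)
train_loss = [0.90, 0.70, 0.55, 0.45, 0.38, 0.33, 0.29, 0.26, 0.24, 0.22]  # optimization curve on the training set
val_loss   = [0.92, 0.75, 0.62, 0.55, 0.51, 0.49, 0.48, 0.48, 0.49, 0.50]  # generalization on the hold-out set

plt.plot(epochs, train_loss, label="train loss")
plt.plot(epochs, val_loss, label="validation loss")
plt.xlabel("epoch (experience)")
plt.ylabel("loss (learning)")
plt.legend()
plt.show()
```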

Now that we are familiar with the use of learning curves in machine learning, let’s look at some
common shapes observed in learning curve plots.

Statistical Hypothesis Testing

A statistical hypothesis is a hypothesis that is testable on the basis of observed data modelled as the realised values taken by a collection of random variables.[1] A set of data is modelled as being realised values of a collection of random variables having a joint probability distribution in some set of possible joint distributions. The hypothesis being tested is exactly that set of possible probability distributions.
A statistical hypothesis test is a method of statistical inference. An alternative hypothesis is proposed for
the probability distribution of the data, either explicitly or only informally. The comparison of the two models
is deemed statistically significant if, according to a threshold probability—the significance level—the data
would be unlikely to occur if the null hypothesis were true. A hypothesis test specifies which outcomes of a
study may lead to a rejection of the null hypothesis at a pre-specified level of significance, while using a
pre-chosen measure of deviation from that hypothesis (the test statistic, or goodness-of-fit measure). The
pre-chosen level of significance is the maximal allowed "false positive rate". One wants to control the risk of
incorrectly rejecting a true null hypothesis.
The process of distinguishing between the null hypothesis and the alternative hypothesis is aided by
considering two types of errors. A Type I error occurs when a true null hypothesis is rejected. A Type II
error occurs when a false null hypothesis is not rejected.
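As a hedged example of these ideas in code, the sketch below runs a one-sample t-test with scipy.stats and compares the p-value against a pre-chosen significance level; the sample values and the hypothesized mean are assumptions for illustration.

```python
# One-sample t-test with a pre-chosen significance level (data assumed).
from scipy import stats

sample = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8, 5.1, 5.0]
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)   # H0: the population mean is 5.0

alpha = 0.05   # significance level: the maximal allowed false-positive (Type I error) rate
if p_value < alpha:
    print(f"p = {p_value:.3f} < {alpha}: reject the null hypothesis")
else:
    print(f"p = {p_value:.3f} >= {alpha}: fail to reject the null hypothesis")
```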
Hypothesis tests based on statistical significance are another way of expressing confidence intervals (more
precisely, confidence sets). In other words, every hypothesis test based on significance can be obtained via
a confidence interval, and every confidence interval can be obtained via a hypothesis test based on
significance.[2]
Significance-based hypothesis testing is the most common framework for statistical hypothesis testing. An
alternative framework for statistical hypothesis testing is to specify a set of statistical models, one for each
candidate hypothesis, and then use model selection techniques to choose the most appropriate model.[3] The most common selection techniques are based on either the Akaike information criterion (AIC) or the Bayesian information criterion (BIC).
