
UNIT I

INTRODUCTION

Learning – Types of Machine Learning – Supervised Learning – The Brain and the Neuron –
Design a Learning System – Perspectives and Issues in Machine Learning – Concept Learning Task
– Concept Learning as Search – Finding a Maximally Specific Hypothesis – Version Spaces and the
Candidate Elimination Algorithm – Linear Discriminants – Perceptron – Linear Separability –
Linear Regression.

Learning
Machine learning is a branch of artificial intelligence (AI) and computer science that focuses
on the use of data and algorithms to imitate the way humans learn, gradually improving
in accuracy.

Image recognition is a well-known and widespread example of machine learning in the real world.
It can identify an object in a digital image, based on the intensity of the pixels in black-and-white
or colour images. A real-world example of image recognition: labelling an X-ray as cancerous or
not.

Data is being produced and stored continuously (“big data”):

– science: genomics, astronomy, materials science, particle accelerators. . .

– sensor networks: weather measurements, traffic. . .

Data is not random; it contains structure that can be used to predict outcomes, or gain knowledge in
some way.

Ex: patterns of Amazon purchases can be used to recommend items.

It is more difficult to design algorithms for such tasks (compared to, say, sorting an array or
calculating a payroll). Such algorithms need data.

Ex: construct a spam filter, using a collection of email messages labelled as spam/not
spam.

Data mining: the application of ML methods to large databases

Ex of ML applications: fraud detection, medical diagnosis, speech or face recognition. . .

ML is programming computers using data (past experience) to optimize a performance criterion


ML relies on:

– Statistics: making inferences from sample data

– Numerical algorithms (linear algebra, optimization): optimize criteria, manipulate models.

– Computer science: data structures and programs that solve a ML problem efficiently.

Types of Machine Learning


The three machine learning types are supervised, unsupervised, and reinforcement learning.

Supervised learning
A training set of examples with the correct responses (targets) is provided and, based on this
training set, the algorithm generalises to respond correctly to all possible inputs. This is also called
learning from exemplars.

– Classification (pattern recognition):

1. Face recognition. Difficult because of the complex variability in the data: pose and illumination in
a face image, occlusions, glasses/beard/make-up/etc.

2. Optical character recognition: different styles, slant. . .

3. Medical diagnosis: often, variables are missing (tests are costly).

4. Speech recognition, machine translation, biometrics. . .

5. Credit scoring: classify customers into high- and low-risk, based on their income and savings,
using data about past loans (whether they were paid or not).

Unsupervised learning
No labels are provided, only input data.

Correct responses are not provided, but instead the algorithm tries to identify similarities between
the inputs so that inputs that have something in common are categorized together. The statistical
approach to unsupervised learning is known as density estimation.

1. Learning associations. Basket analysis: let p(Y|X) = “probability that a customer who buys
product X also buys product Y”, estimated from past purchases. If p(Y|X) is large (say 0.7),
associate “X → Y”. When someone buys X, recommend them Y.

2. Clustering: group similar data points.

3. Density estimation: where are data points likely to lie?

4. Dimensionality reduction: data lies in a low-dimensional manifold.

5. Feature selection: keep only useful features.

6. Outlier/novelty detection.
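
The basket-analysis rule can be illustrated with a minimal sketch. The purchase data, the function names, and the 0.7 threshold default below are hypothetical choices for illustration; p(Y|X) is estimated simply as the fraction of baskets containing X that also contain Y.

```python
from itertools import permutations

# Hypothetical toy purchase history: each basket is the set of items in one order.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread", "milk"},
    {"milk", "jam"},
    {"bread", "butter", "milk"},
]


def conditional_probability(baskets, x, y):
    """Estimate p(y | x) as (#baskets with both x and y) / (#baskets with x)."""
    with_x = [b for b in baskets if x in b]
    if not with_x:
        return 0.0
    return sum(1 for b in with_x if y in b) / len(with_x)


def associations(baskets, threshold=0.7):
    """Return rules (x, y, p) whose estimated p(y | x) is at least the threshold."""
    items = sorted(set().union(*baskets))
    rules = []
    for x, y in permutations(items, 2):
        p = conditional_probability(baskets, x, y)
        if p >= threshold:
            rules.append((x, y, p))
    return rules


for x, y, p in associations(baskets):
    print(f"{x} -> {y}  (estimated p = {p:.2f})")
```

With real data one would usually also require that X appears in enough baskets before trusting the estimate.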

Semi supervised learning: labels provided for some points only.

Reinforcement learning
This is somewhere between supervised and unsupervised learning. The algorithm gets told when the
answer is wrong, but does not get told how to correct it. It has to explore and try out different
possibilities until it works out how to get the answer right. Reinforcement learning is sometimes
called learning with a critic because of this monitor that scores the answer, but does not suggest
improvements.
Evolutionary learning
Biological evolution can be seen as a learning process: biological organisms adapt to improve their
survival rates and chance of having offspring in their environment. We’ll look at how we can model
this in a computer, using an idea of fitness, which corresponds to a score for how good the current
solution is.

Supervised learning
1. Learning a class from examples: two-class problems

We are given a training set of labeled examples (positive and negative) and want to learn a classifier
that we can use to predict unseen examples, or to understand the data.

Input representation: we need to decide what attributes (features) to use to describe the input patterns
(examples, instances). This implies ignoring other attributes as irrelevant.

Training set: $\mathcal{X} = \{(x_n, y_n)\}_{n=1}^{N}$, where $x_n \in \mathbb{R}^D$ is the nth input vector and $y_n \in \{0, 1\}$ its class
label.

• Hypothesis (model) class H: the set of classifier functions we will use. Ideally, the true class
distribution C can be represented by a function in H (exactly, or with a small error).

• Having selected H, learning the class reduces to finding an optimal h ∈ H. We don’t know the true
class regions C, but we can measure how well h approximates them by the empirical error:

$E(h; \mathcal{X}) = \sum_{n=1}^{N} I(h(x_n) \neq y_n)$ = number of misclassified instances,

where I(·) is 1 if its argument is true and 0 otherwise.
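
As a small illustration of the empirical error, the sketch below counts misclassified training instances for two candidate hypotheses. The 1-D training set and the threshold classifiers are hypothetical examples, not taken from the notes.

```python
def empirical_error(h, samples):
    """Count the training instances (x, y) that classifier h labels incorrectly."""
    return sum(1 for x, y in samples if h(x) != y)


# Hypothetical 1-D training set: the label is 1 when x is at least 5.
samples = [(1.0, 0), (2.5, 0), (4.0, 0), (5.5, 1), (7.0, 1), (9.0, 1)]


# Two candidate hypotheses from a class H of threshold classifiers.
def h_good(x):
    return int(x >= 5.0)


def h_bad(x):
    return int(x >= 3.0)


print(empirical_error(h_good, samples))  # no training instance is misclassified
print(empirical_error(h_bad, samples))   # x = 4.0 is misclassified
```

Learning then amounts to searching H for the hypothesis with the smallest such count.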


2. Noise

1. Noise is any unwanted anomaly in the data. It can be due to:

– Imprecision in recording the input attributes: xn.

– Errors in labeling the input vectors: yn.

– Attributes not considered that affect the label (hidden or latent attributes, may be
unobservable).

2. Noise makes learning harder.

3. Learning multiple classes

With K classes, we can code the label as an integer $y = k \in \{1, \dots, K\}$, or as a one-of-K binary
vector $y = (y_1, \dots, y_K)^T \in \{0, 1\}^K$ (containing a single 1 in position k).

• One approach for K-class classification: consider it as K two-class classification problems, and
minimize the total empirical error:

$E(\{h_k\}_{k=1}^{K}; \mathcal{X}) = \sum_{n=1}^{N} \sum_{k=1}^{K} I(h_k(x_n) \neq y_{nk})$

where $y_n$ is coded as one-of-K and $h_k$ is the two-class classifier for problem k, i.e., $h_k(x) \in \{0, 1\}$.

• Ideally, for a given pattern x only one hk(x) is one. When no, or more than one, hk(x) is one then
the classifier is in doubt and may reject the pattern.
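
The one-of-K coding, the total empirical error, and the reject option can be sketched as follows. The three threshold classifiers and the data are hypothetical, and the second classifier deliberately overlaps the third so that one pattern is rejected.

```python
def one_of_k(label, num_classes):
    """Encode an integer label k in {1, ..., K} as a one-of-K binary vector."""
    return [1 if k == label else 0 for k in range(1, num_classes + 1)]


def total_empirical_error(classifiers, samples, num_classes):
    """Sum over instances and classes of the disagreements between h_k(x) and y_k."""
    error = 0
    for x, label in samples:
        y = one_of_k(label, num_classes)
        error += sum(1 for h, y_k in zip(classifiers, y) if h(x) != y_k)
    return error


def predict_or_reject(classifiers, x):
    """Return class k if exactly one h_k(x) is 1; otherwise return None (reject)."""
    votes = [k for k, h in enumerate(classifiers, start=1) if h(x) == 1]
    return votes[0] if len(votes) == 1 else None


# Hypothetical 1-D problem with K = 3 classes.
classifiers = [
    lambda x: int(x < 2),
    lambda x: int(2 <= x < 4.5),  # overlaps class 3 on [4, 4.5) on purpose
    lambda x: int(x >= 4),
]
samples = [(1.0, 1), (3.0, 2), (4.2, 3), (5.0, 3)]

print(one_of_k(2, 3))
print(total_empirical_error(classifiers, samples, 3))
print(predict_or_reject(classifiers, 4.2))  # two classifiers fire, so the pattern is rejected
```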
4. Regression

• Training set $\mathcal{X} = \{(x_n, y_n)\}_{n=1}^{N}$ where the label for a pattern $x_n \in \mathbb{R}^D$ is a real value $y_n \in \mathbb{R}$. In
multivariate regression, $y_n \in \mathbb{R}^d$ is a real vector.

• We assume the labels were produced by applying an unknown function f to the instances, and we
want to learn (or estimate) that function using functions h from a hypothesis class H. Ex: H = class of
linear functions: $h(x) = w_0 + w_1 x_1 + \dots + w_D x_D = w^T x + w_0$.

• Interpolation: we learn a function h(x) that passes through each training pair (xn, yn) (no noise):
yn = h(xn), n = 1, . . . , N. Ex: polynomial interpolation (requires a polynomial of degree N − 1 with
N points in general position).

• Regression: we assume random additive noise ε: $y = h(x) + \epsilon$.

• Empirical error: $E(h; \mathcal{X}) = \frac{1}{N} \sum_{n=1}^{N} (y_n - h(x_n))^2$ = mean of the squared errors over the instances.
Other definitions of error are possible, e.g. absolute value instead of square.
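
To make the two error definitions concrete, here is a minimal sketch that evaluates a linear hypothesis h(x) = wT x + w0 on a small training set. The data and the weights are hypothetical values chosen for illustration.

```python
def linear_hypothesis(w, w0):
    """Return h(x) = w^T x + w0 for a fixed weight vector w and bias w0."""
    return lambda x: sum(wd * xd for wd, xd in zip(w, x)) + w0


def mean_squared_error(h, samples):
    """Average of (y - h(x))^2 over the training instances."""
    return sum((y - h(x)) ** 2 for x, y in samples) / len(samples)


def mean_absolute_error(h, samples):
    """Alternative error: average of |y - h(x)| over the training instances."""
    return sum(abs(y - h(x)) for x, y in samples) / len(samples)


# Hypothetical 1-D data lying close to the line y = 2x + 1.
samples = [((0.0,), 1.0), ((1.0,), 3.0), ((2.0,), 5.5), ((3.0,), 7.0)]
h = linear_hypothesis(w=(2.0,), w0=1.0)

print(mean_squared_error(h, samples))
print(mean_absolute_error(h, samples))
```

Only the third point deviates from the line (by 0.5), so both errors come entirely from that one residual.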

5. Model selection and generalization

Machine learning problems (classification, regression and others) are typically ill-posed: the
observed data is finite and does not uniquely determine the classification or regression function.

• For best generalization, we should match the complexity of the hypothesis class H with the
complexity of the function underlying the data:

– If H is less complex: underfitting. Ex: fitting a line to data generated from a cubic
polynomial.
– If H is more complex: overfitting. Ex: fitting a cubic polynomial to data generated from a
line.
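
The two examples can be tried numerically. The sketch below is an illustration under stated assumptions: the data is synthetic, the noise level and random seed are arbitrary, and the fits use NumPy's `np.polyfit` least-squares polynomial fitting.

```python
import numpy as np


def fit_and_score(x_train, y_train, x_test, y_test, degree):
    """Fit a polynomial of the given degree; return (training MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    test_mse = float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    return train_mse, test_mse


rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
x_train = np.linspace(-1, 1, 8)
x_test = np.linspace(-1, 1, 50)

# Underfitting: data generated from a cubic, hypothesis class of lines.
under_train, under_test = fit_and_score(x_train, x_train**3, x_test, x_test**3, degree=1)
exact_train, exact_test = fit_and_score(x_train, x_train**3, x_test, x_test**3, degree=3)

# Overfitting: noisy data generated from a line, hypothesis class of cubics.
y_line_train = 2 * x_train + 1 + rng.normal(0, 0.2, size=x_train.shape)
y_line_test = 2 * x_test + 1
line_train, line_test = fit_and_score(x_train, y_line_train, x_test, y_line_test, degree=1)
over_train, over_test = fit_and_score(x_train, y_line_train, x_test, y_line_test, degree=3)

print("line on cubic data :", under_train, under_test)
print("cubic on cubic data:", exact_train, exact_test)
print("line on line data  :", line_train, line_test)
print("cubic on line data :", over_train, over_test)
```

The cubic never has a larger training error than the line on the same data, because lines are a special case of cubics. Its test error is typically, though not always, worse on the noisy line data, which is the overfitting effect.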

In summary, in ML algorithms there is a tradeoff between 3 factors:

– the complexity c(H) of the hypothesis class

– the amount of training data

– the generalization error on new data

6. Outlier (novelty, anomaly) detection

• An outlier is an instance that is very different from the other instances in the sample. Reasons:

– Abnormal behaviour: fraud in credit card transactions, intruder in network traffic, etc.

– Recording error: faulty sensors, etc.

• Not usually cast as a two-class classification problem because there are typically few outliers and
they don’t fit a consistent pattern that can be easily learned.

• Instead, “one-class classification”: fit a density p(x) to non-outliers, then consider x as an outlier if
p(x) < θ for some threshold θ > 0 (low-probability instance).

• We can also identify outliers as points that are far away from other samples.
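
A minimal sketch of the one-class approach is given below. It assumes, purely for illustration, that the non-outliers follow a 1-D Gaussian; the sensor readings and the threshold θ = 0.01 are hypothetical.

```python
import math
from statistics import mean, stdev


def fit_gaussian(values):
    """Fit a 1-D Gaussian density to the non-outlier sample: return (mu, sigma)."""
    return mean(values), stdev(values)


def gaussian_density(x, mu, sigma):
    """Evaluate the Gaussian density p(x) with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def is_outlier(x, mu, sigma, theta):
    """Flag x as an outlier when its density falls below the threshold theta."""
    return gaussian_density(x, mu, sigma) < theta


# Hypothetical sensor readings assumed to contain no outliers.
normal_readings = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
mu, sigma = fit_gaussian(normal_readings)
theta = 0.01  # hypothetical threshold

print(is_outlier(10.1, mu, sigma, theta))  # close to the bulk of the data
print(is_outlier(15.0, mu, sigma, theta))  # far from every training reading
```

In practice the threshold would be tuned, for example on a validation set containing a few known outliers.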

The Brain and the Neuron

Page No(39-43)

Design a Learning System


This section briefly examines the process by which machine learning algorithms can be selected, applied, and
evaluated for a problem.

Data Collection and Preparation

Throughout this book we will be in the fortunate position of having
datasets readily available for downloading and using to test the algorithms. This is, of course, less
commonly the case when the desire is to learn about some new problem, when either the data has to
be collected from scratch, or at the very least, assembled and prepared. In fact, if the problem is
completely new, so that appropriate data can be chosen, then this process should be merged with the
next step of feature selection, so that only the required data is collected. This can typically be done
by assembling a reasonably small dataset with all of the features that you believe might be useful,
and experimenting with it before choosing the best features and collecting and analysing the full
dataset.

For supervised learning, target data is also needed, which can require the involvement of experts in
the relevant field and significant investments of time.
Finally, the quantity of data needs to be considered. Machine learning algorithms need significant
amounts of data, preferably without too much noise, but with increased dataset size comes increased
computational costs, and the sweet spot at which there is enough data without excessive
computational overhead is generally impossible to predict.

Feature Selection

An example of this part of the process was given in Section 1.4.2 when we looked at possible
features that might be useful for coin recognition. It consists of identifying the features that are most
useful for the problem under examination. This invariably requires prior knowledge of the problem
and the data; our common sense was used in the coins example above to identify some potentially
useful features and to exclude others.

As well as the identification of features that are useful for the learner, it is also necessary that the
features can be collected without significant expense or time, and that they are robust to noise and
other corruption of the data that may arise in the collection process.

Algorithm Choice

Given the dataset, the choice of an appropriate algorithm (or algorithms) is what this book should
be able to prepare you for, in that the knowledge of the underlying principles of each algorithm and
examples of their use is precisely what is required for this.

Parameter and Model Selection

For many of the algorithms there are parameters that have to be set manually, or that require
experimentation to identify appropriate values. These requirements are discussed at the appropriate
points of the book.

Training

Given the dataset, algorithm, and parameters, training should be simply the use of computational
resources in order to build a model of the data in order to predict the outputs on new data.

Evaluation

Before a system can be deployed it needs to be tested and evaluated for accuracy on data that it was
not trained on. This can often include a comparison with human experts in the field, and the selection
of appropriate metrics for this comparison.

Perspectives and Issues in Machine Learning


• Machine learning is a subfield of artificial intelligence, and machine learning algorithms are used
in related fields such as natural language processing and computer vision.
• In general, there are three types of learning: supervised learning, unsupervised
learning, and reinforcement learning.
• Their names convey the main idea behind each of them.
• In supervised learning, the system learns under the supervision of the data outputs, so
supervised algorithms are preferred if the dataset contains output information.
• For example, suppose a medical statistics company has a dataset that contains
patients’ features such as blood pressure, blood sugar level, and heart rate per minute.
ISSUES:
LACK OF QUALITY DATA
• One of the main issues in machine learning is the absence of good data.
• Meanwhile, developers tend to spend most of their time on the algorithms themselves.
FAULT IN CREDIT CARD FRAUD DETECTION
• Although AI-driven software helps to detect credit card fraud, there are issues
in machine learning that can undermine the process.
GETTING BAD RECOMMENDATIONS
• Recommendation engines are common today.
• While some are dependable, others do not provide the results users need.
TALENT DEFICIT
• Although many people are drawn to the ML industry, there are still few
experts who fully master the technology.
MAKING THE WRONG ASSUMPTIONS
• Many ML models cannot handle datasets that contain missing data points.
• Thus, features with a large proportion of missing values should be removed.

Concept Learning Task


In a concept learning task, a learner is shown a set of example objects
along with their class labels and must learn to classify objects.

The main goal is to find the hypothesis that best fits the training data set.

The points below give a simplified overview.

1. X — The set of items over which the concept is defined is called the set of instances, which we
denote by X. In the current example, X is the set of all possible days, each represented by the
attributes Sky, AirTemp, Humidity, Wind, Water, and Forecast.

2. c — The concept or function to be learned is called the target concept, which we denote by c. In
general, c can be any boolean valued function defined over the instances X; that is, c: X → {0, 1}.
In the current example, the target concept corresponds to the value of the attribute EnjoySport
(i.e, c(x)=1 if EnjoySport=Yes, and c(x)=0 if EnjoySport= No).
3. (x, c(x)) — When learning the target concept, the learner is presented with a set of training
examples, each consisting of an instance x from X, along with its target concept value c(x).
Instances for which c(x) = 1 are called positive examples and instances for which c(x) = 0 are
called negative examples. We will often write the ordered pair (x, c(x)) to describe the training
example consisting of the instance x and its target concept value c(x).

4. D — We use the symbol D to denote the set of available training examples.

5. H — Given a set of training examples of the target concept c, the problem faced by the learner is
to hypothesize, or estimate, c. We use the symbol H to denote the set of all possible hypotheses
that the learner may consider regarding the identity of the target concept.

6. h(x) — In general, each hypothesis h in H represents a Boolean-valued function defined over X;
that is, h : X → {0, 1}. The goal of the learner is to find a hypothesis h such that h(x) = c(x) for all
x in X.

Note that the learner only observes the training examples, so in practice the objective of the learning
algorithm is to find a hypothesis h in H such that h(x) = c(x) for all x in D.

An inductive learning algorithm tries to induce a “general rule” from a set of observed
instances. Concept learning is an instance of inductive learning: the algorithm tries to find
a hypothesis h (general rule) in H such that h(x) = c(x) for all x in D. For a given collection of
examples, a learning algorithm in reality returns a function h (hypothesis) that approximates c (target
concept). The expectation, however, is that it returns a function h that equals
c on the training data, i.e. h(x) = c(x) for all x in D.

Maximally Specific Hypothesis


Let X denote the instances and H the hypotheses in the EnjoySport learning task. Let us compute the
number of distinct instances and hypotheses in X and H respectively. Our hypothesis h is a vector of six
constraints, specifying the values of the six attributes <Sky, AirTemp, Humidity, Wind,
Water, Forecast>. In this hypothesis representation, each attribute can take either “?” (any value is
acceptable) or “Ø” (no value is acceptable) in addition to its defined values.

The number of combinations is 5×4×4×4×4×4 = 5120 syntactically distinct hypotheses. They are
syntactically distinct but not all semantically distinct. For example, any two hypotheses that contain a
“Ø” constraint look different but say the same thing: both classify every instance as negative.
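
The counts can be checked with a few lines of arithmetic. The sketch assumes that Sky has three defined values and each of the other five attributes has two, which is what the factors 5×4×4×4×4×4 imply once “?” and “Ø” are added to each attribute. The semantic count follows from the observation that every hypothesis containing a “Ø” classifies all instances as negative, so all of them collapse into one.

```python
from math import prod

# Assumed number of defined values per attribute (Sky: 3, the others: 2 each).
values_per_attribute = {
    "Sky": 3, "AirTemp": 2, "Humidity": 2,
    "Wind": 2, "Water": 2, "Forecast": 2,
}

# Every instance picks one defined value per attribute.
num_instances = prod(values_per_attribute.values())

# Syntactically, each constraint is a defined value, "?" or "Ø".
num_syntactic = prod(n + 2 for n in values_per_attribute.values())

# Semantically, each constraint is a defined value or "?",
# plus one single hypothesis standing for "always negative".
num_semantic = 1 + prod(n + 1 for n in values_per_attribute.values())

print(num_instances, num_syntactic, num_semantic)
```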

Let us formalize the general-to-specific ordering of hypotheses. That hypotheses h1 and h2 both classify an
instance x as positive can be written as h1(x) = 1 and h2(x) = 1. Hypothesis h2 is more general than or
equal to h1 if, for every instance x in X, h1(x) = 1 implies h2(x) = 1.
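
The ordering can be tested by brute force over the instance space. In this sketch the tuple representation, the function names, and the attribute values “Cloudy” and “Weak” are assumptions made for illustration; they do not appear in the notes.

```python
from itertools import product

ANY, NONE = "?", "Ø"


def matches(h, x):
    """h(x) = 1 when every constraint is '?' or equals the attribute value."""
    return all(c == ANY or c == v for c, v in zip(h, x))


def more_general_or_equal(h2, h1, instances):
    """True when h1(x) = 1 implies h2(x) = 1 for every instance x."""
    return all(matches(h2, x) for x in instances if matches(h1, x))


# Assumed attribute domains for the EnjoySport task.
domains = [
    ("Sunny", "Cloudy", "Rainy"), ("Warm", "Cold"), ("Normal", "High"),
    ("Strong", "Weak"), ("Warm", "Cool"), ("Same", "Change"),
]
instances = list(product(*domains))

h1 = ("Sunny", ANY, ANY, "Strong", ANY, ANY)
h2 = ("Sunny", ANY, ANY, ANY, ANY, ANY)
h_none = (NONE,) * 6  # covers no instance at all

print(more_general_or_equal(h2, h1, instances))      # h2 covers everything h1 covers
print(more_general_or_equal(h1, h2, instances))      # h1 misses the Sunny days with weak wind
print(more_general_or_equal(h1, h_none, instances))  # vacuously true
```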

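Having introduced the general-to-specific ordering of hypotheses, we can now use
this partial ordering to organize the search for a hypothesis that is consistent with the observed
training examples. One way is to begin with the most specific possible hypothesis in H, then
generalize this hypothesis each time it fails to cover an observed positive training example. The FIND-S
algorithm is used for this purpose. Here are the steps of the FIND-S algorithm:

1. Initialize h to the most specific hypothesis in H.

2. For each positive training example x, and for each attribute constraint in h: if the constraint is
satisfied by x, do nothing; otherwise replace it in h by the next more general constraint that is
satisfied by x.

3. Output the hypothesis h.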

To illustrate this algorithm, assume the learner is given the following sequence of training examples from the
EnjoySport task.

1. The first step of FIND-S is to initialize h to the most specific hypothesis in H: h0 = <Ø, Ø, Ø, Ø,
Ø, Ø>.

2. First training example x1 = <Sunny, Warm, Normal, Strong, Warm, Same>, EnjoySport = +ve.
Observing the first training example, it is clear that hypothesis h is too specific. None of the “Ø”
constraints in h are satisfied by this example, so each is replaced by the next more general
constraint that fits the example: h1 = <Sunny, Warm, Normal, Strong, Warm, Same>.

3. Consider the second training example x2 = <Sunny, Warm, High, Strong, Warm, Same>,
EnjoySport = +ve. The second training example forces the algorithm to further generalize h, this
time substituting a “?” in place of any attribute value in h that is not satisfied by the new example.
Now h2 = <Sunny, Warm, ?, Strong, Warm, Same>.

4. Consider the third training example x3 = <Rainy, Cold, High, Strong, Warm,
Change>, EnjoySport = -ve. The FIND-S algorithm simply ignores every negative example, so
the hypothesis remains as before: h3 = <Sunny, Warm, ?, Strong, Warm, Same>.

5. Consider the fourth training example x4 = <Sunny, Warm, High, Strong, Cool, Change>, EnjoySport
= +ve. The fourth example leads to a further generalization of h: h4 = <Sunny, Warm, ?,
Strong, ?, ?>.

6. So the final hypothesis is <Sunny, Warm, ?, Strong, ?, ?>.
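
The walkthrough can be reproduced with a short sketch of FIND-S. The tuple representation of hypotheses and the function name `find_s` are illustrative choices; the four training examples are the ones used in the walkthrough.

```python
def find_s(examples, num_attributes):
    """Return the maximally specific hypothesis covering all positive examples."""
    h = ["Ø"] * num_attributes  # most specific hypothesis: covers nothing
    for x, is_positive in examples:
        if not is_positive:
            continue  # FIND-S ignores negative examples
        for i, value in enumerate(x):
            if h[i] == "Ø":
                h[i] = value  # first positive example: copy its value
            elif h[i] != value:
                h[i] = "?"    # conflicting value: generalize to "any value"
    return tuple(h)


# The four EnjoySport training examples from the walkthrough.
examples = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]

final_h = find_s(examples, num_attributes=6)
print(final_h)
```

Running it yields the same final hypothesis as step 6 of the walkthrough.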


The search begins (h0) with the most specific hypothesis in H, then considers increasingly
general hypotheses (h1 through h4) as mandated by the training examples. The search moves from
hypothesis to hypothesis, searching from the most specific to progressively more general hypotheses
along one chain of the partial ordering. At each step, the hypothesis is generalized only as far as
necessary to cover the new positive example. Therefore, at each stage the hypothesis is the most
specific hypothesis consistent with the training examples observed up to this point.

Linear Discriminants – Perceptron – Linear Separability – Linear Regression.
Linear Discriminants
Linear Discriminant Analysis (also called Normal Discriminant Analysis or Discriminant Function
Analysis) is a dimensionality reduction technique that is commonly used for supervised
classification problems. It is used for modelling differences between groups, i.e. separating two or more
classes.

• Consider learning a classifier given a sample $\{(x_n, y_n)\}_{n=1}^{N}$ where $x_n \in \mathbb{R}^D$ and $y_n \in \{1, \dots, K\}$.

• Classification works as follows:

– Training time: learn a set of discriminant functions $\{g_k(x)\}_{k=1}^{K}$.

– Test time: given a new instance x, choose $C_k$ if $k = \arg\max_{i=1,\dots,K} g_i(x)$.

• Two approaches to learning discriminant functions:

– Generative approach: we learn $p(x|C_k)$ and $p(C_k)$ for each class from the training data,
and then use $g_k(x) = p(C_k|x) \propto p(x|C_k)\,p(C_k)$ (from Bayes’ rule) to predict a class. Hence, besides
learning the class boundaries (where $p(C_i|x) = p(C_j|x)$ for $i \neq j$), we also model the density of each
class. This was the approach of previous chapters, using parametric and nonparametric methods for $p(x|C_k)$.

– Discriminative approach: we learn only the class boundaries, through discriminant
functions $g_k(x)$, k = 1, . . . , K. We don’t learn the class densities, and $g_k(x)$ need not be modeled
using probabilities. Hence, it requires assumptions only about the class boundaries but not about
their densities. It solves a simpler problem. This is the approach of this and future chapters, using linear
and nonlinear discriminant functions.

Linear discriminants are simpler than nonlinear ones:

– Faster to train.

– Low space and time complexity at test time: O(D) per class, to store $w_k$ and multiply
by it.

– Simple to interpret: the output is a weighted sum of the features $x_d$.

∗ Magnitude of $w_{kd}$: importance of $x_d$ in the decision.

∗ Sign of $w_{kd}$: whether the effect of $x_d$ is positive or negative.

– Accurate enough in many applications.

• In practice, try linear discrimination first, before trying nonlinear discrimination.
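
Test-time classification with linear discriminants can be sketched as below. The weights and biases are hypothetical hand-picked values for K = 3 classes in D = 2 dimensions; in practice they would be learned from the training sample.

```python
def discriminant(w, w0, x):
    """Linear discriminant g_k(x) = w_k^T x + w_k0, an O(D) computation."""
    return sum(wd * xd for wd, xd in zip(w, x)) + w0


def classify(weights, biases, x):
    """Choose the class (numbered from 1) whose discriminant value is largest."""
    scores = [discriminant(w, w0, x) for w, w0 in zip(weights, biases)]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index + 1


# Hypothetical parameters: one weight vector and one bias per class.
weights = [(1.0, 0.0), (0.0, 1.0), (-1.0, -1.0)]
biases = [0.0, 0.0, 0.0]

print(classify(weights, biases, (2.0, 0.5)))
print(classify(weights, biases, (0.1, 3.0)))
print(classify(weights, biases, (-2.0, -1.0)))
```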

Two classes
• One discriminant function is sufficient: $g(x) = g_1(x) - g_2(x) = w^T x + w_0$. Testing: choose $C_1$ if
g(x) > 0 and $C_2$ if g(x) < 0.

• This defines a hyperplane where w is the weight vector and w0 the threshold (or bias). It divides
the input space R D into two half-spaces, the decision regions R1 for C1 (positive side) and R2 for
C2 (negative side). The hyperplane itself is the boundary or decision surface.

• The origin x = 0 is on the positive side if $w_0 > 0$, on the boundary if $w_0 = 0$, and on the negative side if $w_0 < 0$.

• w is orthogonal to the hyperplane. Pf. Pick x, y on the hyperplane; then $w^T(x - y) = g(x) - g(y) = 0$.

• The signed distance from $x \in \mathbb{R}^D$ to the hyperplane is $r = g(x)/\|w\|$. Pf. Write $x = x_p + r\,\frac{w}{\|w\|}$
where $x_p$ = orthogonal projection of x on the hyperplane and compute g(x). The signed distance of
the origin to the hyperplane is $r_0 = w_0/\|w\|$.

• So w determines the orientation of the hyperplane and $w_0$ its location with respect to the origin.
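
The geometry of the two-class discriminant can be checked numerically. The weight vector w = (3, 4) and threshold w0 = −5 below are hypothetical values chosen so that ‖w‖ = 5.

```python
import math


def g(w, w0, x):
    """Two-class linear discriminant g(x) = w^T x + w0."""
    return sum(wd * xd for wd, xd in zip(w, x)) + w0


def signed_distance(w, w0, x):
    """Signed distance r = g(x) / ||w|| from x to the hyperplane g(x) = 0."""
    norm = math.sqrt(sum(wd * wd for wd in w))
    return g(w, w0, x) / norm


def choose_class(w, w0, x):
    """Choose C1 on the positive side, C2 on the negative side, None on the boundary."""
    value = g(w, w0, x)
    if value > 0:
        return "C1"
    if value < 0:
        return "C2"
    return None


w, w0 = (3.0, 4.0), -5.0

print(signed_distance(w, w0, (0.0, 0.0)))  # origin: r0 = w0 / ||w||
print(signed_distance(w, w0, (3.0, 4.0)))  # a point on the positive side
print(signed_distance(w, w0, (1.0, 0.5)))  # a point lying on the hyperplane
print(choose_class(w, w0, (0.0, 0.0)))     # w0 < 0, so the origin is on the negative side
```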

Perceptron
The Perceptron is nothing more than a collection of McCulloch and Pitts neurons together with a set
of inputs and some weights to connect the inputs to the neurons. On the left of the figure, shaded in
light grey, are the input nodes. These are not neurons, they are just a nice schematic way of showing
how values are fed into the network.

FIGURE The Perceptron network, consisting of a set of input nodes (left) connected to McCulloch
and Pitts neurons using weighted connections.

The input nodes are almost always drawn as circles, just like neurons, which is rather confusing, so I’ve shaded
them a different colour. The neurons are shown on the right, and you can see both the additive part
(shown as a circle) and the thresholder. In practice nobody bothers to draw the thresholder
separately; you just need to remember that it is part of the neuron.

The Perceptron Learning Algorithm


It might be the first time that you have seen an algorithm written out like this, and it could be hard to
see how it can be turned into code. Equally, it might be difficult to believe that something as simple
as this algorithm can learn something. The only way to fix these things is to work through the
algorithm by hand on an example or two, and to try to write the code and then see if it does what is
expected. We will do both of those things next, first working through a simple example by hand.
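
As one way of turning the algorithm into code, here is a minimal sketch of Perceptron training for a single neuron. It is an illustration under stated assumptions rather than the book's own listing: the bias is handled as an extra weight on a constant input of 1, the learning rate eta = 0.25 is arbitrary, and the logical OR function is used as a toy dataset.

```python
def predict(weights, x):
    """Threshold neuron: output 1 if the weighted sum (with bias) is positive."""
    total = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    return 1 if total > 0 else 0


def train_perceptron(samples, eta=0.25, max_epochs=100):
    """Repeat the update w <- w + eta * (target - output) * input until no mistakes."""
    weights = [0.0] * (len(samples[0][0]) + 1)  # weights[0] is the bias weight
    for _ in range(max_epochs):
        mistakes = 0
        for x, target in samples:
            error = target - predict(weights, x)
            if error != 0:
                mistakes += 1
                weights[0] += eta * error  # the bias input is fixed at 1
                for i, xi in enumerate(x, start=1):
                    weights[i] += eta * error * xi
        if mistakes == 0:
            break  # every training example is classified correctly
    return weights


# Toy dataset (assumption): the logical OR of two binary inputs.
or_samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights = train_perceptron(or_samples)

for x, target in or_samples:
    print(x, target, predict(weights, x))
```

Because OR is linearly separable, the loop stops after a few passes with all four inputs classified correctly.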

LINEAR SEPARABILITY
Linear separability is the concept wherein the separation of input space into regions is based on
whether the network response is positive or negative. A decision line is drawn to separate positive
and negative responses.

The inner product of two vectors is computed by multiplying each element of the first vector by the matching
element of the second and adding them all together. As you might remember from high school,
$a \cdot b = \|a\|\|b\| \cos\theta$, where θ is the angle between a and b and ‖a‖ is the length of the vector a.
So the inner product computes a function of the angle between the two vectors, scaled by their lengths.
It can be computed in NumPy using the np.inner() function.

Getting back to the Perceptron, the boundary case is where we find an input vector x1 that has x1 ·
wT = 0. Now suppose that we find another input vector x2 that satisfies x2 · wT = 0. Putting these
two equations together we get:

x1 · wT = x2 · wT

⇒ (x1 − x2) · wT = 0. (3.17)

What does this last equation mean? In order for the inner product to be 0, either ‖a‖ or ‖b‖ or cos θ
needs to be zero. There is no reason to believe that ‖a‖ or ‖b‖ should be 0, so cos θ = 0. This means
that θ = π/2 (or −π/2), which means that the two vectors are at right angles to each other. Here the two
vectors are x1 − x2, which lies along the decision boundary, and the weight vector w.
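
The argument can be verified numerically with `np.inner()`. The weight vector and the two boundary points below are hypothetical values chosen so that both inner products with w are zero.

```python
import numpy as np

w = np.array([2.0, -1.0])   # hypothetical weight vector
x1 = np.array([1.0, 2.0])   # on the boundary: x1 . w = 0
x2 = np.array([3.0, 6.0])   # on the boundary: x2 . w = 0


def angle_between(a, b):
    """Angle theta recovered from a . b = ||a|| ||b|| cos(theta)."""
    cos_theta = np.inner(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(np.arccos(np.clip(cos_theta, -1.0, 1.0)))


print(np.inner(x1, w), np.inner(x2, w))  # both boundary conditions hold
print(np.inner(x1 - x2, w))              # so the difference is orthogonal to w
theta = angle_between(x1 - x2, w)
print(theta, np.pi / 2)                  # the angle is a right angle
```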

LINEAR REGRESSION
Linear Regression is a machine learning algorithm based on supervised learning. It performs a
regression task: it models a target prediction value based on independent variables. It is mostly
used for finding out the relationship between variables and for forecasting. Classification can also be
cast as regression by using an indicator variable as the target, which is 1 for examples in the class and 0 for all
of the others. Since classification can be replaced by regression using these methods, we’ll think about
regression here.

The only real difference between the Perceptron and more statistical approaches is in the way that the
problem is set up. For regression we are making a prediction about an unknown value y (such as the
indicator variable for classes or a future value of some data) by computing some function of known
values xi. We are thinking about straight lines, so the output y is going to be a sum of the xi values,
each multiplied by a constant parameter: $y = \sum_{i=0}^{M} \beta_i x_i$. The βi define a straight line (plane in 3D,
hyperplane in higher dimensions) that goes through (or at least near) the datapoints.

The least-squares solution for the parameters is $\beta = (X^T X)^{-1} X^T t$,
where t is a column vector containing the targets and X is the matrix of input values (even including
the bias inputs), just as for the Perceptron.
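
A least-squares fit of the parameters β can be sketched with NumPy. The data is hypothetical and generated exactly from y = 1 + 2·x1 − 3·x2, the bias input is assumed to be a column of ones, and `np.linalg.lstsq` is used as a numerically stable way to solve the least-squares problem.

```python
import numpy as np

# Hypothetical inputs (5 datapoints, 2 features) and noise-free targets.
inputs = np.array([
    [0.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [2.0, 1.0],
])
t = 1.0 + 2.0 * inputs[:, 0] - 3.0 * inputs[:, 1]

# X includes the bias input as a first column of ones (assumption).
X = np.hstack([np.ones((inputs.shape[0], 1)), inputs])

# Solve min over beta of ||X beta - t||^2.
beta, *_ = np.linalg.lstsq(X, t, rcond=None)
predictions = X @ beta

print(beta)         # close to the generating parameters (1, 2, -3)
print(predictions)  # close to the targets t
```

Since the targets contain no noise, the fitted hyperplane passes through every datapoint; with noisy targets it would pass near them instead.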
