
Soft Computing

2
WHAT IS SOFT COMPUTING?

Usually the primary considerations of traditional (“hard”)


computing are:
 precision,
 certainty, and
 rigor
The challenge is to exploit the tolerance for imprecision by
devising methods of computation that lead to an acceptable,
approximate solution at low cost.

The principal notion in soft computing is that precision and


certainty carry a cost; and that computation, reasoning, and
decision-making should exploit the tolerance for imprecision,
uncertainty, approximate reasoning, and partial truth for
obtaining low-cost solutions.
3
WHAT IS SOFT COMPUTING?

 Soft computing is a consortium of methodologies that


provide flexible information processing capability for
handling real-life ambiguous situations.

 This leads to the remarkable human ability of understanding


distorted speech, deciphering sloppy handwriting,
comprehending the nuances of natural language,
summarizing text, recognizing and classifying images,
driving a vehicle in dense traffic, and, more generally, making
rational decisions in an environment of uncertainty and
imprecision.

4
WHAT IS SOFT COMPUTING?

 Soft Computing became a formal Computer Science area of study in


the early 1990's.

 Earlier computational approaches could model and precisely analyse


only relatively simple systems.

 More complex systems arising in biology, medicine,


humanities, management sciences, and similar fields often
remained intractable to conventional mathematical and
analytical methods.

 Soft computing deals with imprecision, uncertainty, partial


truth, and approximation to achieve tractability, robustness and
low solution cost.

5
WHAT IS SOFT COMPUTING?

 Generally speaking, soft computing techniques resemble biological


processes more closely than traditional techniques, which are largely
based on formal logical systems, such as sentential
logic and predicate logic, or rely heavily on computer-aided
numerical analysis (as in finite element analysis).

 Soft computing techniques are intended to complement each other.

 Unlike hard computing schemes, which strive for exactness


and full truth, soft computing techniques exploit the given
tolerance of imprecision, partial truth, and uncertainty for a
particular problem.

6
Soft Computing
 The main constituents of soft computing include:
 fuzzy logic
 neural networks
 genetic algorithms
 rough sets and
 signal processing tools such as wavelets.

 Each of them contributes a distinct methodology for addressing


problems in its domain.

 An intelligent and robust system that provides


 human-interpretable,
 low-cost &
 approximate solution
7
Soft Computing
There are ongoing efforts to integrate artificial neural networks
(ANNs), fuzzy set theory, genetic algorithms (GAs), rough set
theory and other methodologies in the soft computing
paradigm.

Hybridizations exploiting the characteristics of these theories


include:
 neuro-fuzzy,
rough-fuzzy,
neuro-genetic,
fuzzy-genetic,
neuro-rough,
rough-neuro-fuzzy
approaches.
8
Soft Computing
 Fuzzy sets provide a natural framework for dealing with
uncertainty and imprecise data, and are suitable for handling
issues related to understanding patterns in incomplete and
noisy data.

 Neural networks are robust and exhibit good learning and


generalization capabilities in data-rich environments.

 Genetic algorithms (GAs) provide efficient search


algorithms to optimally select a model, from mixed media
data, based on some preference criterion or objective
function.

9
Soft Computing
 Rough sets are suitable for handling different types of
uncertainty in data.
Neural networks and rough sets are widely used for classification and rule
generation.

 Application of wavelet-based signal processing techniques


is new in the area of soft computing.
 Wavelet transformation of a signal results in
decomposition of the original signal in different multi-
resolution sub-bands.
 This is useful in dealing with compression and
retrieval of data, particularly images.

10
Relevance?

11
Machine Learning

Arthur Samuel (1959):

Field of study that gives computers the ability to learn without


being explicitly programmed

Tom Mitchell (1998):

A computer program is said to learn from experience E w.r.t.


some task T and some performance measure P, if its
performance on T, as measured by P, improves with
experience E

12
Machine Learning:
An Indispensable Tool in
Bioinformatics

13
Introduction
 The development of high-throughput data acquisition
technologies in biological sciences in the last 5 to 10 years, together
with advances in digital storage, computing, and information
and communication technologies in the 1990s, has begun to
transform biology from a data-poor into a data-rich science.

 This phenomenon is gradually transforming biology from classic


hypothesis-driven approaches, in which a single answer to a single
question is provided, to a data-driven research, in which many
answers are given at a time and we have to seek the hypothesis that
best explains them.

 As a reaction to the exponential growth in the amount of biological


data to handle, the discipline of bioinformatics stores, retrieves,
analyzes and assists in understanding biological information.
14
Introduction (Cont.)
 The development of methods for the analysis of this massive (and
constantly increasing) amount of information is one of the key
challenges in bioinformatics.

 This analysis step – also known as computational biology – faces the


challenge of extracting biological knowledge from all the in-house and
publicly available data.

 Furthermore, the knowledge should be formulated in a transparent


and coherent way if it is to be understood and studied by bio-experts.

 The term “data mining” in bioinformatics refers to the set of


techniques aimed at discovering useful relationships and patterns in
biological data that were previously undetected.

15
Data mining techniques provide a robust means to evaluate the
generalization power of extracted patterns on unseen data, although
these must be further validated and interpreted by the domain
expert.
16
Machine Learning
 Machine learning methods are essentially computer programs that
make use of sampled data or past experience information to
provide solutions to a given problem.

 A wide spectrum of algorithms, commonly based on the artificial


intelligence and statistics fields, have been proposed by the machine
learning community in the last decades.

 Machine learning is able to deal with the huge volumes of data


generated by novel high-throughput devices, in order to extract
hidden relationships that exist and that are not noticeable to
experts.

 As new data and novel concept types are generated every day in
molecular biology research, it is essential to apply techniques able to fit
this fast-evolving nature - Machine learning can be adapted efficiently
to these changing environments.
17
Machine Learning (Cont.)
 Machine learning is able to deal with the abundance of missing and
noisy data from many biological scenarios.

 In several biological scenarios, experts can only specify input–


output data pairs, and they are not able to describe the general
relationships between the different features that could serve to
further describe how they interrelate.

 Machine learning is able to adjust its internal structure to the existing


data, producing approximate models and results.
 Machine learning methods are used to investigate the
underlying mechanisms and the interactions between
biological molecules in many diseases.
 They are also essential for the biomarker discovery process.
18
Machine Learning (Cont.)
 Mainly due to the availability of novel types of biology throughput
data, the set of biology problems on which machine learning is
applied is constantly growing.

 Two practical realities severely condition many bioinformatics


applications:

 a limited number of samples, and


 several thousands of features characterizing each sample

 The development of machine learning techniques capable of


dealing with these problems is currently a challenge for the
bioinformatics community.

19
Machine Learning Applications

20
Machine Learning (Cont.)
Machine learning algorithms have been taxonomized in the following way:

• Supervised learning:
 Starting from a database of training data that consists of pairs of
input cases and desired outputs, its goal is to construct a function
(or model) to accurately predict the target output of future cases
whose output value is unknown.

 When the target output is a continuous-value variable, the task is


known as regression.

 Otherwise, when the output (or label) is defined as a finite set of


discrete values, the task is known as classification.
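
A minimal sketch to make the distinction concrete (assuming scikit-learn is available; the data and model choices below are illustrative, not part of the slides):

import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

X = np.array([[0.1], [0.4], [0.5], [0.9]])   # input cases
y_class = np.array([0, 0, 1, 1])             # discrete labels -> classification
y_reg = np.array([1.2, 2.1, 2.6, 3.9])       # continuous targets -> regression

clf = LogisticRegression().fit(X, y_class)
reg = LinearRegression().fit(X, y_reg)
print(clf.predict([[0.7]]))   # predicted class of an unseen case
print(reg.predict([[0.7]]))   # predicted continuous value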

21
• Unsupervised learning/ Clustering:
 Starting from a database of training data that consists of input
cases, its goal is to partition the training samples into subsets
(clusters) so that the data in each cluster show a high level of
proximity.
 In contrast to supervised learning, the labels for the data are not
used or are not available in clustering.

• Semi-supervised learning:
 Starting from a database of training data that combines both
labeled and unlabeled examples, the goal is to construct a model
able to accurately predict the target output of future cases for
which its output value is unknown.
 Typically, this database contains a small amount of labeled data
together with a large amount of unlabeled data.

22
• Reinforcement learning:
 These algorithms are aimed at finding a policy that maps states
of the world to actions.
 The actions are chosen among the options that an agent ought to
take under those states, with the aim of maximizing some notion
of long-term reward.
 Its main difference regarding the previous types of machine
learning techniques is that input–output pairs are not present
in a database, and its goal resides in online performance.

• Optimization:
 The task of searching for an optimal solution in a space of
multiple possible solutions.
 As the process of learning from data can be regarded as
searching for the model that best fits the data, optimization
methods can be considered an ingredient in modeling.
23
Supervised and unsupervised classification are the most
broadly applied machine learning types in most application
areas, including bioinformatics.

24
Supervised learning Algorithms
 Averaged One-Dependence Estimators (AODE)
 Artificial neural network: Backpropagation
 Bayesian statistics
 Case-based reasoning
 Decision trees
 Inductive logic programming
 Gaussian process regression
 Learning automata
 Minimum message length (decision trees, decision graphs, etc.)
 Lazy learning
 Instance-based learning: Nearest Neighbor Algorithm
 Probably approximately correct (PAC) learning
 Ripple down rules, a knowledge acquisition methodology
 Statistical classification: Hidden Markov models
 Symbolic machine learning algorithms
 Sub-symbolic machine learning algorithms
 Support vector machines
 Random Forests
 Ensembles of classifiers
 Bootstrap aggregating (bagging)
 Boosting
 Ordinal classification
 Regression analysis
 Information fuzzy networks (IFN)
25
Fuzzy sets
 We are continuously having to recognize
 people,
 objects,
 handwriting,
 voice, images, and other patterns,
using :

 Distorted
 unfamiliar,
 incomplete,
 occluded,
 fuzzy, and
 inconclusive data,
where a pattern should be allowed to have membership or
belongingness to more than one class.
26
Fuzzy sets
 This is also very significant in medical diagnosis, where a
patient afflicted with a certain set of symptoms can be
simultaneously suffering from more than one disease.

 Again, the symptoms need not necessarily be strictly


numerical.

 This is how the concept of fuzziness comes into the


picture.

Example?

27
Fuzzy control system

A fuzzy control system is a control system based


on fuzzy logic—a mathematical system that
analyzes input values in terms of variables that take
continuous values between 0 and 1, in contrast to
classical or digital logic, which operates on discrete values
of either 0 or 1 (true or false).

28
Overview
 Fuzzy logic is widely used in machine control.

 The term itself inspires a certain skepticism, sounding equivalent to


"half-baked logic" or "bogus logic", but the "fuzzy" part does not refer
to a lack of rigour in the method, rather to the fact that the logic
involved can deal with fuzzy concepts—concepts that cannot be
expressed as "true" or "false" but rather as "partially true".

 Although genetic algorithms and neural networks can perform just as


well as fuzzy logic in many cases, fuzzy logic has the advantage that the
solution to the problem can be cast in terms that human
operators can understand, so that their experience can be used in
the design of the controller.

 This makes it easier to mechanize tasks that are already successfully


performed by humans.
29
History and Applications
 Fuzzy logic was first proposed by Lotfi A. Zadeh of the University of
California at Berkeley in 1965.

 He elaborated on his ideas in a 1973 paper that introduced the concept


of "linguistic variables", which equate to a variable defined as a fuzzy
set.

 Other research followed, with the first industrial application, a


cement kiln built in Denmark, coming on line in 1975.

 Interest in fuzzy systems was sparked by Seiji Yasunobu and Soji


Miyamoto of Hitachi, who in 1985 provided simulations that
demonstrated the superiority of fuzzy control systems for
the Sendai railway.

 Their ideas were adopted, and fuzzy systems were used to control
accelerating, braking, and stopping when the line opened in 1987.
30
History and Applications

Following such demonstrations, Japanese engineers developed a


wide range of fuzzy systems for both industrial and consumer
applications.

In 1988 Japan established the Laboratory for International


Fuzzy Engineering (LIFE), a cooperative arrangement
between 48 companies to pursue fuzzy research.

Japanese consumer goods often incorporate fuzzy systems


(next slide)

31
Japanese consumer goods involving fuzzy systems

 Matsushita vacuum cleaners use microcontrollers running fuzzy


algorithms to interrogate dust sensors and adjust suction power
accordingly.

 Hitachi washing machines use fuzzy controllers to read load-weight,


fabric-mix, and dirt sensors and automatically set the wash cycle for
the best use of power, water, and detergent.

 Canon developed an autofocusing camera that uses a charge-


coupled device (CCD) to measure the clarity of the image in six
regions of its field of view and uses the information provided to
determine if the image is in focus.

 It also tracks the rate of change of lens movement during


focusing, and controls its speed to prevent overshoot.
32
Japanese consumer goods involving fuzzy systems (Cont.)

An industrial air conditioner designed by Mitsubishi uses 25


heating rules and 25 cooling rules.

 A temperature sensor provides input, with control


outputs fed to an inverter, a compressor valve, and a
fan motor.

 Compared to the previous design, the fuzzy controller


heats and cools five times faster, reduces power
consumption by 24%, increases temperature stability by
a factor of two, and uses fewer sensors.

33
Japanese consumer goods involving fuzzy systems (Cont.)

The enthusiasm of the Japanese for fuzzy logic is reflected in the


wide range of other applications they have investigated or
implemented:

 character and handwriting recognition;


 optical fuzzy systems;
 robots, including one for making Japanese flower
arrangements;
 voice-controlled robot helicopters
 elevator systems; etc.

34
Fuzzy systems in US & Europe

Work on fuzzy systems is also proceeding in the US and


Europe, though not with the same enthusiasm shown in Japan.

The US Environmental Protection Agency has investigated


fuzzy control for energy-efficient motors, and NASA has
studied fuzzy control for automated space docking:

simulations show that a fuzzy control system can greatly


reduce fuel consumption.

Firms such as Boeing, General Motors, Allen-Bradley, Chrysler,


Eaton, and Whirlpool have worked on fuzzy logic for use in
low-power refrigerators, improved automotive
transmissions, and energy-efficient electric motors.
35
Fuzzy systems in US & Europe (Cont.)
 In 1995 Maytag introduced an "intelligent" dishwasher based on a
fuzzy controller and a "one-stop sensing module" that combines:

 a thermistor, for temperature measurement;


 a conductivity sensor, to measure detergent level from the ions
present in the wash;
 a turbidity sensor that measures scattered and transmitted light to
measure the soiling of the wash; and
 a magnetostrictive sensor to read spin rate.

 The system determines the optimum wash cycle for any load to obtain
the best results with the least amount of energy, detergent, and water.

 It even adjusts for dried-on foods by tracking the last time the door
was opened, and estimates the number of dishes by the number of
times the door was opened.
36
Research and development is also continuing on
fuzzy applications in software, as opposed
to firmware, design, including fuzzy expert
systems and integration of fuzzy logic
with neural-network and so-called adaptive
"genetic" software systems, with the ultimate
goal of building "self-learning" fuzzy control
systems.

37
Fuzzy sets
The input variables in a fuzzy control system are in general
mapped by sets of membership functions, known as "fuzzy
sets".

The process of converting a crisp input value to a fuzzy value is


called "fuzzification".

A control system may also have various types of switch, or


"ON-OFF", inputs along with its analog inputs, and such switch
inputs will always have a truth value equal to either 1 or 0, but
the scheme can deal with them as simplified fuzzy functions that
happen to be either one value or another.

38
Fuzzy sets
Given "mappings" of input variables into membership functions
and truth values, the microcontroller then makes decisions for
what action to take based on a set of "rules", each of the form:

IF brake temperature IS warm AND speed IS not very fast


THEN brake pressure IS slightly decreased.

In this example, the two input variables are "brake temperature"
and "speed" that have values defined as fuzzy sets.

The output variable, "brake pressure", is also defined by a fuzzy


set that can have values like "static", "slightly increased", "slightly
decreased", and so on.
39
Fuzzy sets (Cont.)
The decision is based on a set of rules:

All the rules that apply are invoked, using the membership
functions and truth values obtained from the inputs, to
determine the result of the rule.

This result in turn will be mapped into a membership function


and truth value controlling the output variable.

These results are combined to give a specific ("crisp") answer,


the actual brake pressure, a procedure known as
"defuzzification".

This combination of fuzzy operations and rule-


based "inference" describes a "fuzzy expert system“. 40
Fuzzy control in detail
Fuzzy controllers are very simple conceptually.

They consist of an input stage, a processing stage, and an


output stage.

The input stage maps sensor or other inputs to the appropriate


membership functions and truth values.

The processing stage invokes each appropriate rule and


generates a result for each, then combines the results of the
rules.

Finally, the output stage converts the combined result back


into a specific control output value.
41
Fuzzy control in detail (Cont.)
The most common shape of membership functions is
triangular, although trapezoidal and bell curves are also used,
but the shape is generally less important than the number of
curves and their placement.

From three to seven curves are generally appropriate to cover


the required range of an input value, or the "universe of
discourse" in fuzzy jargon.

As discussed earlier, the processing stage is based on a


collection of logic rules in the form of IF-THEN
statements, where the IF part is called the "antecedent"
and the THEN part is called the "consequent".

 Typical fuzzy control systems have dozens of rules.
42


Fuzzy control in detail (Cont.)
Consider a rule for a thermostat:
IF (temperature is "cold") THEN (heater is "high")

 This rule uses the truth value of the "temperature" input, which is
some truth value of "cold", to generate a result in the fuzzy set for
the "heater" output, which is some value of "high".

 This result is used with the results of other rules to finally generate
the crisp composite output.

 Obviously, the greater the truth value of "cold", the higher the truth
value of "high", though this does not necessarily mean that the output
itself will be set to "high", since this is only one rule among many.

43
Fuzzy control in detail (Cont.)
 In some cases, the membership functions can be modified by
"hedges" that are equivalent to adjectives.

 Common hedges include "about", "near", "close to", "approximately",


"very", "slightly", "too", "extremely", and "somewhat".

 These operations may have precise definitions, though the definitions


can vary considerably between different implementations.

 "Very", for example, squares membership functions; since the


membership values are always less than 1, this narrows the
membership function.

 "Extremely" cubes the values to give greater narrowing, while


"somewhat" broadens the function by taking the square root.
44
Fuzzy control in detail (Cont.)

 In practice, the fuzzy rule sets usually have several antecedents


that are combined using fuzzy operators, such as AND, OR,
and NOT, though again the definitions tend to vary:

 AND uses the minimum weight of all the antecedents, while


 OR uses the maximum value.
 NOT subtracts a membership function from 1 to give the
"complementary" function.

 There are several different ways to define the result of a rule, but
one of the most common and simplest is the "max-min"
inference method, in which the output membership function is
given the truth value generated by the premise.
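
The pieces above (membership functions, hedges, fuzzy AND/OR/NOT, max-min inference, defuzzification) can be sketched in a few lines of Python. This is an illustrative toy thermostat, not the slides' own system; the universes of discourse and the triangular membership functions are assumptions:

import numpy as np

def tri(x, a, b, c):
    # triangular membership function with feet at a and c and peak at b
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

def very(mu):     return mu ** 2        # "very" squares the membership value
def somewhat(mu): return np.sqrt(mu)    # "somewhat" takes the square root

def f_and(*mu): return min(mu)          # AND -> minimum of the antecedents
def f_or(*mu):  return max(mu)          # OR  -> maximum
def f_not(mu):  return 1.0 - mu         # NOT -> complement

temp = 12.0                             # crisp input (degrees C)
mu_cold = tri(temp, 0, 5, 20)           # fuzzification: truth value of "cold"

heater = np.linspace(0, 100, 201)       # universe of discourse for heater power (%)
mu_high = tri(heater, 50, 100, 150)     # output fuzzy set "high"

# max-min inference for IF temperature IS cold THEN heater IS high:
clipped = np.minimum(mu_high, mu_cold)  # consequent clipped at the antecedent's truth value

# centroid defuzzification gives the crisp control output
setting = np.sum(heater * clipped) / np.sum(clipped)
print(round(float(mu_cold), 2), round(float(setting), 1))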

45
Neural Networks

46
Human Brain

47
Human Brain

48
Neural Networks in Brain

49
50
51
Brain as an information processing system

• Consists of ~10 billion nerve cells or neurons.

• ~60 trillion inter-connections

Neural Network?

52
Introduction to Artificial Neural Networks
Usefulness and Capabilities

1. Non-linearity
2. Input-Output Mapping
- associations
- Auto-associations
3. Adaptivity
- “free parameters”
4. Evidential Response
- Decision with a measure of confidence
5. Fault Tolerance
- graceful degradation
6. VLSI implementability
7. Neurobiological analogy

53
Introduction

Computer science has been widely adopted by modern medicine.

One reason is that an enormous amount of data has to be


gathered and analysed which is very hard or even impossible
without making use of computer systems.

The majority of medical tools are able to send results of their


work directly to a computer, significantly facilitating the collection of
necessary information.

A large number of such tools already exists and they provide an


aid to the doctors in their everyday work.

54
ANN use in medicine

 ANNs’ effectiveness in recognizing patterns and relations is


a reason why they are being used to aid doctors in solving
medical problems.

 They have shown large efficiency not only in diagnosis but


also in modelling parts of the human body.

 One of the most important dermatological problems is


melanoma diagnosis.

 Dermatologists achieve accuracy in recognizing malignant


melanoma between 65 and 85%, whereas early detection
decreases mortality.
55
ANN use in medicine

 Diagnostic and neural analysis of skin cancer (DANAOS)


showed the results comparable to results of dermatologists.

 It was also found that images hard to recognize by DANAOS


differed from those causing problems to dermatologists.

 Cooperation between humans and computers could therefore


lower the probability of mistakes.

 Results obtained are also dependent on the size and quality of


the database used.

56
ANN use in medicine

 ANNs have also been adopted in pharmaceutical


research and in many other different clinical applications
using pattern recognition; for example:

- diagnosis of breast cancer,

- interpreting ECGs,

- diagnosing dementia,

- predicting prognosis and survival rates.

57
Neural networks
 The ANNs consist of many connected neurons simulating a
brain at work.
 A basic feature which distinguishes an ANN from an
algorithmic program is the ability to generalize the knowledge
of new data which was not presented during the learning
process.
 Expert systems need to gather actual knowledge of their
designated area.
 However, ANNs only need one training and show tolerance
for discontinuity, accidental disturbances or even defects in the
training data set.
 This allows for usage of ANNs in solving problems which
cannot be solved by other means effectively.
58
Neural networks

 These features and advantages are the reason why the area
of ANN’s application is very wide and includes for
example:
– Pattern recognition,
– Object classification,
– Medical diagnosis,
– Forecast of economical risk, market prices
changes, need for electrical power, etc.,
– Selection of employees

59
Biological neural networks

 The human brain consists of around 10^11 nerve cells called neurons

60
What is a neural network (NN)?

• Neural networks are a branch of "Artificial Intelligence".

• An Artificial Neural Network is a system loosely modeled


on the human brain.

• The field goes by many names, such as connectionism, parallel


distributed processing, neuro-computing, natural intelligent
systems, machine learning algorithms, and artificial neural
networks.

61
Description

• Most neural networks have some sort of "training" rule


whereby the weights of connections are adjusted on the
basis of presented patterns.

• In other words, neural networks "learn" from examples,


just like children learn to recognize dogs from examples of
dogs, and exhibit some structural capability for
generalization.

• Neural networks normally have great potential for


parallelism, since the computations of the components are
independent of each other

62
Introduction
• Neural networks are a powerful technique to solve many real
world problems.
• They have the ability to learn from experience in order to
improve their performance and to adapt themselves to
changes in the environment.
• In addition to that they are able to deal with incomplete
information or noisy data and can be very effective
especially in situations where it is not possible to define the
rules or steps that lead to the solution of a problem.

• They typically consist of many simple processing units, which


are wired together in a complex communication network.
63
The Brain

The Brain as an Information


Processing System

The human brain contains


about 10 billion nerve cells, or
neurons. On average, each
neuron is connected to other
neurons through about 10000
synapses.

64
Computation in the brain
• The brain's network of neurons forms a massively parallel
information processing system. This contrasts with conventional
computers, in which a single processor executes a single series of
instructions.

• Against this, consider the time taken for each elementary


operation: neurons typically operate at a maximum rate of about
100 Hz, while a conventional CPU carries out several hundred
million machine-level operations per second. Despite being
built with very slow hardware, the brain has quite remarkable
capabilities:

– Its performance tends to degrade gracefully under partial damage. In


contrast, most programs and engineered systems are brittle: if you
remove some arbitrary parts, very likely the whole will cease to function.
65
Computation in the brain
• It can learn (reorganize itself) from experience.
• This means that partial recovery from damage is possible if healthy
units can learn to take over the functions previously carried out by
the damaged areas.
• It performs massively parallel computations extremely efficiently.
For example, complex visual perception occurs within less than
100 ms, that is, 10 processing steps!
• It supports our intelligence and self-awareness.
• As a discipline of Artificial Intelligence, Neural Networks attempt
to bring computers a little closer to the brain's capabilities by
imitating certain aspects of information processing in the brain, in
a highly simplified way.

66
Introduction to ANNs

Structure of an artificial neuron, worked out by McCulloch


and Pitts in 1943, is similar to biological neuron.

The weighted sum of the inputs is transformed by the


activation function to give the final output

67
Introduction to ANNs (Cont.)
 It consists of two modules: a summation module Σ and an
activation module F.
 Roughly, the summation module corresponds to the biological
nucleus.
 There, an algebraic summation of the weighted input signals is
realised and the output signal is generated.

 The output signal can be calculated using the formula:

x = Σ wi ui , i = 1, …, m

where w - vector of weights (synapses equivalent),
u - vector of input signals (dendrites equivalent),
m - number of inputs.
68
Introduction to ANNs (Cont.)

Signal is processed by the activation module F, which can


be specified by different functions according to needs.

 A simple linear function can be used; the output signal y then has
the form

y = k·x

where k is a coefficient.

Networks using this function are called Madaline and their


neurons are called Adaline (ADAptive LINear Element).

They are the simplest networks, which have found practical


application.
69
ADALINE

 ADALINE (Adaptive Linear Neuron or later Adaptive


Linear Element) is a single layer neural network.

 It was developed by Bernard Widrow and his graduate


student Ted Hoff at Stanford University in 1960.

 It is based on the McCulloch–Pitts neuron.

 It consists of a weight, a bias and a summation function.

70
ADALINE (Cont.)

 The difference between Adaline and the standard


(McCulloch-Pitts) perceptron is that in the learning phase
the weights are adjusted according to the weighted sum of
the inputs (the net).

 In the standard perceptron, the net is passed to the


activation (transfer) function and the function's output is
used for adjusting the weights.

71
ADALINE - Definition

Adaline is a single layer neural network with multiple nodes where


each node accepts multiple inputs and generates one output.

Given the following variables:


• x is the input vector
• w is the weight vector
• n is the number of inputs
• θ is some constant
• y is the output

 Then we find that the output is y = Σ xj wj + θ , j = 1, …, n.

72
ADALINE - Learning Algorithm
Let us assume:

• η is the learning rate (some constant)


• d is the desired output
• o is the actual output

then the weights are updated as follows:

w ← w + η (d − o) x

 The ADALINE converges to the least squares error, which is

E = (d − o)²
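
A minimal runnable sketch of the update rule above (the toy data, learning rate and number of epochs are arbitrary assumptions):

import numpy as np

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])   # toy inputs
d = np.array([-1., -1., -1., 1.])                        # desired outputs (AND-like targets)

w = np.zeros(2)      # weight vector
theta = 0.0          # bias term
eta = 0.1            # learning rate

for epoch in range(100):
    for x, target in zip(X, d):
        o = np.dot(w, x) + theta        # linear output (the "net")
        w += eta * (target - o) * x     # delta rule: weights follow the raw error
        theta += eta * (target - o)

print(np.sign(X @ w + theta))           # thresholded predictions after training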

73
Madaline

 Madaline (Multiple Adaline) is a two layer neural network


with a set of ADALINEs in parallel as its input layer and
a single PE (processing element) in its output layer.

 The madaline network is useful for problems which involve


prediction based on multiple inputs, such as:

- weather forecasting

74
Introduction to ANNs (Cont.)

 Another type of activation module function is a threshold
function:

y = 1 if x ≥ T, 0 otherwise,

where T is a constant threshold value.

75
Introduction to ANNs (Cont.)
However, functions which describe the non-linear profile of a
biological neuron more precisely are:

A sigmoid function:

y = 1 / (1 + e^(−βx))

where β is a given parameter, and

A tangensoid (hyperbolic tangent) function:

y = tanh(αx)

where α is a given parameter.

76
Introduction to ANNs (Cont.)

 Information capacity and processing ability of a single neuron


is relatively small.

 However, it can be raised by the appropriate connection of


many neurons.

 In 1958, the first ANN, called perceptron, was developed by


Rosenblatt.

 It was used for alphanumerical character recognition.

77
Introduction to ANNs (Cont.)
Neurons in the multilayer ANNs are grouped into 3 different
types of layers: input, output, and hidden layer

There can be one or more hidden layers in the network but


only one output and one input layer.
78
Introduction to ANNs (Cont.)

 The number of neurons in the input layer is specified by the


type and amount of data which will be given to the input.

 The number of output neurons corresponds to the type of


answer of the network.

 The amount of hidden layers and their neurons is more


difficult to determine.

 A network with one hidden layer suffices to solve most tasks.

 None of the known problems needs a network with more than


three hidden layers in order to be solved.
79
Introduction to ANNs (Cont.)
Selection of the number of layers for solving different problems.

More complicated networks can solve more complicated issues.


80
Introduction to ANNs (Cont.)

 There is no good recipe for selecting the number of hidden
neurons.

 One of the methods is described by formula

where Nh is the number of neurons in the hidden layer, and


Ni and No are the corresponding numbers for the
input and output layers, respectively.

However, usually the quantity of hidden neurons is


determined empirically
81
Introduction to ANNs (Cont.)

 Two types of a multilayer ANNs can be distinguished with


regards to the architecture:

- feed-forward and
- feedback networks.

82
Introduction to ANNs (Cont.)
 In feed-forward networks, a signal can move in one
direction only and cannot move between neurons in the same
layer.

Multilayer feed-forward ANN

Such networks can be used in the pattern recognition.


83
Introduction to ANNs (Cont.)

 Feedback networks are more complicated, because a signal can


be sent back to the input of the same layer with a changed value.

Signals can move in these loops until the proper state is achieved.

 These networks are also called interactive or recurrent.
84


Training of the ANN
 The process of training of the ANN consists in changing the
weights assigned to connections of neurons until the achieved
result is satisfactory.

 Two main kinds of learning can be distinguished:


- supervised and
- unsupervised learning.

 In supervised learning, an external teacher is used to


correct the answers given by the network.

 ANN is considered to have learned when computed errors are


minimized.

85
Training of the ANN (Cont.)

 Unsupervised learning does not use a teacher.

 ANN has to distinguish patterns using the information


given to the input without external help.

 This learning method is also called self-organisation.

 It works like a brain which uses sensory impressions to


recognise the world without any instructions

86
Training of the ANN (Cont.)
 One of the best known learning algorithms is the Back-
Propagation Algorithm (BPA).

 This basic, supervised learning algorithm for multilayered


feed-forward networks gives a recipe for changing the weights
of the elements in neighbouring layers.

 It consists in minimization of the sum-of-squares errors,


known as least squares.

87
Back-Propagation Algorithm (BPA)
To teach ANN using BPA the following steps have to be carried
out for each pattern in the learning set:

1. Insert the learning vector u^μ as an input to the network.

2. Evaluate the output values u_jm^μ of each element for all layers
using the formula

3. Evaluate error values for the output layer using the


formula

88
Back-Propagation Algorithm (BPA) (Cont.)

4. Evaluate sum-of-squares errors from

5. Carry out the back-propagation of output layer error


to all elements of hidden layers calculating their errors
from

89
Back-Propagation Algorithm (BPA) (Cont.)
6. Update the weights of all elements between output and hidden
layers and then between all hidden layers moving towards the
input layer.

Changes of the weights can be obtained from

Back-propagation of error values

90
Back-Propagation Algorithm (BPA) (Cont.)
The above steps have to be repeated until a satisfactory minimum of
the complete error function is achieved:

91
Back-Propagation Algorithm (BPA) (Cont.)

 Every iteration of these instructions is called an epoch.

 After the learning process is finished another set of patterns


can be used to verify the knowledge of the ANN.

 For complicated networks and large sets of patterns the


learning procedure can take a lot of time.

 Usually it is necessary to repeat the learning process many


times with different coefficients selected by trial and error.

 There are a variety of optimisation methods that can be used


to accelerate the learning process.

92
Back-Propagation Algorithm (BPA) (Cont.)

 One of them is the momentum technique, which consists in
calculating the changes of the weights for pattern (k + 1) using the
formula

Δw(k + 1) = η δ u + α Δw(k)

where α is a constant value which determines the influence of
the previous change of weights on the current change.
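
A compact sketch of the back-propagation loop with the momentum term, under assumptions not taken from the slides (one hidden layer, sigmoid activations, batch updates on the XOR problem, bias handled by appending a constant input of 1):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.array([[0.], [1.], [1.], [0.]])                 # XOR targets
Xb = np.hstack([X, np.ones((4, 1))])                   # constant 1 acts as a bias input

W1 = rng.normal(scale=0.5, size=(3, 3))                # (2 inputs + bias) -> 3 hidden units
W2 = rng.normal(scale=0.5, size=(4, 1))                # (3 hidden + bias) -> 1 output
dW1, dW2 = np.zeros_like(W1), np.zeros_like(W2)        # previous weight changes
eta, alpha = 0.5, 0.9                                  # learning rate and momentum

for epoch in range(10000):
    H = sigmoid(Xb @ W1)                               # hidden-layer outputs
    Hb = np.hstack([H, np.ones((4, 1))])
    Y = sigmoid(Hb @ W2)                               # network outputs
    delta2 = (T - Y) * Y * (1 - Y)                     # output-layer error terms
    delta1 = (delta2 @ W2[:3].T) * H * (1 - H)         # errors propagated back to the hidden layer
    dW2 = eta * Hb.T @ delta2 + alpha * dW2            # momentum: add alpha times the previous change
    dW1 = eta * Xb.T @ delta1 + alpha * dW1
    W2 += dW2
    W1 += dW1

print(np.round(Y.ravel(), 2))   # outputs after training; ideally close to [0, 1, 1, 0]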

93
A Simple Artificial Neuron
•The basic computational element (model neuron) is often called a node or unit. It
receives input from some other units, or perhaps from an external source. Each
input has an associated weight w, which can be modified so as to model synaptic
learning. The unit computes some function f of the weighted sum of its inputs

•Its output, in turn, can serve as input to other


units.

•The weighted sum is called the net input to


unit i, often written neti.

•Note that wij refers to the weight from unit j to


unit i (not the other way around).

•The function f is the unit's activation function.


In the simplest case, f is the identity function,
and the unit's output is just its net input. This is
called a linear unit.
94
Applications:
Neural Network Applications can be grouped in following categories:

• Clustering:

A clustering algorithm explores the similarity between patterns and


places similar patterns in a cluster. Best known applications include
data compression and data mining.

• Classification/Pattern recognition:

The task of pattern recognition is to assign an input pattern (like


handwritten symbol) to one of many classes. This category includes
algorithmic implementations such as associative memory.

95
Applications:
• Function approximation:

The tasks of function approximation is to find an estimate of


the unknown function f() subject to noise. Various engineering
and scientific disciplines require function approximation.

• Prediction/Dynamical Systems:

The task is to forecast some future values of a time-sequenced


data. Prediction has a significant impact on decision support
systems. Prediction differs from Function approximation by
considering time factor.

Here the system is dynamic and may produce different results


for the same input data based on system state (time).

96
Types of Neural Networks
Neural Network types can be classified based on following attributes:

• Applications
-Classification
-Clustering
-Function approximation
-Prediction
• Connection Type
- Static (feedforward)
- Dynamic (feedback)
• Topology
- Single layer
- Multilayer
- Recurrent
- Self-organized
• Learning Methods
- Supervised
- Unsupervised
97
The McCulloch-Pitts Model of Neuron
• The early model of an artificial neuron is introduced by Warren McCulloch
and Walter Pitts in 1943. The McCulloch-Pitts neural model is also known as
linear threshold gate. It is a neuron with a set of inputs I1, I2, I3, …, Im and one
output y. The linear threshold gate simply classifies the set of inputs into two
different classes. Thus the output y is binary. Such a function can be described
mathematically using these equations:

•W1,W2…Wm are weight values


normalized in the range of either (0,1) or (-
1,1) and associated with each input line,
Sum is the weighted sum, and T is a
threshold constant. The function f is a
linear step function at threshold T as
shown in figure
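
A minimal sketch of the linear threshold gate described above (the weights and threshold below are illustrative choices that make the gate compute a logical AND):

def mcp_neuron(inputs, weights, T):
    s = sum(i * w for i, w in zip(inputs, weights))   # weighted sum of the inputs
    return 1 if s >= T else 0                         # binary output: step function at threshold T

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", mcp_neuron([x1, x2], [1, 1], T=2))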
98
The Perceptron
• In late 1950s, Frank Rosenblatt introduced a network composed of
the units that were enhanced version of McCulloch-Pitts Threshold
Logic Unit (TLU) model. Rosenblatt's model of neuron, a perceptron,
was the result of merger between two concepts from the 1940s,
McCulloch-Pitts model of an artificial neuron and Hebbian learning
rule of adjusting weights. In addition to the variable weight values,
the perceptron model added an extra input that represents bias.
Thus, the modified equation is now as follows:

Sum = Σ Ii Wi + b

where b represents the bias value.

99
The McCulloch-Pitts Model of Neuron
Symbolic Illustration of Linear Threshold Gate

• The McCulloch-Pitts model of a neuron is simple yet has substantial computing potential.
It also has a precise mathematical definition. However, this model is so simplistic that it
only generates a binary output and also the weight and threshold values are fixed. The
neural computing algorithm has diverse features for various applications. Thus, we need to
obtain a neural model with more flexible computational features.

100
Artificial Neuron with Continuous
Characteristics
• Based on the McCulloch-Pitts model described previously, the
general form an artificial neuron can be described in two stages
shown in figure. In the first stage, the linear combination of inputs is
calculated. Each value of input array is associated with its weight
value, which is normally between 0 and 1. Also, the summation
function often takes an extra input value Theta with weight value
of 1 to represent threshold or bias of a neuron. The summation
function will be then performed as

• The sum-of-product value is then passed into the second stage to


perform the activation function which generates the output from
the neuron. The activation function "squashes" the amplitude of the
output into the range [0,1] or alternatively [-1,1]. The behavior of
the activation function will describe the characteristics of an
artificial neuron model.
101
Artificial Neuron with Continuous
Characteristics

•The signals generated by actual biological neurons are action-potential spikes, and
biological neurons send signals in patterns of spikes rather than the simple absence or
presence of a single spike pulse. For example, the signal could be a continuous stream of
pulses with various frequencies. With this kind of observation, we should consider a signal to
be continuous with a bounded range. The linear threshold function should be "softened".

•One convenient form of such "semi-linear" function is the logistic sigmoid function, or in
short, the sigmoid function, as shown in the figure. As the input x tends to a large positive value, the
output value y approaches 1. Similarly, the output gets close to 0 as x goes negative.
However, the output value is neither close to 0 nor 1 near the threshold point.
102
Artificial Neuron with Continuous
Characteristics
• This function is expressed
mathematically as follows:

y = 1 / (1 + e^(−x))
• Additionally, the sigmoid


function describes the
``closeness" to the threshold
point by the slope. As x
approaches −infinity or +infinity,
the slope is zero; the slope
increases as x approaches 0.
increases as x approaches to 0.
This characteristic often plays an
important role in learning of
neural networks.

103
Support Vector Machines

104
Support Vector Machines
 SVMs are a set of related supervised learning methods used for
classification and regression.

 In simple words, given a set of training examples, each marked as


belonging to one of two categories, a SVM training algorithm
builds a model that predicts whether a new example falls into
one category or the other.

 An SVM model is a representation of the examples as points in


space, mapped so that the examples of the separate categories
are divided by a clear gap that is as wide as possible.

 New examples are then mapped into that same space and
predicted to belong to a category based on which side of the gap
they fall on.
105
Support Vector Machines

A linear support vector machine is composed of a set of


given support vectors z and a set of weights w.

The computation for the output of a given SVM with N


support vectors z1, z2, … , zN and weights w1, w2, … , wN is
then given by:

F(x) = Σ wi <zi, x> + b , i = 1, …, N
106
Support Vector Machines

 SVMs map input samples into a higher-dimensional space


where a maximal separating hyperplane among the instances
of different classes is constructed.

107
Support Vector Machines (Cont.)

 The method works by constructing another two parallel


hyperplanes on each side of this hyperplane.

 The SVM method tries to find the separating hyperplane that


maximizes the area of separation between the two parallel
hyperplanes.

 A larger separation between these parallel hyperplanes


implies a better predictive accuracy of the classifier.

108
Support Vector Machines (Cont.)

 As the widest area of separation is, in fact, determined by a few


samples that are close to both parallel hyperplanes, these
samples are called support vectors.

 They are also the most difficult samples to be correctly


classified.

109
Support Vector Machines

 SVM proposed by Vapnik was originally designed for


classification and regression tasks.

 Essence of SVM method is construction of optimal


hyperplane, which can separate data from opposite
classes using the biggest possible margin.

 Margin is a distance between optimal hyperplane and a


vector which lies closest to it.

110
Support Vector Machines

 SVMs are a set of related supervised learning methods


which analyse data and recognize patterns, used for statistical
classification and regression analysis.

 Given a set of training examples, each marked as belonging to


one of two categories, an SVM training algorithm builds a
model that predicts whether a new example falls into one
category or the other.

 An SVM model is a representation of the examples as points in


space, mapped, so that the examples of the separate categories
are divided by a clear gap that is as wide as possible.

111
Support Vector Machines
 New examples are then mapped into that same space and
predicted to belong to a category based on which side of the
gap they fall on.

 More formally, SVM constructs a hyperplane or set of


hyperplanes in a high or infinite dimensional space, which can
be used for classification, regression or other tasks.

 A good separation is achieved by the hyperplane that has the


largest distance to the nearest training datapoints of any class
(so-called functional margin), since in general the larger the
margin the lower the generalization error of the classifier.

112
Motivation
 Classifying data is a common task in machine learning.

 Suppose some given data points each belong to one of two


classes, and the goal is to decide which class a new data point will
be in.

 There are many hyperplanes that might classify the data.

 One reasonable choice as the best hyperplane is the one that


represents the largest separation, or margin, between the two
classes.

113
Motivation
 So we choose the hyperplane so that the distance from it to the
nearest data point on each side is maximized.

 If such a hyperplane exists, it is known as the maximum-


margin hyperplane and the linear classifier it defines is
known as a maximum margin classifier.

114
Optimal hyperplane separating two classes

115
Classification with Large Margin
 Whenever a dataset is linearly separable, i.e. there exists a
hyperplane that correctly classifies all data points, there exist
many such separating hyperplanes.

 We are thus faced with the question of which hyperplane to


choose, ensuring that not only the training data, but also future
examples, unseen by the classifier at training time, are classified
correctly.

 Our intuition as well as statistical learning theory suggests that


hyperplane classifiers will work better if the hyperplane not only
separates the examples correctly, but does so with a large margin.

 Here, the margin of a linear classifier is defined as the distance


of the closest example to the decision boundary.
116
Hard Margin
 Let us adjust b such that the hyperplane is half way in between
the closest positive and negative example, respectively.

 This "hard margin" SVM, applicable to linearly separable data,


is the classifier with maximum margin among all classifiers
that correctly classify all the input examples.

117
Soft Margin
In practice, data is often not linearly separable; and even if it is, a
greater margin can be achieved by allowing the classifier to
misclassify some points.

118
Theory and experimental results show that the
resulting larger margin will generally provide better
performance than the hard margin SVM.

119
Left: The two points closest to the hyperplane strongly affect its
orientation, leading to a hyperplane that comes close to several
other data points.
Right: Here, those points move inside the margin, and the
hyperplane's orientation is changed, leading to a much larger
margin for the rest of the data.
120
Soft Margin

The constant C > 0 sets the relative importance of maximizing


the margin and minimizing the amount of slack.

This formulation is called the soft-margin SVM.

For a large value of C a large penalty is assigned to errors.

121
Non-linear classification
 The original optimal hyperplane algorithm proposed by Vladimir
Vapnik in 1963 was a linear classifier.

 In 1992, Bernhard Boser, Isabelle Guyon and Vapnik suggested a


way to create non-linear classifiers by applying the kernel trick to
maximum-margin hyperplanes.

 This allows the algorithm to fit the maximum-margin hyperplane


in a transformed feature space.

122
Non-linear classification (Cont.)

 The transformation may be non-linear and the transformed


space high dimensional; thus though the classifier is a
hyperplane in the high-dimensional feature space, it may be
non-linear in the original input space.

 If the kernel used is a Gaussian radial basis function, the


corresponding feature space is a Hilbert space of infinite
dimension.

 Maximum margin classifiers are well regularized, so the


infinite dimension does not spoil the results.

123
Kernel Support Vector Machines

 Using kernels, the original formulation for the output of an SVM
with support vectors z1, z2, … , zN and weights w1, w2, … , wN is
now given by:

F(x) = Σ wi k(zi, x) + b , i = 1, …, N
124
The Kernel trick

 The Kernel trick is a very interesting and powerful tool.

 It is powerful because it provides a bridge from linearity to non-


linearity to any algorithm that solely depends on the dot
product between two vectors.

 It comes from the fact that, if we first map our input data into a
higher-dimensional space, a linear algorithm operating in this
space will behave non-linearly in the original input space.

 Now, the Kernel trick is really interesting because that mapping


does not need to be ever computed.

125
The Kernel trick

 If our algorithm can be expressed only in terms of an inner


product between two vectors, all we need to do is replace this inner
product with the inner product from some other suitable
space.

That is where resides the "trick": wherever a dot product is


used, it is replaced with a Kernel function.

The kernel function denotes an inner product in feature space


and is usually denoted as:

K(x,y) = <φ(x),φ(y)>

126
Kernel functions (Cont.)
We are now looking for solution in other space, but the problem
is linearly separable, so it is more effective, even if the problem
was linearly non-separable in the input space

127
Kernel functions (Cont.)

128
The Kernel trick

Using the Kernel function, the algorithm can then be carried


into a higher-dimension space without explicitly mapping
the input points into this space.

This is highly desirable, as sometimes our higher-


dimensional feature space could even be infinite-dimensional
and thus infeasible to compute.

129
Kernel functions

 The possibility of linear non-separability in the input space is the
reason why, in the SVM approach, the optimal hyperplane is
constructed not in the input space but rather in a high-dimensional,
so-called feature space Z.

130
Kernels: from Linear to Non-Linear
Classifiers

In many applications a non-linear classifier provides better


accuracy.

And yet, linear classifiers have advantages, one of them being


that they often have simple training algorithms that scale well
with the number of examples.

Can the machinery of linear classifiers be
extended to generate non-linear decision boundaries?

131
Kernels: from Linear to Non-Linear
Classifiers
There is a straightforward way of turning a linear classifier non-
linear, or making it applicable to non-vectorial data.

It consists of mapping our data to some vector space, which we


will refer to as the feature space, using a function φ.

The discriminant function then is f(x) = <w, φ(x)> + b.

132
Kernels for Real-valued Data

 Real-valued data, i.e. data where the examples are vectors of a


given dimensionality, is common in bioinformatics and other
areas.

 A few examples of applying SVM to real-valued data include:


- prediction of disease state from microarray data
- prediction of protein function from a set of
features that include amino acid composition and
various properties of the amino acids in the protein.

 The two most commonly used kernel functions for real-valued


data are the polynomial and the Gaussian kernel.
133
Kernels for Real-valued Data
The polynomial kernel of degree d is defined as:

k(x, x′) = (<x, x′> + κ)^d

The kernel with d = 1 and κ = 0, denoted by k_linear, is


the linear kernel leading to a linear discriminant function.

134
Kernels for Real-valued Data
The degree of the polynomial kernel controls the flexibility of
the resulting classifier.

The lowest degree polynomial is the linear kernel, which is not


sufficient when a non-linear relationship between features exists.

In some cases, a degree 2 polynomial may be flexible enough to


discriminate between the two classes with a good margin.

The degree 5 polynomial yields a similar decision boundary, with


greater curvature.

Normalization can help to improve performance and numerical


stability for large d.
135
Kernels for Real-valued Data
Nonlinear kernels such as the polynomial kernel provide
additional flexibility.

136
Standard Kernels

Other common kernel functions include:

 Linear

 Polynomial

 Radial Basis Function

 Gaussian Radial basis function

 Hyperbolic tangent

137
Gaussian Kernel

 The Gaussian kernel is by far one of the most versatile


Kernels.

 It is a radial basis function kernel, and is the preferred Kernel


when we don’t know much about the data we are trying to
model.
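
A small sketch of the kernels named above together with the kernelized decision function F(x) = Σ wi k(zi, x) + b; the support vectors, weights and parameter values are hypothetical, chosen only to show the calls:

import numpy as np

def linear_kernel(x, y):
    return np.dot(x, y)

def polynomial_kernel(x, y, degree=3, kappa=1.0):
    return (np.dot(x, y) + kappa) ** degree            # degree controls flexibility

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))   # radial basis function

def svm_decision(x, support_vectors, weights, b, kernel):
    # F(x) = sum_i w_i * k(z_i, x) + b ; the sign gives the predicted class
    return sum(w * kernel(z, x) for z, w in zip(support_vectors, weights)) + b

Z = [np.array([0.0, 1.0]), np.array([1.0, 0.0])]       # hypothetical support vectors
w = [1.0, -1.0]                                        # hypothetical weights
x_new = np.array([0.2, 0.9])
print(np.sign(svm_decision(x_new, Z, w, b=0.0, kernel=gaussian_kernel)))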

138
Learning Algorithms

The algorithm is governed by extra parameters besides the


Kernel function and the data points:

The parameter C controls the trade off between allowing some


training errors and forcing rigid margins.

- Increasing the value of C increases the cost of


misclassifications but may result in models that do not
generalize well to points outside the training set.
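
A hedged usage sketch (assuming scikit-learn; the synthetic data is only for illustration) showing how the kernel and the C parameter are set in practice:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)      # non-linearly separable toy labels

soft = SVC(kernel="rbf", C=1.0).fit(X, y)              # wider margin, tolerates some training errors
rigid = SVC(kernel="rbf", C=1000.0).fit(X, y)          # heavily penalizes misclassifications

print(soft.n_support_, rigid.n_support_)               # support vectors per class for each setting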

139
Example: Splice Site Recognition
 It is a problem arising in computational gene finding and concerns
the recognition of splice sites that mark the boundaries between
exons and introns in eukaryotes.

 Introns are excised from premature mRNAs in a processing step


after transcription.

 The vast majority of all splice sites are characterized by the


presence of specific dimers on the intronic side of the splice site:

- GT for donor and


- AG for acceptor sites.

 However, only about 0.1-1% of all GT and AG occurrences in the


genome represent true splice sites.
140
Example: Splice Site Recognition

141
Example: Splice Site Recognition

 There are two different splice sites: the exon-intron boundary,


referred to as the donor site or 5’ site (of the intron) and the
intron-exon boundary, that is the acceptor or 3’ site.

 Splice sites have quite strong consensus sequences, i.e. almost


each position in a small window around the splice site is
representative of the most frequently occurring nucleotide
when many existing sequences are compared in an alignment,

142
Performance of SVM
 To evaluate the classifier performance, receiver operating
characteristic (ROC) curves are used, which show the true positive
rates (y-axis) over the full range of false positive rates (x-axis).

 Different values are obtained by using different thresholds on the


value of the discriminant function for assigning the class
membership.

 The area under the curve quantifies the quality of the classifier,
and a larger value indicates better performance.

 Research has shown that it is a better measure of classifier


performance, in particular when the fraction of examples in one
class is much smaller than the other.
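
A short illustration of how such a curve is computed (assuming scikit-learn; labels and discriminant scores below are made up):

import numpy as np
from sklearn.metrics import roc_curve, auc

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])                       # hypothetical class labels
scores = np.array([-1.2, 0.3, 0.8, 1.5, -0.1, -0.7, 2.0, 0.1])    # hypothetical discriminant values

fpr, tpr, thresholds = roc_curve(y_true, scores)   # sweep thresholds over the scores
print("AUC =", auc(fpr, tpr))                      # larger area under the curve -> better classifier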
143
Sequence logo for acceptor splice sites:

Splice sites have quite strong consensus sequences, i.e. almost each
position in a small window around the splice site is representative
of the most frequently occurring nucleotide when many existing
sequences are compared in an alignment.

The sequence logo shows the region around the intron/ exon
boundary—the acceptor splice site.

144
Multi-class classification

 Although SVM method is naturally adapted for separating


data from two classes, it can be easily transformed into
very useful tool for the classification of more than two
classes.

 There are two basic ways of solving the N-class problem:

– Solving N two-class classification tasks,

– Pairwise classification

145
Multi-class classification (Cont.)

 The first method consists in training many classifiers using the
one-versus-the-rest method.

 It means that while solving every i-th task (i = 1, 2, ..., N) we
separate the current class from all the other classes, and every
time a new hyperplane comes into being.

146
Multi-class classification (Cont.)

 Support vectors, which belong to class i satisfy y(x;wi, bi) = 1,


whereas the other ones satisfy condition

y(x;wi, bi) = −1.

 If for a new vector x we have y(x; wi, bi) > 0,
then the vector is assigned to class i. However, it may happen
that this is true for many i or is not true for any of them.

 For such cases the classification is unfeasible.

147
Multi-class classification (Cont.)

 In pairwise classification N-class problem is replaced with


N(N−1)/2 differentiation tasks between two classes.

 Although the number of classifiers is greater than in the
previous method, individual classifiers can be trained faster,
and depending on the dataset this results in time savings.

 Unambiguous classification is not always possible, since it


may happen that more than one class will get the same
number of votes.
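
A sketch of the two decision schemes, with the classifier internals left abstract (the dummy discriminants at the end are hypothetical, only to show the call pattern):

import numpy as np
from collections import Counter

def one_vs_rest_predict(x, classifiers):
    # classifiers[i](x) returns y(x; wi, bi); choose the class with the largest value
    return int(np.argmax([f(x) for f in classifiers]))

def pairwise_predict(x, pair_classifiers):
    # pair_classifiers[(i, j)](x) > 0 votes for class i, otherwise for class j
    votes = Counter()
    for (i, j), f in pair_classifiers.items():
        votes[i if f(x) > 0 else j] += 1
    return votes.most_common(1)[0][0]

clfs = [lambda x: x[0], lambda x: x[1], lambda x: -x[0] - x[1]]   # dummy linear discriminants
print(one_vs_rest_predict(np.array([0.2, 0.9]), clfs))            # -> class 1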

148
Evaluation and Comparison of the
Model Predictive Power (Cont.)
 Receiver operating characteristic (ROC) curves are an
interesting tool for representing the accuracy of a classifier.

 The ROC analysis evaluates the accuracy of an algorithm


over a range of possible operating (or tuning) scenarios.

 A ROC curve is a plot of a model’s true-positive rate


against its false-positive rate: sensitivity versus 1 − specificity.

 The ROC curve represents a plot of these two concepts


for a number of values of a parameter (operating
scenarios) of the classification algorithm.
149
Genetic Algorithms

150
Genetic Algorithms
 Genetic algorithms are based on evolutionary principles
wherein a particular function or definition that best fits the
constraints of an environment survives to the next generation,
and the other functions are eliminated.

 This iterative process continues indefinitely, allowing the


algorithm to adapt dynamically to the environment as needed.

 Genetic algorithms evaluate a large number of solutions to a


problem that are generated at random.

 The members with the highest fitness scores are allowed to "mate" with


crossovers and mutations, creating the next generation.

151
Genetic Algorithm - Concept

152
Genetic Algorithm
• Inspired by natural evolution
• Population of individuals
– Individual is feasible solution to problem
• Each individual is characterized by a Fitness function
– Higher fitness is better solution
• Based on their fitness, parents are selected to reproduce offspring
for a new generation
– Fitter individuals have more chance to reproduce
– New generation has same size as old generation; old generation
dies
• Offspring has combination of properties of two parents
• If well designed, population will converge to optimal solution
153
Example of convergence

154
Genetic Algorithm
{
initialize population;
evaluate population;
while TerminationCriteriaNotSatisfied
{
select parents for reproduction;
perform recombination and mutation;
evaluate population;
}
}
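
A runnable Python sketch of the loop above; the bit-string encoding, fitness function (maximize -(x - 3)^2 on [0, 10]), tournament selection, one-point crossover and mutation rate are all illustrative choices:

import random

random.seed(0)
BITS, POP, GENS = 16, 30, 60

def decode(bits):                      # map a bit string to a real value in [0, 10]
    return int("".join(map(str, bits)), 2) / (2 ** BITS - 1) * 10.0

def fitness(bits):                     # higher is better; peak at x = 3
    return -(decode(bits) - 3.0) ** 2

def tournament(pop):                   # fitter of two randomly drawn individuals reproduces
    a, b = random.sample(pop, 2)
    return a if fitness(a) > fitness(b) else b

population = [[random.randint(0, 1) for _ in range(BITS)] for _ in range(POP)]

for gen in range(GENS):
    new_population = []
    while len(new_population) < POP:
        p1, p2 = tournament(population), tournament(population)
        cut = random.randrange(1, BITS)                           # one-point crossover
        child = p1[:cut] + p2[cut:]
        child = [b ^ (random.random() < 0.01) for b in child]     # bit-flip mutation
        new_population.append(child)
    population = new_population                                   # the old generation dies

print(round(decode(max(population, key=fitness)), 3))             # should be close to 3.0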

155
