IFAC PapersOnLine 53-2 (2020) 1385–1390
Early History of Machine Learning ⋆

Alexander L. Fradkov ∗

∗ Institute for Problems in Mechanical Engineering, Russian Academy of Sciences, Saint-Petersburg (e-mail: Alexander.Fradkov@gmail.com)

Abstract: Machine learning lies at the crossroads of cybernetics (control science) and computer science. It has recently been attracting overwhelming interest, both from professionals and from the general public. In the talk, a brief overview of the historical development of the machine learning field is provided, with a focus on the development of its mathematical apparatus in the first decades. A number of little-known facts published in hard-to-reach sources are presented.

Copyright © 2020 The Authors. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0)

Keywords: Machine Learning, Neural Networks, Separation Theorems, Convex Optimization

1. INTRODUCTION

Machine learning is an area related to both cybernetics and computer science (or Control Science and Computer Science)¹, and it has recently been attracting overwhelming interest both from professionals and from the general public. In the last few years, thanks to the successes of computer science (the emergence of GPUs, leading to significant improvements in the performance of computers, and the development of special software allowing one to work with big data), machine learning is often attributed to computer science. Historically, however, learning algorithms that provide convergence and a sufficient convergence rate of the learning process arose within cybernetics/control. Below, a control theorist's look at the history of machine learning is presented.

The author is a mathematician by education and has some experience in pattern recognition and control based on adaptation and learning in the 1960s.

⋆ Supported by the Government of the Russian Federation, project 08-08.
¹ Cybernetics is understood here as control theory in a broad sense, including system identification, control, optimization and other disciplines under research within the control community.

2. EARLY YEARS

The origin of machine learning in its modern sense is usually associated with the name of the psychologist Frank Rosenblatt from Cornell University who, based on ideas about the work of the human nervous system, created a group that built a machine for recognizing the letters of the alphabet Rosenblatt (1957, 1959, 1960). The machine, called the "perceptron" by its creator, used both analog and discrete signals and included a threshold element that converted analog signals into discrete ones. It became the prototype of modern artificial neural networks (ANN), and the model of its learning was close to the models of animal and human learning developed in psychology, see Bush & Mosteller (1951). Rosenblatt himself performed the first mathematical studies of the perceptron Rosenblatt (1959). However, the Novikoff theorem Novikoff (1962), which gives conditions for the convergence of a perceptron learning algorithm in a finite number of steps, has become better known.

3. THE 1960s: FIRST BOOM OF MACHINE LEARNING

3.1 Deterministic approaches

In the early 1960s several groups were engaged in the design and testing of learning recognition systems Widrow (1961); Bongard (1961); Braverman (1962). General statements of the pattern recognition problem were proposed in 1963-1964 Abramson et al (1963); Widrow (1964) in the groups of Mark Aizerman (including E. Braverman and L. Rozonoer) and Alexander Lerner (including V. Vapnik and A. Chervonenkis) at the Moscow Institute of Control Science, and in the group of Vladimir Yakubovich (including V. Fomin, A. Gelig and a number of younger researchers) at Leningrad State University, LSU (currently St. Petersburg State University). In the papers Vapnik and Lerner (1963); Vapnik and Chervonenkis (1964) and Yakubovich (1963, 1965, 1966) deterministic approaches were developed, while in Aizerman et al (1964a,b) a probabilistic statement of the machine learning problem was suggested.

Let us consider the developed learning algorithms in more detail using a traditional simple example. Let each object (image) X_k, k = 1, 2, ..., N shown to the machine be encoded by a set of real numbers, the values of some features: X_k = (X_k1, X_k2, ..., X_kn)^T, and let it be required to train the machine to recognize to which of two classes, A or B, a newly presented object X* belongs. For certainty, let X_k belong to class A for k = 1, 2, ..., k_A and to class B for k = k_A + 1, ..., N. That is, the objects have a membership function y(X) such that y(X_k) = 1 for k = 1, 2, ..., k_A and y(X_k) = -1 for k = k_A + 1, ..., N, where 1 < k_A < N. It is assumed that the convex hulls of the vectors X_k belonging to different classes do not intersect, so that these sets can be separated by a hyperplane. This means that there exist a vector of weights w = (w_1, w_2, ..., w_n)^T and a number w_0 such that W(X_k) > 0 for k = 1, 2, ..., k_A and W(X_k) < 0 for k = k_A + 1, ..., N, where W(X) = w^T X + w_0. Thus, the problem is reduced to the approximation of a function based on its values on a finite set. Gradient-type algorithms for perceptron learning construct some of the separating hyperplanes and, as was shown in Novikoff (1962); Yakubovich (1963), under some conditions perform it in a finite number of steps.
Let us make the transition to the conjugate space of weight vectors (w, w_0). Then the problem is transformed into a dual problem of finding the intersection of a finite number of half-spaces (w, w_0): y(X_k)(w^T X_k + w_0) > 0, k = 1, 2, ..., N. The dual problem can also be solved by different methods. If the objects are presented to the machine sequentially, then a class of gradient-type recurrent algorithms can be used. They look as follows:

w_{k+1} = w_k − γ_k y(X_k) X_k,   w_{0,k+1} = w_{0,k} − γ_k y(X_k),   (1)

and require knowledge of only one image's coordinates at each step. Different approaches have been proposed to select the step size γ_k. In particular, it is possible to project the current vector of weights onto the boundary hyperplane at each iteration if the presented inequality is not satisfied, i.e. if the current object is classified incorrectly Yakubovich (1965, 1966). Since the weights are not corrected if the presented inequality is satisfied, the inequality defines a deadzone for the algorithm in the space of weights.
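To make the recurrent scheme concrete, below is a minimal Python sketch in the spirit of algorithm (1): the weights are updated only when the inequality y(X_k)(w^T X_k + w_0) > 0 is violated, which also illustrates the deadzone mentioned above, and the correction is written with the usual perceptron sign convention. The toy data, the step size and the function name are illustrative assumptions, not taken from the cited papers.

    import numpy as np

    def perceptron_train(X, y, gamma=1.0, max_epochs=100):
        """Recurrent perceptron-type learning on linearly separable data.

        X : (N, n) array of feature vectors X_k
        y : (N,) array of labels, +1 for class A and -1 for class B
        The weights are corrected only when y_k * (w^T X_k + w_0) <= 0,
        i.e. when the presented object is classified incorrectly.
        """
        N, n = X.shape
        w, w0 = np.zeros(n), 0.0
        for _ in range(max_epochs):
            corrected = False
            for xk, yk in zip(X, y):           # objects presented sequentially
                if yk * (w @ xk + w0) <= 0:    # inequality violated -> correct
                    w += gamma * yk * xk
                    w0 += gamma * yk
                    corrected = True
            if not corrected:                  # no corrections: a separating hyperplane is found
                break
        return w, w0

    # toy separable data (illustrative)
    X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -2.0], [-2.0, -0.5]])
    y = np.array([1, 1, -1, -1])
    w, w0 = perceptron_train(X, y)
    print(np.sign(X @ w + w0))    # reproduces y for this toy data

On data satisfying the separability assumption of Section 3.1, the number of corrections is finite, in line with the Novikoff-type results cited above.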
Another approach is to choose a vector of weights w* in such a way that the corresponding hyperplane {x : w*^T x + w_0 = 0} is a supporting hyperplane to the convex hull of the available set of vectors y(X_k)X_k, k = 1, 2, ..., N, so that the minimum distance from it to the convex hulls of the classes is maximal. This idea formed the basis of the celebrated support vector machine (SVM) method. It is mentioned in many sources that this idea was first proposed in the work Vapnik and Chervonenkis (1964), which had an English translation and has become widely known. It is a historical fact, however, that in the same year 1964 an employee of the laboratory of theoretical cybernetics of LSU, Boris Kozinets, published another fairly simple recurrent algorithm converging to an optimal supporting hyperplane Kozinets (1964). The recurrent property of the algorithm means that it is based on processing objects (images) one by one, depending on availability (unlike nonrecurrent algorithms, which need to have all the objects in memory to process them).

Another algorithm, called MDM, was proposed at LSU by V. N. Malozemov, V. F. Demyanov and D. Mitchell, Mitchell et al (1974). It is based on the formulation of the problem as minimax optimization and the use of nonsmooth optimization methods developed in the group of V. F. Demyanov, which worked mostly on nonsmooth optimization problems. In 1965-66 V. A. Yakubovich developed a systematic approach to the mathematical theory of pattern recognition, called "the method of recurrent goal inequalities" Yakubovich (1965, 1966). It is based on the reduction of the problem to the solution of a system of inequalities constructed for a given purpose of functioning of the system, and it allows one to find a solution to an infinite number of previously not shown inequalities. This allowed Yakubovich and coauthors to apply the approach to solving problems of learning regulators for dynamical systems under uncertainty, i.e. to solve problems of adaptive control Yakubovich (1972). A number of application problems were solved: training in handwriting recognition, aerial photographs, signal isolation from noise, description and analysis of scenes Kozinets et al (1966); Kharichev et al (1973).

Another powerful method was proposed in 1964 (again at LSU!) by the young mathematician Lev Bregman (he was 23 years old in 1964). Bregman proposed a recurrent algorithm for finding a point in the intersection of a finite number of convex sets in a Hilbert space. The problem was motivated by an application problem of city planning that had nothing to do with learning. Bregman's method consisted in the evaluation of consecutive projections onto the nearest point of the sets, taken in cyclic order. Weak convergence of the projections to the intersection of all the sets was proved in the paper Bregman (1965) (the paper was recommended for publication by Leonid Kantorovich, the future Nobel prize winner).

In his next paper Bregman (1966), Bregman proposed a useful functional transformation. Let f(x) be a strictly convex twice differentiable function, x ∈ R^n. Let D(x, y) = f(x) − f(y) − (grad f(y), x − y), where grad f(x) is the gradient of the function f(x). The function D(x, y) turns out to be convenient to use in convergence proofs as a Lyapunov function candidate. It was later called the Bregman divergence. In the paper Bregman (1967), Bregman extended his previous results to present an elegant framework for convex optimization. It was later widely used for machine learning, clustering, image deblurring, image segmentation, data reconstruction, etc. The terms Bregman divergence and Bregman method are now widely used: the number of papers in the journals indexed in Scopus that have those terms (and related terms Bregman projection, Bregman iteration, etc.) in the title exceeds 700 as of October 2019. As for the paper Bregman (1967) itself, it has more than 1250 citations in Scopus. It is interesting that in the paper by Gubin et al (1967) a similar problem was studied, and a number of results on strong convergence and convergence rate in Hilbert space were obtained without the use of the Bregman divergence. Also, algorithms with incomplete relaxation were proposed and convergence in a finite number of steps was established, as well as some applications to Chebyshev approximation and optimal control. The results of the paper by Gubin et al (1967) are also used by many authors in machine learning and related areas of optimization: it has about 500 citations in Scopus.
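As a small illustration of these constructions, the following sketch computes D(x, y) = f(x) − f(y) − (grad f(y), x − y) for a particular strictly convex f and finds a point in the intersection of two convex sets by consecutive projections taken in cyclic order; for f(x) = ||x||²/2 the Bregman projection coincides with the Euclidean projection used here. The choice of f, the halfspaces and the function names are illustrative assumptions, not examples from Bregman's papers.

    import numpy as np

    def bregman_divergence(f, grad_f, x, y):
        """D(x, y) = f(x) - f(y) - <grad f(y), x - y> for a strictly convex f."""
        return f(x) - f(y) - grad_f(y) @ (x - y)

    # Generating function f(x) = ||x||^2 / 2, for which D(x, y) = ||x - y||^2 / 2.
    f = lambda x: 0.5 * (x @ x)
    grad_f = lambda x: x
    print(bregman_divergence(f, grad_f, np.array([1.0, 2.0]), np.array([0.0, 0.0])))  # 2.5

    def project_halfspace(x, a, b):
        """Projection of x onto the halfspace {z : a^T z <= b}."""
        viol = a @ x - b
        return x if viol <= 0 else x - viol * a / (a @ a)

    # Consecutive projections onto two halfspaces, taken in cyclic order.
    halfspaces = [(np.array([1.0, 0.0]), 1.0), (np.array([0.0, 1.0]), -1.0)]
    x = np.array([5.0, 5.0])
    for _ in range(50):
        for a, b in halfspaces:
            x = project_halfspace(x, a, b)
    print(x)  # a point in the intersection, here [1., -1.]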
3.2 Stochastic approaches

The above-mentioned methods are based upon a deterministic approach, where an uncertainty is modeled as an element of a bounded set. However, many papers are devoted to the statistical approach, whose roots can be traced back to statistical formulations of communications and decision theory Marill & Green (1960); Widrow (1959).

The most systematic framework based on average risk minimization was developed by Yakov Tsypkin, Tsypkin (1966, 1971), who brilliantly demonstrated that it encompasses a large number of algorithms proposed by different authors as special cases. Although the idea of average risk minimization appeared earlier in operations research and was introduced into control of uncertain systems by Feldbaum (1960), Tsypkin was the first who proposed a unified approach to adaptation and learning using stochastic approximation machinery.
Parameter choice and convergence results then follow from the results on stochastic approximation obtained earlier in mathematical statistics. After Tsypkin's works, stochastic approximation became a standard instrument in the study of adaptation and learning algorithms. An important unifying idea introduced by Tsypkin (Tsypkin 1966, 1968) is the formulation of the adaptation and learning problem in terms of a performance index which is the average of the cost function Q(x, w):

J(w) = ∫_X Q(x, w) p(x) dx = E_x Q(x, w).

That is, the problem is formulated as

min_w J(w),

and its solution may be based on stochastic gradient algorithms:

w[n] = w[n − 1] − γ[n] ∇Q(x[n], w[n − 1]).

Choosing the cost function in an appropriate way allowed the author to design different classes of algorithms described previously in the literature, as well as a number of new ones. Convergence of the algorithms can be established based on the stochastic approximation scheme under conditions of convexity and bounded growth of J(w), as well as the so-called Robbins-Monro conditions on the sequence γ[n], namely

γ[n] > 0,   Σ_n γ[n] = ∞,   Σ_n γ[n]² < ∞.
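A minimal sketch of this scheme, under the illustrative assumptions that the cost is Q(x, w) = (w − x)²/2 (so that J(w) is minimized at the mean of p(x)) and that the step sizes are γ[n] = 1/n, which satisfy the Robbins-Monro conditions:

    import numpy as np

    rng = np.random.default_rng(0)

    def grad_Q(x, w):
        """Gradient in w of the illustrative cost Q(x, w) = (w - x)^2 / 2."""
        return w - x

    w = 0.0
    for n in range(1, 10001):
        x = rng.normal(loc=3.0, scale=1.0)   # sample x[n] from p(x)
        gamma = 1.0 / n                      # gamma[n] > 0, sum gamma = inf, sum gamma^2 < inf
        w = w - gamma * grad_Q(x, w)         # w[n] = w[n-1] - gamma[n] * grad Q(x[n], w[n-1])

    print(w)   # approaches argmin J(w) = E x = 3.0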
4. TWO WINTERS OF AI AND SECOND BOOM OF MACHINE LEARNING

In 1969 the book by M. Minsky and S. Papert, Minsky & Papert (1969), was published, where some limitations on the complexity of the problems that can be solved by perceptrons were established. Namely, the authors emphasized that perceptrons cannot represent some logical functions like XOR or NXOR. As a result, very little research was done in this area until about the 1980s. The book triggered a reduction of funding of AI research in the world for more than two decades. This period was later called the first winter of AI. Nevertheless, the study of more complex learning algorithms was still continuing.

Further studies of the structures and learning abilities of multilayer neural networks took place in the 1970-1980s. In 1980 Kunihiko Fukushima proposed a hierarchical multilayered convolutional neural network, known as the neocognitron, Fukushima (1980). A significant impact was made by the invention of the backpropagation learning algorithm by several authors in the mid-eighties, Rumelhart et al (1986a,b), although its initial ideas had been proposed as far back as the early 1960s, see Dreyfus (1990); Widrow & Lehr (1990); Werbos (1990). Compared to a standard gradient descent, which updates all the parameters with respect to the error simultaneously, backpropagation first propagates the error term at the output layer back to the layer at which the parameters need to be updated, and then uses the chain rule to update the parameters with respect to the propagated error. Some drawbacks of backpropagation were discovered, see Brady et al (1989); Gori & Tesi (1992). In particular, it may fail in the case when the classes cannot be linearly separated.
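To illustrate the mechanism just described, here is a minimal sketch of backpropagation for a one-hidden-layer network with sigmoid units and a quadratic loss, trained on the XOR function that a single perceptron cannot represent; the architecture, the learning rate and the initialization are illustrative assumptions, not the networks of the cited papers.

    import numpy as np

    rng = np.random.default_rng(1)

    # XOR data: not linearly separable, hence not representable by a single perceptron.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    t = np.array([[0], [1], [1], [0]], dtype=float)

    n_hidden = 4
    W1, b1 = rng.normal(size=(2, n_hidden)), np.zeros(n_hidden)
    W2, b2 = rng.normal(size=(n_hidden, 1)), np.zeros(1)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    lr = 0.5

    for _ in range(20000):
        # forward pass
        h = sigmoid(X @ W1 + b1)                      # hidden layer activations
        y = sigmoid(h @ W2 + b2)                      # network outputs
        # backward pass: form the error term at the output layer,
        # then propagate it back and apply the chain rule
        delta_out = (y - t) * y * (1 - y)
        delta_hid = (delta_out @ W2.T) * h * (1 - h)
        # update the parameters with respect to the propagated error
        W2 -= lr * h.T @ delta_out
        b2 -= lr * delta_out.sum(axis=0)
        W1 -= lr * X.T @ delta_hid
        b1 -= lr * delta_hid.sum(axis=0)

    print(np.round(y.ravel(), 2))   # typically close to [0, 1, 1, 0]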
Intensive advertisement of the success of backpropagation and other computational advances produced great hopes for future successes. However, the real successes were less than the expected ones, and the investments in the area of machine learning decreased again in the early 1990s. This period is sometimes called the second winter of AI.

Nevertheless, interest in studying neural networks as an instrument of machine learning was growing in the 1990s. A real breakthrough was produced by the paper Cortes & Vapnik (1995), where a new version of the SVM algorithm for the general non-separable case was introduced under the name support-vector networks. It has been widely used and has been shown to be effective in practice. The algorithm and its theory have had a profound impact on theoretical and applied machine learning and inspired research on a variety of topics Mohri et al (2014). The paper Cortes & Vapnik (1995) had more than 22000 citations in Scopus by 2019.
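For readers who wish to see the soft-margin formulation in action, below is a minimal sketch using scikit-learn's SVC with a linear kernel on synthetic overlapping data; the dataset and the value of the penalty parameter C are illustrative assumptions, and scikit-learn is assumed to be available.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)

    # Two overlapping Gaussian clouds: no hyperplane separates them exactly.
    X = np.vstack([rng.normal(loc=[-1.0, -1.0], scale=1.0, size=(100, 2)),
                   rng.normal(loc=[1.0, 1.0], scale=1.0, size=(100, 2))])
    y = np.array([-1] * 100 + [1] * 100)

    # Soft-margin SVM: C trades off margin width against misclassification penalties.
    clf = SVC(kernel="linear", C=1.0).fit(X, y)
    print(clf.coef_, clf.intercept_)   # parameters of the separating hyperplane
    print(clf.score(X, y))             # training accuracy (below 1 on overlapping data)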
5. THE GOLD RUSH OF MACHINE LEARNING

The beginning of the first decade of the XXI century turned out to be a turning point in the history of ML, and this is explained by three synchronous trends which together gave a noticeable synergistic effect. The first trend is Big Data. The amount of data became so big that new approaches were brought to life by practical necessity rather than by the curiosity of scientists.

The second trend is the reduced cost of parallel computing and memory. This trend became apparent in 2004, when Google unveiled its MapReduce technology, followed by its open-source counterpart Hadoop (2006); together they made it possible to distribute the processing of huge amounts of data between simple processors. At the same time, Nvidia made a breakthrough in the GPU market: if earlier, in the gaming segment, it could compete with AMD/ATI, then in the segment of GPUs that can be used for machine learning purposes it proved to be a monopolist. And at the same time, the cost of RAM decreased significantly, which opened the possibility to work with large amounts of data in memory; as a result, numerous new types of databases appeared, including NoSQL. Finally, in 2014, the Apache Spark framework for distributed processing of unstructured and weakly structured data appeared, which was convenient for the implementation of machine learning algorithms.

The third trend is the development of new algorithms of deep machine learning, inheriting and developing the idea of the perceptron in combination with a successful scientific PR campaign. After many years of study of multilayer neural networks, a concept and a technology of deep neural networks (DNN) were born. It is believed that the term deep learning was proposed in 1986 by Rina Dechter, Dechter (1986), although the history of its appearance is apparently more complicated. A detailed and preferably objective analysis of the events of this period is still waiting for its researcher. Different points of view on deep learning can be found, e.g., on the websites Foote (2017); Kurenkov (2015); Wang & Raj (2017).

By the middle of the last decade, a critical mass of knowledge in the field of DNN was accumulated and, as always
in such cases, someone has broken away from the peloton. In this case, the leader was Geoffrey Hinton, a British scientist who continued his career in Canada. Since 2006 he and his colleagues have published numerous articles on DNN, including papers in the popular multidisciplinary journal Nature. This earned him lifetime fame as a classic. Around him a strong and cohesive community formed that worked for several years "in the invisible mode". Its members called themselves the "Deep Learning Conspiracy" or even the "Canadian Mafia". A leading trio was formed: Yann LeCun, Yoshua Bengio and Geoffrey Hinton; they are also called LBH (LeCun & Bengio & Hinton). LBH's exit from the underground was well prepared and supported by Google, Facebook and Microsoft. Andrew Ng, who worked at MIT and Berkeley and is now the head of artificial intelligence research at the Baidu lab, worked extensively with LBH. He linked deep learning to GPUs. Finally, Yann LeCun, Yoshua Bengio and Geoffrey Hinton were awarded the Turing Award in 2018 "for conceptual and engineering breakthroughs that have made deep neural networks a critical component of computing". The leader of the trio, Geoffrey Hinton, has more than 300,000 citations in Google Scholar by November 2019.

Quite a number of books and textbooks are devoted to machine learning, see e.g. Bishop (2006); Mohri et al (2014); Theodoridis (2015), including Vapnik (1998), which has more than 83,000 citations in Google Scholar. However, books are becoming obsolete very rapidly, and the best way to feel the state of the art is to participate in one of the flagship conferences, e.g. NeurIPS or ICML. Canada has become the Mecca of ML: it will host NeurIPS in 2019 and 2020.

6. DISCUSSION

Looking through the several decades of the history of adaptation and learning, one can see that during the first few decades those two areas were very close to each other and a fruitful cooperation between them could be observed. However, the dramatic development of machine learning during the last two decades has led to a divergence of the two sister fields, which is an issue of discussion in the control community. So far, control-related views have been published only for the narrower areas of iterative learning, Bristow et al (2006), and reinforcement learning, Recht (2018).

This paper is an attempt to present a general control-related view of the issue in historical perspective. Attention is focused mainly on the first decades of the history of machine learning. The possibility of tighter interaction between adaptive control and machine learning in the future is still unclear, although some papers coauthored by experts in control and experts in machine learning have been published and well accepted. E.g., the paper by Belhumeur et al (1997) has more than 7000 citations after 20 years, while the paper by Wright et al (2009) got more than 5400 citations after ten years.

Looking into the past, one can see a few success stories of applying ideas well known in the optimization and control area to machine learning. One of them is the application of multistep tuning algorithms to increase the convergence rate of gradient-type algorithms. E.g., acceleration of convergence by the "heavy ball" method or by averaging has been known in optimization and control since the 1960s, Polyak (1964); Tsypkin (1971); Nesterov (1983); Polyak & Juditsky (1992). However, 300 out of about 320 citations of the seminal paper Polyak (1964) were made in the last five years.
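As an illustration, below is a minimal sketch comparing plain gradient descent with the heavy-ball iteration of Polyak (1964) on an ill-conditioned quadratic; the objective and the parameter values are illustrative assumptions.

    import numpy as np

    # Ill-conditioned quadratic f(w) = 0.5 * w^T A w, minimized at w = 0.
    A = np.diag([1.0, 100.0])
    grad = lambda w: A @ w

    def gradient_descent(w, lr=0.009, steps=200):
        for _ in range(steps):
            w = w - lr * grad(w)
        return w

    def heavy_ball(w, lr=0.009, beta=0.9, steps=200):
        """Polyak (1964): w_{k+1} = w_k - lr * grad f(w_k) + beta * (w_k - w_{k-1})."""
        w_prev = w.copy()
        for _ in range(steps):
            w_next = w - lr * grad(w) + beta * (w - w_prev)
            w_prev, w = w, w_next
        return w

    w0 = np.array([1.0, 1.0])
    print(np.linalg.norm(gradient_descent(w0)))   # slow progress along the flat direction
    print(np.linalg.norm(heavy_ball(w0)))         # the momentum term accelerates convergence

The same multistep idea underlies the momentum variants now routinely used to train deep neural networks.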
Looking into the future, we see a new expansion of the field. The top international conference on ML, 'Neural Information Processing Systems' (NeurIPS, formerly NIPS, neurips.cc), had more than 8000 participants in 2018, while the second top conference, 'International Conference on Machine Learning' (ICML, icml.cc), got more than 6000 participants in 2019.

What is a challenge for control theorists is that there are very few rigorous mathematical proofs in the new deep learning 'tsunami'. Since machine learning researchers needed means to compare the effectiveness of their methods, and it was hard to find solid mathematical means to test the performance of the numerous proposed methods², standard datasets of training and testing sets that could be used to evaluate machine learning algorithms were proposed.

² Perhaps the only exception is the Vapnik-Chervonenkis theory of uniform convergence of frequencies to probabilities, which produced the notion of the Vapnik-Chervonenkis (VC) dimension.

The author would be happy if this paper encouraged the readers to seek new links between the "good old" adaptation and learning theory of control/cybernetics and the new paradigms and trends of machine learning and deep neural networks. Strengthening such links would be of mutual benefit for both fields.

REFERENCES

Abramson N., D. Braverman and G. Sebestyen (1963). Pattern recognition and machine learning. IEEE Transactions on Information Theory, V. 9, Is. 4.
Aizerman M.A. (1963). The problem of training an automaton to perform classification of input situations (pattern recognition). Theory Self-Adapt. Contr. Syst., Proc. IFAC Symp., 2nd, 1963.
Aizerman, M.A., Braverman, E.M., and Rozonoer, L.I. (1964a). Theoretical foundations of the potential function method in the problem of training automata to classify input situations. Automat. Remote Contr. (USSR) 25 (6).
Aizerman, M.A., Braverman, E.M., and Rozonoer, L.I. (1964b). The probabilistic problem of training automata to recognize patterns and the potential function method. Automat. Remote Contr. (USSR) 25 (9).
Belhumeur P.N., Hespanha J.P. and D.J. Kriegman (1997). Eigenfaces vs. Fisherfaces: recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, V. 19, Is. 7.
Bishop C.M. (2006). Pattern Recognition and Machine Learning. Springer.
Bongard M. M. (1961). Simulation of the recognition process on a digital computing machine. Biophysics, vol. 4, No. 2.
Brady M.L., Raghavan R., and J. Slawny (1989). Back propagation fails to separate where perceptrons succeed. IEEE Transactions on Circuits and Systems, 36(5):665-674.
Braverman, E.M. (1962). The experiments with training a machine to recognize patterns. Automat. Remote Contr. 23 (3).
Bregman, L.M. (1965). Use of consecutive projection for finding a common point of convex sets. Doklady Akademii Nauk SSSR, Vol. 162, Is. 3, pp. 487-490.
Bregman L.M. (1966). Relaxation method of finding a point common to some given convex sets and its use in solving optimization problems. Doklady Akademii Nauk SSSR, Vol. 171, Is. 5, pp. 1019-1023.
Bregman L.M. (1967). The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics 7(3), pp. 200-217.
Bristow D.A., Tharayil M., Alleyne A.G. (2006). A Survey of Iterative Learning Control. IEEE Control Systems Magazine, Is. 1, pp. 96-114.
Bush, R. R., and Mosteller, F. (1951). A mathematical model for simple learning. Psychological Review, 58(5), 313-323.
Cortes C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, V. 20, Is. 3, pp. 273-297.
Dechter R. (1986). Learning while searching in constraint-satisfaction problems. AAAI-86 Proceedings, pp. 179-183.
Dreyfus, S. E. (1990). Artificial Neural Networks, Back Propagation, and the Kelley-Bryson Gradient Procedure. Journal of Guidance, Control, and Dynamics, 13(5): 926-928.
Feldbaum, A. A. (1960). Dual Control Theory. Automation and Remote Control, Vol. 21, Parts I, II, April 1961, pp. 874-880, and May 1961, pp. 1033-1039.
Foote K.D. (2017). A Brief History of Deep Learning. February 7, 2017. https://www.dataversity.net/brief-history-deep-learning/
Fukushima K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4): 193-202.
Gori M. and A. Tesi (1992). On the problem of local minima in backpropagation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(1):76-86.
Gubin L.G., Polyak B.T., Raik E.V. (1967). The method of projections for finding the common point of convex sets. USSR Computational Mathematics and Mathematical Physics 7(6), pp. 1-24.
Kharichev V.V., Schmidt A.A. and V. A. Yakubovich (1973). New problem in pattern recognition. Autom. Remote Control, V. 34, Is. 1, 98-109.
Kozinets B.N. (1964). On one algorithm for learning a linear perceptron. In Vichislitelnaia Tekhnika i Voprosi Programirovania, Vol. 3. Leningrad State University Press, Leningrad. (In Russian.)
Kozinets, B.N., Lantsman, R.M., Yakubovich, V.A. (1966). Use of electron computers in criminalistics for differentiation between very similar handwritings. Doklady Akademii Nauk SSSR, V. 167, Is. 5, pp. 1008-1011.
Kurenkov A. (2015). A 'Brief' History of Neural Nets and Deep Learning. https://www.andreykurenkov.com/writing/ai/a-brief-history-of-neural-nets-and-deep-learning/
Marill T. and D. M. Green (1960). Statistical Recognition Functions and the Design of Pattern Recognizers. IRE Transactions on Electronic Computers, V. EC-9, Is. 4.
Minsky M. and S. Papert (1969). Perceptrons: An Introduction to Computational Geometry. MIT Press, Cambridge, MA, USA.
Mitchell B.F., Dem'yanov V.F., Malozemov V.N. (1974). Finding the point of a polyhedron closest to the origin. SIAM J. Control 12(1), pp. 19-26. (Translated from: Vestnik of Leningrad University, 1971, Is. 19, pp. 38-45.)
Mohri M., Rostamizadeh A. and Talwalkar A. (2014). Foundations of Machine Learning. MIT Press.
Nesterov Y. (1983). A method of solving a convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, 27(2), 372-376.
Novikoff, A. B. (1962). On convergence proofs on perceptrons. Symposium on the Mathematical Theory of Automata, Polytechnic Institute of Brooklyn, 12, 615-622.
Polyak, B.T. (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics 4(5), pp. 1-17.
Polyak, B.T. and A.B. Juditsky (1992). Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization 30(4), pp. 838-855.
Recht B. (2018). A Tour of Reinforcement Learning: The View from Continuous Control. arXiv:1806.09460v2 [math.OC] 10 Nov 2018.
Rosenblatt F. (1957). The Perceptron, A Perceiving and Recognizing Automaton. Project Para Report No. 85-460-1, Cornell Aeronautical Laboratory (CAL).
Rosenblatt F. (1959). Two Theorems of Statistical Separability in the Perceptron. Symposium on the Mechanisation of Thought Processes, National Physical Laboratory, Teddington, UK, Nov. 1958, Vol. I, H.M. Stationery Office, London.
Rosenblatt, F. (1960). Perceptron Simulation Experiments. Proc. Inst. Radio Engineers, V. 18, 3, March 1960, 301-309.
Rumelhart D.E., Hinton G.E. and R. J. Williams (1986a). Learning internal representations by error propagation. In Parallel Distributed Processing, vol. 1, ch. 8, D. E. Rumelhart and J. L. McClelland, Eds., Cambridge, MA: M.I.T. Press.
Rumelhart D.E., Hinton G.E. and R. J. Williams (1986b). Learning representations by back-propagating errors. Nature, 323 (6088), 533-536.
Theodoridis S. (2015). Machine Learning. A Bayesian and Optimization Perspective. Elsevier.
Tsypkin, Y.Z. (1966). Adaptation, training and self-organization in automatic systems. Automation and Remote Control, V. 27, Is. 1, pp. 16-51.
Tsypkin Ya.Z. (1971). Adaptation and Learning in Automated Systems. NY: Academic Press (Russian original 1968).
Vapnik V.N. (1998). The Nature of Statistical Learning Theory. Springer.
Vapnik V. N. and A. Ya. Lerner (1963). Recognition of patterns with help of generalized portraits. Avtomat. i Telemekh., 24:6, 774-780.
Vapnik, V. and Chervonenkis, A. (1964). On a class of perceptrons. Automat. Remote Control 25, 103-109.
Wang H. and B. Raj (2017). On the Origin of Deep Learning. arXiv:1702.07800v4 [cs.LG] 3 Mar 2017.
Werbos P.J. (1990). Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550-1560.
Widrow, B. (1959). Adaptive sampled-data systems. A statistical theory of adaptation. WESCON Conv. Rec. 3 (4).
Widrow, B. (1961). Self-adaptive discrete systems. Theory Self-Adapt. Contr. Syst., Proc. IFAC Symp., 1st, 1961.
Widrow, B. (1964). Pattern recognition and adaptive control. IEEE Trans. Appl. Ind. 83 (74).
Widrow B. and M. Lehr (1990). 30 Years of Adaptive Neural Networks: Perceptron, Madaline, and Backpropagation. Proceedings of the IEEE, V. 78, Is. 9.
Wright J., Yang A.Y., Ganesh A., Sastry S., Yi Ma (2009). Robust Face Recognition via Sparse Representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, V. 31, Is. 2.
Yakubovich, V.A. (1963). Machines that learn to recognize patterns. In Metodi Vichisleniy, Vol. 2. Leningrad State University Press, Leningrad. (In Russian. To be translated from the reprint: Vestnik of Saint-Petersburg University. Mathematics, 4, 2019.)
Yakubovich, V.A. (1965). Certain general theoretical principles in the design of learning pattern recognition systems, Part I. In Vichislitelnaia Tekhnika i Voprosi Programirovania, Vol. 4. Leningrad State University Press, Leningrad. (In Russian.)
Yakubovich, V.A. (1966). Recurrent finitely convergent algorithms for solving systems of inequalities. Sov. Math. Doklady, V. 7, pp. 300-304.
Yakubovich V.A. (1972). On a method of adaptive control under conditions of great uncertainty. Prepr. 5th World Congress IFAC (Paris), V. 37, N 3, pp. 1-6.
