Chapter 3

Stochastic Learning Automata
An automaton is a machine or control mechanism designed to automatically follow a
predetermined sequence of operations or respond to encoded instructions. The term stochastic
emphasizes the adaptive nature of the automaton we describe here. The automaton described here
do not follow predetermined rules, but adapts to changes in its environment. This adaptation is
the result of the learning process described in this chapter.
“The concept of learning automaton grew out of a fusion of the work of
psychologists in modeling observed behavior, the efforts of statisticians to model the
choice of experiments based on past observations, the attempts of operation
researchers to implement optimal strategies in the context of the two-armed bandit
problem, and the endeavors of system theorists to make rational decisions in random
environments” [Narendra89].
In classical control theory, the control of a process is based on complete knowledge of the
process/system. The mathematical model is assumed to be known, and the inputs to the process
are deterministic functions of time. Later developments in control theory considered the
uncertainties present in the system. Stochastic control theory assumes that some of the
characteristics of the uncertainties are known. However, all those assumptions on uncertainties
and/or input functions may be insufficient to successfully control the system if changes. It is
then necessary to observe the process in operation and obtain further knowledge of the system,
i.e., additional information must be acquired on-line since a priori assumptions are not sufficient.
One approach is to view these as problems in learning.
Rule-based systems, although performing well on many control problems, have the
disadvantage of requiring modifications, even for a minor change in the problem space.
Furthermore, rule-based approach, especially expert systems, cannot handle unanticipated
situations. The idea behind designing a learning system is to guarantee robust behavior without
the complete knowledge, if any, of the system/environment to be controlled. A crucial advantage
of reinforcement learning compared to other learning approaches is that it requires no information
about the environment except for the reinforcement signal [Narendra89, Marsh93].
A reinforcement learning system is slower than other approaches for most applications
since every action needs to be tested a number of times for a satisfactory performance. Either the
learning process must be much faster than the environment changes (as is the case in our
Cem Ünsal Chapter 3. Stochastic Learning Automata 30
application), or the reinforcement learning must be combined with an adaptive forward model
that anticipates the changes in the environment [Peng93].
Learning
1
is defined as any permanent change in behavior as a result of past experience,
and a learning system should therefore have the ability to improve its behavior with time, toward
a final goal. In a purely mathematical context, the goal of a learning system is the optimization of
a functional not known explicitly [Narendra74].
In the 1960’s, Y. Z. Tsypkin [Tsypkin71] introduced a method to reduce the problem to
the determination of an optimal set of parameters and then apply stochastic hill climbing
techniques. M.L. Tsetlin and colleagues [Tsetlin73] started the work on learning automata during
the same period. An alternative approach to applying stochastic hill-climbing techniques,
introduced by Narendra and Viswanathan [Narendra72], is to regard the problem as one of finding
an optimal action out of a set of allowable actions and to achieve this using stochastic automata.
The difference between the two approaches is that the former updates the parameter space at
each iteration while the later updates the probability space.
The stochastic automaton attempts a solution of the problem without any information on
the optimal action (initially, equal probabilities are attached to all the actions). One action is
selected at random, the response from the environment is observed, action probabilities are
updated based on that response, and the procedure is repeated. A stochastic automaton acting as
described to improve its performance is called a learning automaton.
3.1 Earlier Works
The first learning automata models were developed in mathematical psychology. Early research in
this area is surveyed by Bush and Mosteller [Bush58] and Atkinson et al. [Atkinson65]. Tsetlin
[Tsetlin73] introduced deterministic automaton operating in random environments as a model of
learning. Most of the works done on learning automata theory has followed the trend set by
Tsetlin. Varshavski and Vorontsova [Varshavski63] described the use of stochastic automata
with updating of action probabilities which results in reduction in the number of states in
comparison with deterministic automata.
Fu and colleagues [Fu65a, Fu65b, Fu67, Fu69a, Fu69b, Fu71] were the first researchers
to introduce stochastic automata into the control literature. Applications to parameter estimation,
pattern recognition and game theory were initially considered by this school. Properties of linear
updating schemes and the concept of a ‘growing’ automaton are defined by McLaren
[McLaren66]. Chandrasekaran and Shen [Chand68,Chand69] studied nonlinear updating schemes,
nonstationary environments and games of automata. Narendra and Thathachar have studied the
theory and applications of learning autoamta and carried out simulation studies in the area. Their
book Learning Automata [Narendra89] is an introduction to learning automata theory which
surveys all the research done on the subject until the end of the 1980s.

1
Webster defines ‘learning’ as ‘to gain knowledge or understanding of a skill by study, instruction, or experience.’
Cem Ünsal Chapter 3. Stochastic Learning Automata 31
The most recent (and second most comprehensive) book since the Narendra-Thathachar
collaboration on learning automata theory and applications is published by Najim and Pznyak in
1994 [Najim94]. This book also includes several applications and examples of learning automata.
Until recently, the applications of learning automata to control problems were rare.
Consequently, successful updating algorithms and theorems for advanced automata applications
were not readily available. Nario Baba’s work [Baba85] on learning behaviors of stochastic
automata under a nonstationary multi-teacher environment, and general linear reinforcement
schemes [Bush58] are taken as a starting point for the research presented in this dissertation.
Recent applications of learning automata to real life problems include control of
absorption columns [Najim91], bioreactors [Gilbert92], control of manufacturing plants
[Sequeria91], pattern recognition [Oommen94a], graph partitioning [Oommen94b], active vehicle
suspension [Marsh93, 95], path planning for manipulators [Naruse93], distributed fuzzy logic
processor training [Ikonen97], and path planning [Tsoularis93] and action selection [Aoki95] for
autonomous mobile robots.
Recent theoretical results on learning algorithms and techniques can be found in
[Najim91b], [Najim94], [Sastry93], [Papadimitriou94], [Rajaraman96], [Najim96], and
[Poznyak96].
3.2 The Environment and the Automaton
The learning paradigm the learning automaton presents may be stated as follows: a finite number
of actions can be performed in a random environment. When a specific action is performed the
environment provides a random response which is either favorable or unfavorable. The objective
in the design of the automaton is to determine how the choice of the action at any stage should be
guided by past actions and responses. The important point to note is that the decisions must be
made with very little knowledge concerning the “nature” of the environment. It may have time-
varying characteristics, or the decision maker may be a part of a hierarchical decision structure
but unaware of its role in the hierarchy. Furthermore, the uncertainty may be due to the fact that
the output of the environment is influenced by the actions of other agents unknown to the
decision maker.
The environment in which the automaton “lives” responds to the action of the automaton
by producing a response, belonging to a set of allowable responses, which is probabilistically
related to the automaton action. The term environment
2
is not easy to define in the context of
learning automata. The definition encompasses a large class of unknown random media in which
an automaton can operate. Mathematically, an environment is represented by a triple {α, c, β}
where α represents a finite action/output set, β represents a (binary) input/response set, and c is
a set of penalty probabilities, where each element c
i
corresponds to one action α
i
of the set α.
The output (action) α(n) of the automaton belongs to the set α, and is applied to the
environment at time t = n. The input β(n) from the environment is an element of the set β and can

2
Commonly refers to the aggregate of all the external conditions and influences affecting the life and development of
an organism.
Cem Ünsal Chapter 3. Stochastic Learning Automata 32
take on one of the values β
1
and β
2
. In the simplest case, the values β
i
are chosen to be 0 and 1,
where 1 is associated with failure/penalty response. The elements of c are defined as:
{ } Prob β α α ( ) ( ) ( , ,...) n n c i
i i
· · · · 1 12 (3.1)
Therefore c
i
is the probability that the action α
i
will result in a penalty input from the
environment. When the penalty probabilities c
i
are constant, the environment is called a
stationary environment.
There are several models defined by the response set of the environment. Models in
which the input from the environment can take only one of two values, 0 or 1, are referred to as
P-models
3
. In this simplest case, the response value of 1 corresponds to an “unfavorable”
(failure, penalty) response, while output of 0 means the action is “favorable.” A further
generalization of the environment allows finite response sets with more than two elements that
may take finite number of values in an interval [a, b]. Such models are called Q-models. When the
input from the environment is a continuous random variable with possible values in an interval [a,
b], the model is named S-model.
Input Output
State
Transition Function
Output Function α
F
G β
φ
Automaton
Environment
Penalty probabilities c
Figure 3.1. The automaton and the environment.
The automaton can be represented by a quintuple {Φ, α, β, F(•,•), H(•,•)} where:
w Φ is a set of internal states. At any instant n, the state φ(n) is an element of the finite
set Φ = {φ
1
, φ
2
,..., φ
s
}
w α is a set of actions (or outputs of the automaton). The output or action of an
automaton an the instant n, denoted by α(n), is an element of the finite set α = {α
1,
α
2
,...,
α
r
}
w β is a set of responses (or inputs from the environment). The input from the
environment β(n) is an element of the set β which could be either a finite set or an infinite
set, such as an interval on the real line:
β β β β β · · { , ,..., } {( , )}
1 2 m
a b or (3.2)

3
We will follow the notation used in [Narendra89].
Cem Ünsal Chapter 3. Stochastic Learning Automata 33
w F(•,•): Φ × β → Φ is a function that maps the current state and input into the next
state. F can be deterministic or stochastic:
φ φ β ( ) [ ( ), ( )] n F n n + · 1 (3.3)
w H(•,•): Φ × β → α is a function that maps the current state and input into the current
output. If the current output depends on only the current state, the automaton is referred
to as state-output automaton. In this case, the function H(•,•) is replaced by an output
function G(•): Φ → α, which can be either deterministic or stochastic:
α φ ( ) [ ( )] n G n · (3.4)
For our applications, we choose the function G(•) as the identity function, i.e., the states
of the automata are also the actions.
3.3 The Stochastic Automaton
We now introduce the stochastic automaton in which at least one of the two mappings F and G is
stochastic. If the transition function F is stochastic, the elements f
ij
β
of F represent the
probability that the automaton moves from state φ
i
to state φ
j
following an input β
4
:
f n n n i j s
ij j i m

β
φ φ φ φ β β β β β β · + · · · · · Pr{ ( ) ( ) , ( ) } , , ,..., , ,..., 1 12
1 2
(3.5)
For the mapping G, the definition is similar:
g n n i j r
ij j i
· · · · Pr{ ( ) ( ) } , , ,..., α α φ φ 12 (3.6)
Since f
ij
β
are probabilities, they lie in the closed interval [a, b]
5
; and to conserve probability
measure we must have:
f
ij
j
s

β
·

·
1
1 for each β and i. (3.7)
Example
States: φ
1
, φ
2
Inputs: β
1

2
Transition matrices:
F
F
( )
. .
. .
( )
. .
. .
β
β
1
2
050 050
025 075
01 09
075 025
·

¸

1
]
1
·

¸

1
]
1

4
In Chapter 4, where we define our initial algorithm, the probabilities fij
β
will be the same for all inputs β and all
states i given a state φj. Furthermore, the output mapping G is chosen as identity mapping.
5
[a, b]=[0, 1] in most cases.
Cem Ünsal Chapter 3. Stochastic Learning Automata 34
Transition Graphs:
β · β
β · β
Φ Φ
1 2
Φ
1
Φ
2
1
2
0.90
0.10
0.75
0.25
0.50
0.50
0.75
0.25
In the previous example, the conditional probabilities f
ij
β
were assumed to be constant,
i.e., independent of the time step n, and the input sequence. Such a stochastic automaton is
referred to as a fixed-structure automaton. As we will discuss later, it is useful to update the
transition probabilities at each step n on the basis of the environment response at that instant.
Such an automaton is called variable-structure automaton.
Furthermore, in the case of variable-structure automaton, the above definitions of the
transition functions F and G are not used explicitly. Instead of transition matrices, a vector of
action probabilities p(n) is defined to describe the reinforcement schemes — as we introduce in
the next sections.
If the variable-structure automaton is a state-output automaton, and the probabilities of
transition from one state to another f
ij
β
do not depend on current state and environment input,
then the relation between the action probability vector and the transition matrices is as follows:
• Since the automaton is a state-output automaton, we omit the transition matrix G, and
only consider the transition matrix F.
• The transition does not depend on the environment response, therefore there is only
one state transition matrix F where the elements are f
ij
. For example, we must have:
F F F
f f
f f
( ) ( ) β β
1 2
11 12
21 22
· ≡ ·

¸

1
]
1
(3.8)
• Furthermore, the probability of being in one state (or generating one action) does not
depend on the initial/previous state of the automaton. Therefore, the transition matrix reduces to:
[ ] F f f f f f f p · · ≡ · ≡ ≡
11 21 1 12 22 2
(3.9)
where vector p consists of probabilities p
i
of the automaton being in the i
th
state (or choosing the
i
th
output/action).
3.4 Variable Structure Automaton and Its Performance Evaluation
Before introducing the variable structure automata, we will introduce the definitions necessary to
evaluate the performance of a learning variable-structure automaton. A learning automaton
generates a sequence of actions on the basis of its interaction with the environment. If the
automaton is “learning” in the process, its performance must be superior to “intuitive” methods.
Cem Ünsal Chapter 3. Stochastic Learning Automata 35
To judge the performance of the automaton, we need to set up quantitative norms of behavior.
The quantitative basis for assessing the learning behavior is quite complex, even in the simplest
P-model and stationary random environments. To introduce the definitions for “norms of
behavior”, we will consider this simplest case. Further definitions for other models and non-
stationary environments will be given whenever necessary.
3.4.1 Norms of Behavior
If no prior information is available, there is no basis in which the different actions α
i
can be
distinguished. In such a case, all action probabilities would be equal — a “pure chance” situation.
For an r-action automaton, the action probability vector p(n) = Pr {α(n) = α
i
} is given by:
p n
r
i r
i
( ) , ,..., · ·
1
12 (3.10)
Such an automaton is called “pure chance automaton,” and will be used as the standard for
comparison. Any learning automaton must at least do better than a pure-chance automaton.
Consider a stationary random environment with penalty probabilities
{ } c c c
r 1 2
, ,..., where { } c n n
i i
· · · Pr ( ) ( ) β α α 1 . We define a quantity M(n) as the average
penalty for a given action probability vector:
M n E n p n n p n
n n n
c p n
i
i
r
i
i i
i
r
( ) [ ( ) ( )] Pr{ ( ) ( )}
Pr{ ( ) ( ) } Pr{ ( ) }
( )
· · ·
· · · ⋅ ·
·
·
·


β β
β α α α α
1
1
1
1


(3.11)
For the pure-chance automaton, M(n) is a constant denoted by M
o
:
M
r
c
o i
i
r
·
·

1
1
(3.12)
Also note that:
E M n E E n p n
E n
[ ( )] { [ ( ) ( )]}
[ ( )]
·
·
β
β
(3.13)
i.e., E[M(n)] is the average input to the automaton. Using above definitions, we have the
following:
Definition 3.1: A learning automaton is expedient if lim [ ( )]
n
o
E M n M
→∞
< .
Since p n
i
i
r
( ) ·
·

1
1
, we can write inf ( ) inf { ( )} min { }
( )
M n c p n c c
p n i i i i
i
r
l
· · ≡
·

1
.
Definition 3.2: A learning automaton is said to be optimal if:
lim [ ( )]
n
l
E M n c
→∞
· (3.14)
Cem Ünsal Chapter 3. Stochastic Learning Automata 36
Optimality implies that the action α
l
is associated with the minimum penalty probability c
l
is
chosen asymptotically with probability one. In spite of efforts of many researchers, the general
algorithm which ensures optimality has not been found [Baba83, Kushner72, Narendra89]. It
may not be possible to achieve optimality in every given situation. In this case, a suboptimal
behavior is defined.
Definition 3.3: A learning automaton is e-optimal if:
lim [ ( )]
n
l
E M n c
→∞
· + ε (3.15)
where e is arbitrarily small positive number.
Some automata satisfy the conditions stated in the definitions for specified initial
conditions and for certain sets of penalty probabilities. However, we may be interested in
automata that exhibit a desired behavior in arbitrary environments and for arbitrary initial
conditions. These requirements are partially met by an absolutely expedient automaton, which is
defined as:
Definition 3.4: A learning automaton is absolutely expedient if:
E[M(n+1) | p(n)] < M(n) (3.16)
for all n, all p
i
(n) ∈ (0, 1) and for all possible sets { } c c c
r 1 2
, ,..., . Taking the expectations of both
sides, we can further see that:
E M n E M n [ ( )] [ ( )] + < 1 (3.17)
In the application of learning automata techniques to intelligent control, we are mainly
interested in the case where there is one “optimal” action α
j
with c
j
= 0
6
. In such an environment,
a learning automaton is expected to reach a “pure optimal” strategy [Najim94]:
p n e
j
r
( ) → (3.18)
where e
j
r
is the unit vector with j
th
component equal to 1. In other words, the automaton will be
optimal since [ ] E M n c p n c
i i
i
r
l
( ) ( ) · ∑ →
·1
.
3.4.2 Variable Structure Automata
A more flexible learning automaton model can be created by considering more general stochastic
systems in which the action probabilities (or the state transitions) are updated at every stage
using a reinforcement scheme. In this section, we will introduce the general description of a
reinforcement scheme, linear and nonlinear. For simplicity, we assume that each state

6
We will discuss the feasibility of the case later.
Cem Ünsal Chapter 3. Stochastic Learning Automata 37
corresponds to one action, i.e., the automaton is a state-output automaton. The simulation
examples in later chapters also use the same identity state-output mapping.
In general terms, a reinforcement scheme can be represented as follows:
p n T p n n n ( ) [ ( ), ( ), ( )] + · 1
1
α β (3.19a)
or
f n T f n n n n
ij ij

β β
φ φ β ( ) [ ( ), ( ), ( ) ( )] + · + 1 1
2
(3.19b)
where T
1
and T
2
are mappings. Again, α is the action, β is the input from environment, and φ is
the state of the automaton. From now on, we will use the first mathematical description (3.19a)
for reinforcement schemes. If p(n+1) is a linear function of p(n), the reinforcement scheme is said
to be linear; otherwise it is termed nonlinear.
The simplest case
Consider a variable-structure automaton with r actions in a stationary environment with
{ } β · 0 1 , . The general scheme for updating action probabilities is as follows:
If α(n) = α
i
when β = 0
p n p n g p n j i
p n p n g p n
j j i
i i k
k
k i
r
( + ) = ( ) ( ( )) for all
( + ) = ( ) ( ( ))
1
1
1
− ≠
+
·


(3.20)
when β = 1
p n p n h p n j i
p n p n h p n
j j i
i i k
k
k i
r
( + ) = ( ) ( ( )) for all
( + ) = ( ) ( ( ))
1
1
1
+ ≠

·


where g
k
and h
k
(k=1,2,...,r) are continuous, nonnegative functions with the following
assumptions:
0
0 1
1
< <
< + <
·


g p n p n
p n h p n
k k
k k
k
k i
r
( ( )) ( )
[ ( ) ( ( ))]
(3.21)
for all i=1,2,...,r and all probabilities p
k
in the open interval (0,1). The two constraints on update
functions guarantees that the sum of all action probabilities is 1 at every time step.
3.5 Reinforcement Schemes
The reinforcement scheme is the basis of the learning process for learning automata. These
schemes are categorized based on their linearity. A variety of linear, nonlinear and hybrid
schemes exists for variable structure learning automata [Narendra89]. The linearity characteristic
of a reinforcement scheme is defined by the linearity of the update functions g
k
and h
k
(see the
Cem Ünsal Chapter 3. Stochastic Learning Automata 38
examples in the next section). It is also possible to divide all possible algorithms for stochastic
learning automata into the following two classes [Najim94]:
• nonprojectional algorithms, where individual probabilities are updated based on their previous
values:
( ) p n p n k p n n n
i i i i
( ) ( ) ( ), ( ), ( ) + · + ⋅ 1 δ α β (3.22)
where δ is a small real number and k
i
are the mapping functions for the update term.
• projectional algorithms, where the probabilities are updated by a function which maps the
probability vector to a specific value for each element:
( ) p n k p n n n
i i
( ) ( ), ( ), ( ) + · 1 α β (3.23)
where the functions k
i
are the mapping functions.
The main difference between these two subclasses is that nonprojectional algorithms can
only be used with binary environment response (i.e., in a P-model environment). Projectional
algorithms may be used in all environment types and are more complex. Furthermore,
implementation of projectional algorithms are computationally more tasking.
Early studies of reinforcement schemes were centered around linear schemes for reasons
of analytical simplicity. The need for more complex and efficient reinforcement schemes
eventually lead researchers to nonlinear (and hybrid) schemes. We will first introduce several
well-known linear schemes, and then close this section with absolutely expedient nonlinear
schemes.
3.5.1 Linear Reinforcement Schemes
For an r-action learning automaton, the general definition of linear reinforcement schemes can be
obtained by the substitution of the functions in Equation 3.20 as follows:
g p n a p n
h p n
b
r
b p n
a b
k k
k k
( ( )) ( )
( ( )) ( )
,
· ⋅
·

− ⋅
< <
1
0 1
(3.24)
Therefore, the scheme corresponding to general linear schemes is given as:
If α(n) = α
i
,
when β = 0
p n a p n j i
p n p n a p n
j j
i i
( + ) = ( ) ( ) for all
( + ) = ( )
i
1 1
1 1
− ⋅ ≠
+ ⋅ −
¹
'
¹
[ ( )]
when β = 1
p n
b
r
b p n j i
p n b p n
j j
i i
( + ) = +( ) ( ) for all
( + ) = ( ) ( )
1
1
1
1 1

− ⋅ ≠
− ⋅
¹
'
¹
¹
¹
(3.25)
As seen from the definition, the parameter a is associated with reward response, and the
parameter b with penalty response. If the learning parameters a and b are equal, the scheme is
Cem Ünsal Chapter 3. Stochastic Learning Automata 39
called the linear reward-penalty scheme L
R-P
[Bush58]. In this case, the update rate of the
probability vector is the same at every time step, regardless of the environment response. This
scheme is the earliest scheme considered in mathematical psychology.
For the linear reward-penalty scheme L
R-P
, E[p(n)], the expected value of the probability
vector at time step n, can be easily evaluated, and, by analyzing eigenvalues of the resulting
difference equation, it can be shown that asymptotic solution of the set of difference equations
enables us to conclude [Narendra89]:
lim [ ( )]
n
k
k
r
k
k
r
o
E M n
r
c
c
r
M
→∞
·
·
· < ·


1
1
1
(3.26)
Therefore, from Definition 3.1, the multi-action automaton using the L
R-P
scheme is expedient for
all initial action probabilities and in all stationary random environments.
Expediency is a relatively weak condition on the learning behavior of a variable-structure
automaton. An expedient automaton will do better than a pure chance automaton, but it is not
guaranteed to reach the optimal solution. In order to obtain a better learning mechanism, the
parameters of the linear reinforcement scheme are changed as follows: if the learning parameter b
is set to 0, then the scheme is named the linear reward-inaction scheme L
R-I
. This means that the
action probabilities are updated in the case of a reward response from the environment, but no
penalties are assessed.
For this scheme, it is possible to show that p
l
(n), the probability of the action α
l
with
minimum penalty probability c
l
, monotonically approaches 1. By making the parameter a
arbitrarily small, we have Pr lim
n→∞
p
l
(n) · 1
{ }
as close to unity as desired. This makes the learning
automata ε -optimal [Narendra89].
3.5.2 Nonlinear Learning Algorithms: Absolutely Expedient Schemes
Although early studies of automata learning were done mostly on linear schemes, a few attempts
were made to study nonlinear schemes [Vor65, Chand68, Shapiro69, Vis72]. Evolution of early
reinforcement schemes occurred mostly in a heuristic manner. For two-action automata, design
and evaluation of schemes was almost straight forward, since actually only one action was being
updated. Generalization of such schemes to multi-action case was not straightforward
[Narendra89]. This problem led to a synthesis approach toward reinforcement schemes.
Researchers started asking the question: “What are the conditions on the updating functions that
guarantee a desired behavior?” The problem viewed in the light of this approach resulted in a new
concept of absolute expediency. While searching for a reinforcement scheme that guarantees
optimality and/or absolute expediency, researchers eventually started considering schemes that
can be characterized as nonlinear. The update functions g
k
and h
k
are nonlinear functions of
Cem Ünsal Chapter 3. Stochastic Learning Automata 40
penalty probabilities. For example, the following update functions are defined by previous
researchers for nonlinear reinforcement schemes
7
:
• g p n h p n
a
r
p n p n
j j i i
( ( )) ( ( )) ( )( ( )) · ·


1
1
This scheme is absolutely expedient for restricted initial conditions and for restricted
environments [Shapiro69].

g p n p n p n
h p n
p n p n
r
j j
j
i i
( ( )) ( ) ( ( ))
( ( ))
( ) ( ( ))
· −
·


¹
'
¹
¹
¹
φ
φ
1
whereφ( ) x ax
m
· (m = 2,3,…). Again, for expediency several conditions on the penalty
probabilities must be satisfied (e.g., c
m
l
<
1
and c
m
j
>
1
for j ≠ l). [Chand68]

g p n a p n p n p p n
h p n b p n p n p p n
j j j
j j j
( ( )) ( ( ), ( )) ( ( ))
( ( )) ( ( ), ( )) ( ( ))
· − −
· − −
¹
'
¹
¹
¹
+
+
φ
φ
θ θ
θ θ
1 1
1
1 1
1
1 1
1 1
where φ φ ( , ) ( , ) p p p p
1 1 1 1
1 1 − · − is a nonlinear function which can be suitably selected, and
θ ≥ 1. Parameters a and b must be chosen properly to satisfy the conditions in Equation
3.21. Sufficient conditions for ensuring optimality is given in [Vor65], but they are valid only
for two-action automata.
The general solution for absolutely expedient schemes is found in early 1970’s by
Lakshmivarahan and Thathachar [Lakshmivarahan73]. Absolutely expedient learning schemes are
presently the only class of schemes for which necessary and sufficient conditions of design are
available. This class of schemes can be considered as the generalization of the L
R-I
scheme.
Consider the general reinforcement scheme in Equation 3.20. A learning automaton using this
scheme is absolutely expedient if and only if the functions g
k
(p) and h
k
(p) satisfy the following
conditions:
g p
p
g p
p
g p
p
p
h p
p
h p
p
h p
p
p
r
r
r
r
1
1
2
2
1
1
2
2
( ) ( ) ( )
( )
( ) ( ) ( )
( )
· ·⋅ ⋅ ⋅ · ·
· ·⋅ ⋅ ⋅ · ·
λ
µ
(3.27)
where λ(p) and µ(p) are arbitrary functions satisfying
8
:

7
Again, α α ( ) n
i
· is assumed. We give only the updating functions for action probabilities other than the current
action; the function for current action can be obtained by using the fact that the sum of probabilities is equal to 1.
8
The reason for the conditions on the update functions will be explained in detail in Chapter 6.
Cem Ünsal Chapter 3. Stochastic Learning Automata 41
0 1
0
1 1
< <
< <

¹
'
¹
¹
¹
¹
;
¹
¹
¹ ·
λ
µ
( )
( ) min
,...,
p
p
p
p j r
j
j
(3.28)
Detailed proof of this theorem is given in both [Baba84] and [Narendra89]. A similar proof (for a
new nonlinear absolutely expedient scheme) is given in Chapter 6.
3.6 Extensions of the Basic Automata-Environment Model
3.6.1 S-Model Environments
Up to this point we based our discussion of learning automata only on the P-model environment,
where the input to the automaton is either 0 or 1. We now consider the possibility of input
values over the unit interval [a, b]. Narendra and Thathachar state that such a model is very
relevant in control systems with a continuous valued performance index. In previous sections, the
algorithms (for P-model) were stated in terms of the functions g
k
and h
k
which are the reward and
penalty functions respectively. Since in S- (and Q-) models the outputs lie in the interval [0,1],
and therefore are neither totally favorable or totally unfavorable, the main problem is to
determine how the probabilities of all actions are to be updated. Narendra and Thathachar
[Narendra89] show that all the principal results derived for the P-model carry over the more
realistic S- and Q-models also. Here, we will not discuss the Q-model; all definitions for the Q-
model are similar to the P-model.
In the case where the environment outputs β
k
are not in the interval [0,1], but in [a, b] for
some a ,b ∈ IR, it is always possible to map the output into the unit interval using:
β
β
k
k
a
b a
·


(3.29)
where a = min
i

i
} and b = max
i

i
}
9
. Therefore, only the normalized S-model will be
considered.
The average penalty M(n) is still defined as earlier, but instead of c
i
∈{0,1}, s
i
∈[0,1] are
used in the definition:
M n E n p n
E n p n n n
s p
k
k
r
k
k k
k
r
( ) [ ( ) ( )]
[ ( ) ( ), ( ) ] Pr[ ( ) ]
·
· · ⋅ ·
·
·
·


β
β α α α α
1
1
(3.30)
where:
s E n n
k k
· · [ ( ) ( ) ] β α α (3.31)

9
Note that this normalization procedure requires the knowledge of a and b.
Cem Ünsal Chapter 3. Stochastic Learning Automata 42
In the S-model, s
k
plays the same role as the penalty probabilities in the P-model, and are
called penalty strengths. Again, for a pure-chance automaton with all the action probabilities
equal, the average penalty M
o
is given by:
M
r
s
o i
i
r
·
·

1
1
(3.32)
The definitions of expediency, optimality, ε -optimality, and absolute expediency are the
same, except the fact that c
i
’s are replaced by s
i
’s. Now, consider the general reinforcement
scheme (Equation 3.20). The direct generalization of this scheme to S-model gives:
p n p n n g p n n h p n if n
p n p n n g p n n p h p n if n
i i i i i
i i j
j i
i j
j i
i
( ) ( ) ( ( )) ( ( )) ( ) ( ( )) ( )
( ) ( ) ( ( )) ( ( )) ( ) ( ( )) ( )
+ · − − + ≠
+ · + − − ·
≠ ≠
∑ ∑
1 1
1 1
β β α α
β β α α


(3.33)
The previous algorithm (Equation 3.20) can easily be obtained from the above definition
by substituting β( ) n · 0 and 1. Again, the functions g
i
and h
i
can be associated with reward and
penalty respectively. The value 1− β( ) n is an indication of how far β( ) n is away from the
maximum possible value of unity, and is maximum when β( ) n is 0. The functions g
i
and h
i
have to
satisfy the conditions for Equation 3.21 for p
i
(n) to remain in the interval [0,1] at each step. The
non-negativity condition on g
i
and h
i
maintains the reward-penalty character of these functions,
though it is not necessary.
For example the L
R-P
reinforcement scheme (Equation 3.25) can be rewritten for S-model
(SL
R-P
) as:
p n p n n a p n n
a
r
a p n if n
p n p n n a p n n a p n if n
i i i i
i i i i i
( ) ( ) [ ( )] ( ) ( ) [ ( )] ( )
( ) ( ) [ ( )] [ ( )] ( ) ( )
+ · − − ⋅ ⋅ + ⋅

− ⋅ ≠
+ · + − ⋅ ⋅ − − ⋅ ⋅ ·
1 1
1
1 1 1
β β α α
β β α α

( )
(3.35)
Similarly, the asymptotic value of the average penalty is:
lim [ ( )]
n
i
k
r
E M n
r
s
→∞
·
·

1
1
(3.36)
Thus, the SL
R-P
scheme has analogous properties to the L
R-P
scheme. It is expedient in all
stationary random environments and the limiting expectations of the action probabilities are
inversely proportional to the corresponding s
i
. The conditions of Equations 3.27 and 3.28 for
absolute expediency also applies to updating algorithms for S-model environment.
3.6.2 Nonstationary Environments
Thus far we have considered only the automaton in a stationary environment. However, the need
for learning and adaptation in systems is mainly due to the fact that the environment changes
Cem Ünsal Chapter 3. Stochastic Learning Automata 43
with time. The performance of a learning automaton should be judged in such a context. If a
learning automaton with a fixed strategy is used in a nonstationary environment, it may become
less expedient, and even non-expedient. The learning scheme must have sufficient flexibility to
track the better actions. The aim in these cases is not to evolve to a single action which is
optimal, but to choose the actions to minimize the expected penalty. As in adaptive control
theory, the practical justification for the study of stationary environments is based on the
assumption that if the convergence of a scheme is sufficiently fast, then acceptable performance
can be achieved in slowly changing environments.
An environment is nonstationary if the penalty probabilities c
i
(or s
i
) vary with time. The
success of the learning procedure then depends on the environment changes as well as the prior
information we have about the environment. From an analytic standpoint, the simplest case is the
one in which penalty probability vector c(n) can have a finite number of values. We can visualize
this situation as an automaton operating in a finite set of stationary environments, and assign a
different automaton A
i
to each environment E
i
. If each automaton uses an absolutely expedient
scheme, then ε -optimality can be achieved asymptotically for the overall system. However, this
method requires that the system be aware of the “environmental changes” in order to assign the
“correct” automaton, and the number of environments must not be large. Furthermore, since the
automaton A
i
is updated only when the environment is E
i
, updating strategies may happen
infrequently so that the corresponding action probabilities do not converge [Narendra89]. In our
application, a similar approach is employed as explained in Chapter 4.
There are two classes of environments that have been analytically investigated:
Markovian switching environments, and state-dependent nonstationary environments. If we
assume that the environments E
i
are states of a Markov chain described by a state transition
matrix, then it is possible to view the variable-structure automaton in the Markovian switching
environment as a homogeneous Markov chain. Therefore, analytical investigations are possible.
Secondly, there are four different situations wherein the states of the nonstationary environment
vary with the step n, and analytical tools developed earlier are adequate to obtain a fairly
complete descriptions of the learning process. These are:
w Fixed-structure automata in Markovian environments
w Variable-structure automata in state-dependent nonstationary environments
w Hierarchy of automata
w Nonstationary environments with fixed optimal actions
The fundamental concepts such as expediency, optimality, and absolute expediency have
to be re-examined for nonstationary environments, since, for example, the optimal action can
change with time. Here, we again consider the P-model for purposes of stating the definitions.
The average penalty M(n) becomes:
M n c n p n
i i
i
r
( ) ( ) ( ) ·
·

1
(3.37)
Cem Ünsal Chapter 3. Stochastic Learning Automata 44
The pure-chance automaton has an average penalty M
o
as a function of time:
M n
r
c n
o i
i
r
( ) ( ) ·
·

1
1
(3.38)
Using these descriptions, we now have the following:
Definition 3.5: A learning automaton in nonstationary environment is expedient if there exists a
n
o
such that for all n > n
o
, we have:
E M n M n
o
[ ( )] ( ) − < 0 (3.39)
The previous concept of optimality can be applied to some specific nonstationary
environments. For example, if there exists a fixed l such that c
l
(n) < c
i
(n) for all i = 1,2,...,r, i ≠ l,
and for all n (or at least for all n > n
1
for some n
1
), then α
l
is the optimal action; the original
definitions of optimality andε -optimality still hold. In general, we have:
Definition 3.6: In a nonstationary environment where there is no fixed optimal action, an
automaton can be defined as optimal if it minimizes:
lim [ ( )]
T
n
T
T
E n
→∞
·

1
1
β (3.40)
The definition of absolute expediency can be retained as before, with time-varying
penalty probabilities and the new description in Equation 3.37.
3.6.3 Multi-Teacher Environments
All the learning schemes and behavior norms discussed up to this point have dealt with a single
automaton in a single environment. However, it is possible to have an environment in which the
actions of an automaton evoke a vector of responses due to multiple performance criteria. We
will describe such environments as multi-teacher environments. Then, the automata have to
“find” an action that satisfies all the teachers.
In a multi-teacher environment, the automaton is connected to N teachers (or N single-
teacher environments). Then, a single-teacher environment is described by the action set α, the
output set β
i
={0,1} (for a P-model), and a set of penalty probabilities { } c c c
i i
r
i
1 2
, ,..., where
{ }
c n n
j
i i
j
· · · Pr ( ) ( ) β α α 1 . The action set of the automaton is of course the same for all
teachers/environments. Baba discussed the problem of a variable-structure automaton operating
in a multi-teacher environment [Baba85]. Conditions for absolute expediency are given in his
works (see also Chapter 6 for new nonlinear reinforcement schemes).
Some difficulty may arise while formulating a mathematical model of the learning
automaton in a multi-teacher environment. Since we have multiple responses, the task of
“interpreting” the output vector is important. Are the outputs from different teachers to be
summed after normalization? Can we introduce weight factors associated with specific teachers?
Cem Ünsal Chapter 3. Stochastic Learning Automata 45
If the environment is of a P-model, how should we combine the outputs? In the simplest case
that all teachers agree on the ordering of the actions, i.e., one action is optimal for all
environments, the updating schemes are easily defined. However, this will not be the case in our
application.
The elements of the environment output vector must be combined in some fashion to
form the input to the automaton. One possibility is to use to use an AND-gate as described in
[Thathachar77]. Another method is taking the average, or weighted average of the all responses
[Narendra89, Baba85]. This is valid for Q- or S-model environments. Our application uses logic
gates (OR) and additional if-then conditions as described in Chapters 4 and 5.
3.6.4 Interconnected Automata
Although we based our discussion to a single automaton, it is possible that there are more than
one automata in an environment. If the interaction between different automata is provided by the
environment, the case of multi-automata is not different than a single automaton case. The
environment reacts to the actions of multiple automata, and the environment output is a result of
the combined effect of actions chosen by all automata. If there is direct interaction between the
automata, such as the hierarchical (or sequential) automata models, the actions of some automata
directly depend on the actions of others.
It is generally recognized that the potential of learning automata can be increased if
specific rules for interconnections can be established. There are two main interconnected
automata models: synchronous and sequential. An important consequence of the synchronous
model is that the resulting configurations can be viewed as games of automata with particular
payoff structures. In the sequential model, only one automaton acts at each time step. The action
chosen determines which automaton acts at the next time step. Such a sequential model can be
viewed as a network of automata in which control passes from one automaton to another. At
each step, one decision maker controls the state transitions of the decision process, and the goal
of the whole group is to optimize some overall performance criterion.
In Chapters 4 and 5, the interconnection between automata is considered. Since each
vehicle’s planning layer will include two automata — one for lateral, the other for longitudinal
actions — the interdependence of these two sets of actions automatically results in an
interconnected automata network. Furthermore, multiple autonomous vehicles with multiple
automata as path controllers constitute a more complex automata system as described in Chapter
7.

Their book Learning Automata [Narendra89] is an introduction to learning automata theory which surveys all the research done on the subject until the end of the 1980s. 3. Tsypkin [Tsypkin71] introduced a method to reduce the problem to the determination of an optimal set of parameters and then apply stochastic hill climbing techniques. Fu71] were the first researchers to introduce stochastic automata into the control literature. or the reinforcement learning must be combined with an adaptive forward model that anticipates the changes in the environment [Peng93]. toward a final goal. the goal of a learning system is the optimization of a functional not known explicitly [Narendra74]. A stochastic automaton acting as described to improve its performance is called a learning automaton. Tsetlin [Tsetlin73] introduced deterministic automaton operating in random environments as a model of learning. instruction. action probabilities are updated based on that response. Most of the works done on learning automata theory has followed the trend set by Tsetlin. Fu69b. In a purely mathematical context. Applications to parameter estimation. Learning1 is defined as any permanent change in behavior as a result of past experience. The difference between the two approaches is that the former updates the parameter space at each iteration while the later updates the probability space.’ . Narendra and Thathachar have studied the theory and applications of learning autoamta and carried out simulation studies in the area. is to regard the problem as one of finding an optimal action out of a set of allowable actions and to achieve this using stochastic automata. Stochastic Learning Automata 30 application). nonstationary environments and games of automata. One action is selected at random. and a learning system should therefore have the ability to improve its behavior with time. M. or experience. Properties of linear updating schemes and the concept of a ‘growing’ automaton are defined by McLaren [McLaren66]. [Atkinson65]. Fu and colleagues [Fu65a. Fu69a. Y.1 Earlier Works The first learning automata models were developed in mathematical psychology.Chand69] studied nonlinear updating schemes. introduced by Narendra and Viswanathan [Narendra72]. Varshavski and Vorontsova [Varshavski63] described the use of stochastic automata with updating of action probabilities which results in reduction in the number of states in comparison with deterministic automata. In the 1960’s. and the procedure is repeated. Early research in this area is surveyed by Bush and Mosteller [Bush58] and Atkinson et al. Chandrasekaran and Shen [Chand68. pattern recognition and game theory were initially considered by this school. Fu67. Z. equal probabilities are attached to all the actions). 1 Webster defines ‘learning’ as ‘to gain knowledge or understanding of a skill by study. the response from the environment is observed.Cem Ünsal Chapter 3. The stochastic automaton attempts a solution of the problem without any information on the optimal action (initially.L. Fu65b. Tsetlin and colleagues [Tsetlin73] started the work on learning automata during the same period. An alternative approach to applying stochastic hill-climbing techniques.

or the decision maker may be a part of a hierarchical decision structure but unaware of its role in the hierarchy.2 The Environment and the Automaton The learning paradigm the learning automaton presents may be stated as follows: a finite number of actions can be performed in a random environment. [Najim94]. The input β(n) from the environment is an element of the set β and can 2 Commonly refers to the aggregate of all the external conditions and influences affecting the life and development of an organism. This book also includes several applications and examples of learning automata. and path planning [Tsoularis93] and action selection [Aoki95] for autonomous mobile robots. [Papadimitriou94]. Consequently. [Sastry93]. the uncertainty may be due to the fact that the output of the environment is influenced by the actions of other agents unknown to the decision maker. Stochastic Learning Automata 31 The most recent (and second most comprehensive) book since the Narendra-Thathachar collaboration on learning automata theory and applications is published by Najim and Pznyak in 1994 [Najim94]. and is applied to the environment at time t = n. The output (action) α(n) of the automaton belongs to the set α. distributed fuzzy logic processor training [Ikonen97]. When a specific action is performed the environment provides a random response which is either favorable or unfavorable. Nario Baba’s work [Baba85] on learning behaviors of stochastic automata under a nonstationary multi-teacher environment. which is probabilistically related to the automaton action. Mathematically. successful updating algorithms and theorems for advanced automata applications were not readily available. [Rajaraman96]. [Najim96]. where each element ci corresponds to one action α i of the set α. The important point to note is that the decisions must be made with very little knowledge concerning the “nature” of the environment. Recent theoretical results on learning algorithms and techniques can be found in [Najim91b]. an environment is represented by a triple {α. The term environment2 is not easy to define in the context of learning automata. path planning for manipulators [Naruse93]. bioreactors [Gilbert92]. β} where α represents a finite action/output set. .Cem Ünsal Chapter 3. c. 95]. graph partitioning [Oommen94b]. The environment in which the automaton “lives” responds to the action of the automaton by producing a response. and [Poznyak96]. Until recently. and general linear reinforcement schemes [Bush58] are taken as a starting point for the research presented in this dissertation. It may have timevarying characteristics. control of manufacturing plants [Sequeria91]. belonging to a set of allowable responses. Furthermore. The definition encompasses a large class of unknown random media in which an automaton can operate. 3. pattern recognition [Oommen94a]. The objective in the design of the automaton is to determine how the choice of the action at any stage should be guided by past actions and responses. active vehicle suspension [Marsh93. and c is a set of penalty probabilities. β represents a (binary) input/response set. the applications of learning automata to control problems were rare. Recent applications of learning automata to real life problems include control of absorption columns [Najim91].

the state φ(n) is an element of the finite set Φ = {φ1. while output of 0 means the action is “favorable. the values β i are chosen to be 0 and 1.•)} where: w Φ is a set of internal states. the response value of 1 corresponds to an “unfavorable” (failure. is an element of the finite set α = {α 1.. In the simplest case..1) Environment Penalty probabilities c Figure 3. There are several models defined by the response set of the environment. β.) . The automaton and the environment.1.. αr} w β is a set of responses (or inputs from the environment). Stochastic Learning Automata 32 take on one of the values β 1 and β 2. In this simplest case.2) 3 We will follow the notation used in [Narendra89]. the model is named S-model. are referred to as P-models3. The output or action of an automaton an the instant n. penalty) response. Such models are called Q-models..” A further generalization of the environment allows finite response sets with more than two elements that may take finite number of values in an interval [a. φs} w α is a set of actions (or outputs of the automaton).. such as an interval on the real line: β = {β 1 . H(•. the environment is called a stationary environment... φ2. . b].Cem Ünsal Chapter 3... F(•. The input from the environment β(n) is an element of the set β which could be either a finite set or an infinite set.•)... denoted by α(n)... At any instant n. b]. State φ Transition Function F Input β Output Function G Automaton Output α { }= c i (i = 12 . The automaton can be represented by a quintuple {Φ. (3. Models in which the input from the environment can take only one of two values. The elements of c are defined as: Prob β (n) = 1 α (n) = α i Therefore ci is the probability that the action α i will result in a penalty input from the environment. β m } or β = {(a. 0 or 1.. α.. where 1 is associated with failure/penalty response. When the penalty probabilities ci are constant. b)} (3. β 2 . α 2. When the input from the environment is a continuous random variable with possible values in an interval [a.

3 The Stochastic Automaton We now introduce the stochastic automaton in which at least one of the two mappings F and G is stochastic. 1] in most cases. β 2 . the elements fijβ of F represent the probability that the automaton moves from state φi to state φj following an input β 4: f β ij = Pr{φ (n + 1) = φ j φ (n) = φ i .. j = 12 . b]5. In this case. 3. β = β 1 . 4 .4) For our applications.β 2 . we choose the function G(•) as the identity function. β (n)] (3. (3..•): Φ × β → Φ is a function that maps the current state and input into the next state. and to conserve probability measure we must have: β ∑ f ij = 1 j =1 s for each β and i.. the automaton is referred to as state-output automaton.7) Example States: φ1. b]=[0..   075 025 Transition matrices: In Chapter 4..e. φ2 Inputs: β 1 . which can be either deterministic or stochastic: α (n) = G[φ ( n)] (3. If the transition function F is stochastic. If the current output depends on only the current state.6) Since fijβ are probabilities.5) For the mapping G.•): Φ × β → α is a function that maps the current state and input into the current output. β (n) = β } i . s . (3. 09  .. 050 050  F( β 1 ) =  . 5 [a.•) is replaced by an output function G(•): Φ → α. the definition is similar: g ij = Pr{α (n) = α j φ (n) = φ i } i . the function H(•. the output mapping G is chosen as identity mapping..  01 F( β 2 ) =  . i. F can be deterministic or stochastic: φ (n + 1) = F[φ (n). the probabilities fij β will be the same for all inputs β and all states i given a state φj ..Cem Ünsal Chapter 3. ..  025 075  .. they lie in the closed interval [a.. . .. j = 12 . where we define our initial algorithm.3) w H(•.. the states of the automata are also the actions. β m (3. r . Stochastic Learning Automata 33 w F(•. Furthermore.

it is useful to update the transition probabilities at each step n on the basis of the environment response at that instant. and only consider the transition matrix F. Therefore. a vector of action probabilities p(n) is defined to describe the reinforcement schemes — as we introduce in the next sections. [ ] 3. its performance must be superior to “intuitive” methods. If the automaton is “learning” in the process. Stochastic Learning Automata 34 Transition Graphs: 0. we will introduce the definitions necessary to evaluate the performance of a learning variable-structure automaton.8)   f 21 f 22  • Furthermore. independent of the time step n. and the input sequence. Furthermore.10 Φ1 0.25 Φ2 0. the transition matrix reduces to: F = f 11 = f 21 ≡ f 1 f 12 = f 22 ≡ f 2 ≡ p (3.75 Φ2 0. i.75 In the previous example.50 Φ1 0. As we will discuss later. For example. . we must have:  f 11 f 12  F (β1 ) = F ( β2 ) ≡ F =  (3. Such a stochastic automaton is referred to as a fixed-structure automaton. Such an automaton is called variable-structure automaton. A learning automaton generates a sequence of actions on the basis of its interaction with the environment. If the variable-structure automaton is a state-output automaton.Cem Ünsal Chapter 3. and the probabilities of transition from one state to another fijβ do not depend on current state and environment input. then the relation between the action probability vector and the transition matrices is as follows: • Since the automaton is a state-output automaton. therefore there is only one state transition matrix F where the elements are fij.4 Variable Structure Automaton and Its Performance Evaluation Before introducing the variable structure automata. we omit the transition matrix G.. the above definitions of the transition functions F and G are not used explicitly. the conditional probabilities fijβ were assumed to be constant.e.9) th where vector p consists of probabilities pi of the automaton being in the i state (or choosing the ith output/action). Instead of transition matrices. • The transition does not depend on the environment response. the probability of being in one state (or generating one action) does not depend on the initial/previous state of the automaton.90 β = β2 0.50 β = β1 0.25 0. in the case of variable-structure automaton.

r .4. c . (3. Stochastic Learning Automata 35 To judge the performance of the automaton. 3. Any learning automaton must at least do better than a pure-chance automaton. there is no basis in which the different actions α i can be distinguished. i =1 i =1 r r Definition 3. E[M(n)] is the average input to the automaton. We define a quantity M(n) as the average { } penalty for a given action probability vector: M ( n) = E[ β (n ) p(n)] = Pr{β (n) = 1 p( n)} = ∑ Pr{β (n) = 1α (n ) = α i } ⋅ Pr{α (n) = α i } i =1 r r (3.e...14) . c } where 1 2 r ci = Pr β ( n) = 1 α (n) = α i .11) = ∑ ci pi ( n) i =1 For the pure-chance automaton. the action probability vector p(n) = Pr {α(n) = α i } is given by: 1 pi (n ) = i = 12 .12) E [ M (n)] = E {E[ β (n ) p(n)]} (3...Cem Ünsal Chapter 3.... even in the simplest P-model and stationary random environments.1: A learning automaton is expedient if lim E [ M (n)] < M o . all action probabilities would be equal — a “pure chance” situation. Consider a stationary random environment with penalty probabilities {c .2: A learning automaton is said to be optimal if: lim E [ M (n)] = cl n→ ∞ (3.13) = E [β (n)] i. we will consider this simplest case. Using above definitions. To introduce the definitions for “norms of behavior”. In such a case. Further definitions for other models and nonstationary environments will be given whenever necessary.10) r Such an automaton is called “pure chance automaton.. we have the following: Definition 3. n→ ∞ Since ∑ pi ( n) = 1. For an r-action automaton.” and will be used as the standard for comparison. M(n) is a constant denoted by Mo: Mo = Also note that: 1 r ∑ ci r i =1 (3. The quantitative basis for assessing the learning behavior is quite complex.. we need to set up quantitative norms of behavior. we can write inf M (n) = inf p (n ) {∑ ci pi (n )} = mini {ci } ≡ c l .1 Norms of Behavior If no prior information is available.

In such an environment.. we assume that each state 6 We will discuss the feasibility of the case later.17) In the application of learning automata techniques to intelligent control. we may be interested in automata that exhibit a desired behavior in arbitrary environments and for arbitrary initial conditions. which is defined as: Definition 3. In spite of efforts of many researchers. all pi(n) ∈ (0. It may not be possible to achieve optimality in every given situation..4: A learning automaton is absolutely expedient if: E[M(n+1) | p(n)] < M(n) (3..2 Variable Structure Automata A more flexible learning automaton model can be created by considering more general stochastic systems in which the action probabilities (or the state transitions) are updated at every stage using a reinforcement scheme.3: A learning automaton is e-optimal if: lim E [ M ( n)] = cl + ε n →∞ (3. These requirements are partially met by an absolutely expedient automaton. For simplicity. we can further see that: E [ M (n + 1)] < E[ M (n )] (3. Stochastic Learning Automata 36 Optimality implies that the action α l is associated with the minimum penalty probability cl is chosen asymptotically with probability one. Kushner72. a suboptimal behavior is defined. a learning automaton is expected to reach a “pure optimal” strategy [Najim94]: p( n) → e r (3. Taking the expectations of both c sides. r i =1 3. Definition 3. . Some automata satisfy the conditions stated in the definitions for specified initial conditions and for certain sets of penalty probabilities. In other words. Narendra89]. we will introduce the general description of a reinforcement scheme. In this case. we are mainly interested in the case where there is one “optimal” action α j with cj = 06. the automaton will be optimal since E [M (n) ]= ∑ ci pi (n) → cl . However. linear and nonlinear.4. 1) and for all possible sets { 1 . In this section. the general algorithm which ensures optimality has not been found [Baba83. c 2 .Cem Ünsal Chapter 3.15) where e is arbitrarily small positive number. cr } .16) for all n.18) j where erj is the unit vector with jth component equal to 1..

The simulation examples in later chapters also use the same identity state-output mapping. the automaton is a state-output automaton.1).e.2. If p(n+1) is a linear function of p(n). The two constraints on update functions guarantees that the sum of all action probabilities is 1 at every time step. a reinforcement scheme can be represented as follows: p( n + 1) = T1 [ p (n)..21) for all i=1. i. φ (n).. 3. otherwise it is termed nonlinear.. The simplest case Consider a variable-structure automaton with r actions in a stationary environment with β = { .2. From now on.19a) for reinforcement schemes. The linearity characteristic of a reinforcement scheme is defined by the linearity of the update functions gk and hk (see the .. A variety of linear.5 Reinforcement Schemes The reinforcement scheme is the basis of the learning process for learning automata. In general terms.α (n).19a) or β β f ij (n + 1) = T2 [ f ij (n). The general scheme for updating action probabilities is as follows: 0 If α(n) = α i when β = 0 p j (n + 1) = p j (n) − gi ( p( n)) pi ( n + 1) = pi (n) + ∑ g k ( p(n)) k =1 k ≠i r for all j ≠ i (3.r and all probabilities pk in the open interval (0. β is the input from environment.r) are continuous. These schemes are categorized based on their linearity.Cem Ünsal Chapter 3.. nonnegative functions with the following assumptions: 0 < gk ( p(n )) < pk (n) 0 < ∑ [ pk ( n) + hk ( p( n))] < 1 k =1 k ≠i r (3. φ (n + 1) β ( n)] (3. and φ is the state of the automaton. 1} ... Again. nonlinear and hybrid schemes exists for variable structure learning automata [Narendra89]. the reinforcement scheme is said to be linear. Stochastic Learning Automata 37 corresponds to one action..19b) where T1 and T2 are mappings. β (n)] (3.20) p j (n + 1) = p j (n) + hi ( p (n)) when β = 1 pi (n + 1) = pi (n) − ∑ hk ( p(n)) k =1 k ≠i r for all j ≠ i where gk and hk (k=1. we will use the first mathematical description (3. α is the action..

the scheme corresponding to general linear schemes is given as: If α(n) = α i .1 Linear Reinforcement Schemes For an r-action learning automaton.α (n). where the probabilities are updated by a function which maps the probability vector to a specific value for each element: pi (n + 1) = k i (p( n).5. The main difference between these two subclasses is that nonprojectional algorithms can only be used with binary environment response (i.25) when β = 1 As seen from the definition. the general definition of linear reinforcement schemes can be obtained by the substitution of the functions in Equation 3. • projectional algorithms. β ( n)) (3.23) where the functions k i are the mapping functions. implementation of projectional algorithms are computationally more tasking. when β = 0  p j (n + 1) = (1 − a) ⋅ p j (n) for all j ≠ i   pi (n + 1) = pi (n) + a ⋅ [1 − pi (n)] b   p j (n + 1) = +(1 − b ) ⋅ p j (n )  r −1  pi (n + 1) = (1 − b ) ⋅ pi (n)  for all j ≠ i (3. β (n)) (3. where individual probabilities are updated based on their previous values: pi (n + 1) = pi (n) + δ ⋅ k i (pi (n ). and then close this section with absolutely expedient nonlinear schemes. If the learning parameters a and b are equal. the parameter a is associated with reward response.24) r −1 0 < a ..e.α ( n).b < 1 Therefore. Early studies of reinforcement schemes were centered around linear schemes for reasons of analytical simplicity. the scheme is . 3. and the parameter b with penalty response. We will first introduce several well-known linear schemes. Projectional algorithms may be used in all environment types and are more complex. Stochastic Learning Automata 38 examples in the next section).Cem Ünsal Chapter 3. Furthermore. It is also possible to divide all possible algorithms for stochastic learning automata into the following two classes [Najim94]: • nonprojectional algorithms. The need for more complex and efficient reinforcement schemes eventually lead researchers to nonlinear (and hybrid) schemes. in a P-model environment).22) where δ is a small real number and k i are the mapping functions for the update term.20 as follows: g k ( p( n)) = a ⋅ p k (n) b hk ( p(n)) = − b ⋅ p k (n ) (3.

the multi-action automaton using the LR-P scheme is expedient for all initial action probabilities and in all stationary random environments. a few attempts were made to study nonlinear schemes [Vor65. Stochastic Learning Automata 39 called the linear reward-penalty scheme LR-P [Bush58]. Researchers started asking the question: “What are the conditions on the updating functions that guarantee a desired behavior?” The problem viewed in the light of this approach resulted in a new concept of absolute expediency. While searching for a reinforcement scheme that guarantees optimality and/or absolute expediency. Generalization of such schemes to multi-action case was not straightforward [Narendra89].5.Cem Ünsal Chapter 3. the expected value of the probability vector at time step n. the probability of the action α l with minimum penalty probability cl. and. In this case. can be easily evaluated. For two-action automata. The update functions gk and h k are nonlinear functions of .1. Chand68. design and evaluation of schemes was almost straight forward. This means that the action probabilities are updated in the case of a reward response from the environment. For this scheme. { } 3. we have Pr lim pl (n) = 1 as close to unity as desired. the update rate of the probability vector is the same at every time step. This problem led to a synthesis approach toward reinforcement schemes. then the scheme is named the linear reward-inaction scheme LR-I . but it is not guaranteed to reach the optimal solution. it can be shown that asymptotic solution of the set of difference equations enables us to conclude [Narendra89]: lim E[ M(n)] = r k=1 n →∞ ∑ r < ∑ ck k =1 r 1 ck r = Mo (3. For the linear reward-penalty scheme LR-P . This scheme is the earliest scheme considered in mathematical psychology. Expediency is a relatively weak condition on the learning behavior of a variable-structure automaton. since actually only one action was being updated. it is possible to show that pl(n). This makes the learning n →∞ automata ε -optimal [Narendra89]. by analyzing eigenvalues of the resulting difference equation.26) Therefore. but no penalties are assessed. Evolution of early reinforcement schemes occurred mostly in a heuristic manner. Shapiro69. By making the parameter a arbitrarily small. regardless of the environment response.2 Nonlinear Learning Algorithms: Absolutely Expedient Schemes Although early studies of automata learning were done mostly on linear schemes. the parameters of the linear reinforcement scheme are changed as follows: if the learning parameter b is set to 0. researchers eventually started considering schemes that can be characterized as nonlinear. In order to obtain a better learning mechanism. monotonically approaches 1. Vis72]. E[p(n)]. An expedient automaton will do better than a pure chance automaton. from Definition 3.

The general solution for absolutely expedient schemes is found in early 1970’s by Lakshmivarahan and Thathachar [Lakshmivarahan73].Cem Ünsal Chapter 3. This class of schemes can be considered as the generalization of the LR-I scheme. [Chand68] m m g j ( p( n)) = aφ ( p1 ( n). Again. Parameters a and b must be chosen properly to satisfy the conditions in Equation 3. Stochastic Learning Automata 40 penalty probabilities.1 − p1 (n )) p j (1 − p j (n)) θ  • • where φ ( p1 . For example. Absolutely expedient learning schemes are presently the only class of schemes for which necessary and sufficient conditions of design are available. Sufficient conditions for ensuring optimality is given in [Vor65]. 7 . the function for current action can be obtained by using the fact that the sum of probabilities is equal to 1.3. A learning automaton using this scheme is absolutely expedient if and only if the functions gk (p) and hk (p) satisfy the following conditions: g1 ( p) g2 ( p) g ( p) = =⋅⋅⋅= r = λ ( p) p1 p2 pr (3. cl < and c j > for j ≠ l). α (n) = α i is assumed.27) h1 ( p) h2 ( p) hr ( p) = =⋅⋅⋅ = = µ ( p) p1 p2 pr where λ(p) and µ(p) are arbitrary functions satisfying 8: Again. for expediency several conditions on the penalty 1 1 probabilities must be satisfied (e.g..21. and θ ≥ 1. 8 The reason for the conditions on the update functions will be explained in detail in Chapter 6. but they are valid only for two-action automata. the following update functions are defined by previous researchers for nonlinear reinforcement schemes7: a • g j ( p( n)) = h j ( p(n)) = p ( n)(1 − pi ( n)) r−1 i This scheme is absolutely expedient for restricted initial conditions and for restricted environments [Shapiro69].1 − p1 (n )) p θ+1 (1 − p j (n)) θ  j  θ +1  h j ( p(n)) = bφ ( p1 ( n).1 − p1 ) = φ (1 − p1 .…). p1 ) is a nonlinear function which can be suitably selected.  g j ( p (n)) = p j ( n) − φ ( p(n ))  p (n) − φ ( pi ( n))  h j ( p(n)) = i   r −1 m whereφ( x ) = ax (m = 2.20. Consider the general reinforcement scheme in Equation 3. We give only the updating functions for action probabilities other than the current action.

b ∈ IR. only the normalized S-model will be considered. The average penalty M(n) is still defined as earlier. it is always possible to map the output into the unit interval using: β −a βk = k (3. Stochastic Learning Automata 41 0 < λ ( p) < 1  pj  (3. all definitions for the Qmodel are similar to the P-model. b]. and therefore are neither totally favorable or totally unfavorable. . In previous sections.30) = ∑ sk pk k =1 where: sk = E[β (n) α (n) = α k ] (3. but in [a. b] for some a .1 S-Model Environments Up to this point we based our discussion of learning automata only on the P-model environment. where the input to the automaton is either 0 or 1.. Therefore. Here. A similar proof (for a new nonlinear absolutely expedient scheme) is given in Chapter 6.and Q-models also.Cem Ünsal Chapter 3.31) 9 Note that this normalization procedure requires the knowledge of a and b.r 1 − pj    Detailed proof of this theorem is given in both [Baba84] and [Narendra89].6 Extensions of the Basic Automata-Environment Model 3.. Narendra and Thathachar [Narendra89] show that all the principal results derived for the P-model carry over the more realistic S. In the case where the environment outputs β k are not in the interval [0.1].1] are used in the definition: M ( n) = E[ β (n) p (n)] = ∑ E [β (n) p(n ).6. We now consider the possibility of input values over the unit interval [a. Narendra and Thathachar state that such a model is very relevant in control systems with a continuous valued performance index.1}. the main problem is to determine how the probabilities of all actions are to be updated.α (n) = α k ] ⋅ Pr[α (n) = α k ] k =1 r r (3. we will not discuss the Q-model. but instead of ci ∈{0.(and Q-) models the outputs lie in the interval [0.28)   0 < µ ( p) < min   j=1.. the algorithms (for P-model) were stated in terms of the functions gk and hk which are the reward and penalty functions respectively. 3. si∈[0. Since in S.29) b−a where a = mini{β i} and b = maxi{β i}9.1]..

The direct generalization of this scheme to S-model gives: pi (n + 1) = pi ( n) − (1 − β ( n)) gi ( p(n)) + β ( n)hi ( p (n)) if α ( n) ≠ α i (3. Again.20).Cem Ünsal Chapter 3.32) The previous algorithm (Equation 3.27 and 3. ε -optimality. the SLR-P scheme has analogous properties to the LR-P scheme. Stochastic Learning Automata 42 In the S-model. The non-negativity condition on gi and hi maintains the reward-penalty character of these functions. sk plays the same role as the penalty probabilities in the P-model. the asymptotic value of the average penalty is: r lim E[ M( n)] = r n →∞ ∑ 1si k= 1 (3.21 for pi(n) to remain in the interval [0. though it is not necessary.36) Thus. optimality. for a pure-chance automaton with all the action probabilities equal.33) pi (n + 1) = pi ( n) + (1 − β ( n))∑ g j ( p(n)) − β ( n) pi ∑ h j ( p( n)) if α (n ) = α i j≠ i j ≠i 1 r ∑s r i =1 i (3. 3.2 Nonstationary Environments Thus far we have considered only the automaton in a stationary environment. except the fact that ci’s are replaced by s i’s.28 for absolute expediency also applies to updating algorithms for S-model environment.25) can be rewritten for S-model (SLR-P) as: a pi (n + 1) = pi (n) − [1 − β ( n)] ⋅ a ⋅ pi (n) + β (n) ⋅ [ − a ⋅ p( n)] if α ( n) ≠ α i r −1 (3. the average penalty Mo is given by: Mo = The definitions of expediency. consider the general reinforcement scheme (Equation 3. For example the LR-P reinforcement scheme (Equation 3.6. However.20) can easily be obtained from the above definition by substituting β( n) = 0 and 1. The value 1− β (n ) is an indication of how far β( n) is away from the maximum possible value of unity.35) pi (n + 1) = pi (n) + [1 − β ( n)] ⋅ a ⋅ [1 − pi (n )] − β ( n) ⋅ a ⋅ pi (n) if α (n) = αi Similarly. the functions gi and hi can be associated with reward and penalty respectively. and is maximum when β( n) is 0. The conditions of Equations 3. and are called penalty strengths. Again. The functions gi and hi have to satisfy the conditions for Equation 3. the need for learning and adaptation in systems is mainly due to the fact that the environment changes . It is expedient in all stationary random environments and the limiting expectations of the action probabilities are inversely proportional to the corresponding si.1] at each step. and absolute expediency are the same. Now.

analytical investigations are possible. Stochastic Learning Automata 43 with time. As in adaptive control theory. From an analytic standpoint. this method requires that the system be aware of the “environmental changes” in order to assign the “correct” automaton. If each automaton uses an absolutely expedient scheme. The aim in these cases is not to evolve to a single action which is optimal. and even non-expedient. However. optimality. and assign a different automaton Ai to each environment Ei. and absolute expediency have to be re-examined for nonstationary environments. and the number of environments must not be large. the optimal action can change with time. it may become less expedient.Cem Ünsal Chapter 3. then ε -optimality can be achieved asymptotically for the overall system. We can visualize this situation as an automaton operating in a finite set of stationary environments. The average penalty M(n) becomes: M(n) = ∑ ci (n) pi (n ) i =1 r (3. These are: w Fixed-structure automata in Markovian environments w Variable-structure automata in state-dependent nonstationary environments w Hierarchy of automata w Nonstationary environments with fixed optimal actions The fundamental concepts such as expediency.37) . The success of the learning procedure then depends on the environment changes as well as the prior information we have about the environment. then acceptable performance can be achieved in slowly changing environments. In our application. a similar approach is employed as explained in Chapter 4. the simplest case is the one in which penalty probability vector c(n) can have a finite number of values. the practical justification for the study of stationary environments is based on the assumption that if the convergence of a scheme is sufficiently fast. updating strategies may happen infrequently so that the corresponding action probabilities do not converge [Narendra89]. Secondly. we again consider the P-model for purposes of stating the definitions. If a learning automaton with a fixed strategy is used in a nonstationary environment. since the automaton Ai is updated only when the environment is Ei. Therefore. Furthermore. If we assume that the environments E i are states of a Markov chain described by a state transition matrix. There are two classes of environments that have been analytically investigated: Markovian switching environments. and state-dependent nonstationary environments. for example. An environment is nonstationary if the penalty probabilities ci (or si) vary with time. and analytical tools developed earlier are adequate to obtain a fairly complete descriptions of the learning process. since. Here. but to choose the actions to minimize the expected penalty. The performance of a learning automaton should be judged in such a context. there are four different situations wherein the states of the nonstationary environment vary with the step n. then it is possible to view the variable-structure automaton in the Markovian switching environment as a homogeneous Markov chain. The learning scheme must have sufficient flexibility to track the better actions.

then α l is the optimal action. with time-varying penalty probabilities and the new description in Equation 3. Baba discussed the problem of a variable-structure automaton operating in a multi-teacher environment [Baba85].1} (for a P-model). The action set of the automaton is of course the same for all teachers/environments..Cem Ünsal Chapter 3. it is possible to have an environment in which the actions of an automaton evoke a vector of responses due to multiple performance criteria...39) The previous concept of optimality can be applied to some specific nonstationary environments. we now have the following: (3. Then. Some difficulty may arise while formulating a mathematical model of the learning automaton in a multi-teacher environment.. the automaton is connected to N teachers (or N singleteacher environments).5: A learning automaton in nonstationary environment is expedient if there exists a no such that for all n > no. Since we have multiple responses. an automaton can be defined as optimal if it minimizes: 1 T lim ∑ E[ β ( n)] (3. the original definitions of optimality andε -optimality still hold..40) T →∞ T n =1 The definition of absolute expediency can be retained as before. and for all n (or at least for all n > n 1 for some n1). In general.6: In a nonstationary environment where there is no fixed optimal action. Then. We will describe such environments as multi-teacher environments. the task of “interpreting” the output vector is important. and a set of penalty probabilities c ij = Pr β i ( n) = 1 α ( n) = α j .. the automata have to “find” an action that satisfies all the teachers. However. if there exists a fixed l such that cl(n) < ci(n) for all i = 1.38) Definition 3. c . 3. we have: E[ M(n)] − Mo (n) < 0 (3.6. Stochastic Learning Automata 44 The pure-chance automaton has an average penalty Mo as a function of time: 1 r Mo (n ) = ∑ ci (n) r i =1 Using these descriptions.. Are the outputs from different teachers to be summed after normalization? Can we introduce weight factors associated with specific teachers? { } {c . For example. we have: Definition 3.37.. In a multi-teacher environment.2. a single-teacher environment is described by the action set α. c }where i 1 i 2 i r .3 Multi-Teacher Environments All the learning schemes and behavior norms discussed up to this point have dealt with a single automaton in a single environment. Conditions for absolute expediency are given in his works (see also Chapter 6 for new nonlinear reinforcement schemes).r. the output set β i ={0. i ≠ l.

the actions of some automata directly depend on the actions of others. the interconnection between automata is considered.6. it is possible that there are more than one automata in an environment. This is valid for Q.4 Interconnected Automata Although we based our discussion to a single automaton. An important consequence of the synchronous model is that the resulting configurations can be viewed as games of automata with particular payoff structures.Cem Ünsal Chapter 3. 3. the updating schemes are easily defined. or weighted average of the all responses [Narendra89.or S-model environments. one action is optimal for all environments.e. one decision maker controls the state transitions of the decision process. Baba85]. the case of multi-automata is not different than a single automaton case. and the goal of the whole group is to optimize some overall performance criterion. Another method is taking the average. only one automaton acts at each time step. this will not be the case in our application. However. At each step. The action chosen determines which automaton acts at the next time step. The environment reacts to the actions of multiple automata. In the sequential model. Our application uses logic gates (OR) and additional if-then conditions as described in Chapters 4 and 5. Furthermore. One possibility is to use to use an AND-gate as described in [Thathachar77]. The elements of the environment output vector must be combined in some fashion to form the input to the automaton. the other for longitudinal actions — the interdependence of these two sets of actions automatically results in an interconnected automata network. Stochastic Learning Automata 45 If the environment is of a P-model. If there is direct interaction between the automata.. such as the hierarchical (or sequential) automata models. If the interaction between different automata is provided by the environment. It is generally recognized that the potential of learning automata can be increased if specific rules for interconnections can be established. how should we combine the outputs? In the simplest case that all teachers agree on the ordering of the actions. i. There are two main interconnected automata models: synchronous and sequential. and the environment output is a result of the combined effect of actions chosen by all automata. In Chapters 4 and 5. Such a sequential model can be viewed as a network of automata in which control passes from one automaton to another. . multiple autonomous vehicles with multiple automata as path controllers constitute a more complex automata system as described in Chapter 7. Since each vehicle’s planning layer will include two automata — one for lateral.

Master your semester with Scribd & The New York Times

Special offer for students: Only $4.99/month.

Master your semester with Scribd & The New York Times

Cancel anytime.