## Are you sure?

This action might not be possible to undo. Are you sure you want to continue?

**Stochastic Learning Automata
**

An automaton is a machine or control mechanism designed to automatically follow a

predetermined sequence of operations or respond to encoded instructions. The term stochastic

emphasizes the adaptive nature of the automaton we describe here. The automaton described here

do not follow predetermined rules, but adapts to changes in its environment. This adaptation is

the result of the learning process described in this chapter.

“The concept of learning automaton grew out of a fusion of the work of

psychologists in modeling observed behavior, the efforts of statisticians to model the

choice of experiments based on past observations, the attempts of operation

researchers to implement optimal strategies in the context of the two-armed bandit

problem, and the endeavors of system theorists to make rational decisions in random

environments” [Narendra89].

In classical control theory, the control of a process is based on complete knowledge of the

process/system. The mathematical model is assumed to be known, and the inputs to the process

are deterministic functions of time. Later developments in control theory considered the

uncertainties present in the system. Stochastic control theory assumes that some of the

characteristics of the uncertainties are known. However, all those assumptions on uncertainties

and/or input functions may be insufficient to successfully control the system if changes. It is

then necessary to observe the process in operation and obtain further knowledge of the system,

i.e., additional information must be acquired on-line since a priori assumptions are not sufficient.

One approach is to view these as problems in learning.

Rule-based systems, although performing well on many control problems, have the

disadvantage of requiring modifications, even for a minor change in the problem space.

Furthermore, rule-based approach, especially expert systems, cannot handle unanticipated

situations. The idea behind designing a learning system is to guarantee robust behavior without

the complete knowledge, if any, of the system/environment to be controlled. A crucial advantage

of reinforcement learning compared to other learning approaches is that it requires no information

about the environment except for the reinforcement signal [Narendra89, Marsh93].

A reinforcement learning system is slower than other approaches for most applications

since every action needs to be tested a number of times for a satisfactory performance. Either the

learning process must be much faster than the environment changes (as is the case in our

Cem Ünsal Chapter 3. Stochastic Learning Automata 30

application), or the reinforcement learning must be combined with an adaptive forward model

that anticipates the changes in the environment [Peng93].

Learning

1

is defined as any permanent change in behavior as a result of past experience,

and a learning system should therefore have the ability to improve its behavior with time, toward

a final goal. In a purely mathematical context, the goal of a learning system is the optimization of

a functional not known explicitly [Narendra74].

In the 1960’s, Y. Z. Tsypkin [Tsypkin71] introduced a method to reduce the problem to

the determination of an optimal set of parameters and then apply stochastic hill climbing

techniques. M.L. Tsetlin and colleagues [Tsetlin73] started the work on learning automata during

the same period. An alternative approach to applying stochastic hill-climbing techniques,

introduced by Narendra and Viswanathan [Narendra72], is to regard the problem as one of finding

an optimal action out of a set of allowable actions and to achieve this using stochastic automata.

The difference between the two approaches is that the former updates the parameter space at

each iteration while the later updates the probability space.

The stochastic automaton attempts a solution of the problem without any information on

the optimal action (initially, equal probabilities are attached to all the actions). One action is

selected at random, the response from the environment is observed, action probabilities are

updated based on that response, and the procedure is repeated. A stochastic automaton acting as

described to improve its performance is called a learning automaton.

3.1 Earlier Works

The first learning automata models were developed in mathematical psychology. Early research in

this area is surveyed by Bush and Mosteller [Bush58] and Atkinson et al. [Atkinson65]. Tsetlin

[Tsetlin73] introduced deterministic automaton operating in random environments as a model of

learning. Most of the works done on learning automata theory has followed the trend set by

Tsetlin. Varshavski and Vorontsova [Varshavski63] described the use of stochastic automata

with updating of action probabilities which results in reduction in the number of states in

comparison with deterministic automata.

Fu and colleagues [Fu65a, Fu65b, Fu67, Fu69a, Fu69b, Fu71] were the first researchers

to introduce stochastic automata into the control literature. Applications to parameter estimation,

pattern recognition and game theory were initially considered by this school. Properties of linear

updating schemes and the concept of a ‘growing’ automaton are defined by McLaren

[McLaren66]. Chandrasekaran and Shen [Chand68,Chand69] studied nonlinear updating schemes,

nonstationary environments and games of automata. Narendra and Thathachar have studied the

theory and applications of learning autoamta and carried out simulation studies in the area. Their

book Learning Automata [Narendra89] is an introduction to learning automata theory which

surveys all the research done on the subject until the end of the 1980s.

1

Webster defines ‘learning’ as ‘to gain knowledge or understanding of a skill by study, instruction, or experience.’

Cem Ünsal Chapter 3. Stochastic Learning Automata 31

The most recent (and second most comprehensive) book since the Narendra-Thathachar

collaboration on learning automata theory and applications is published by Najim and Pznyak in

1994 [Najim94]. This book also includes several applications and examples of learning automata.

Until recently, the applications of learning automata to control problems were rare.

Consequently, successful updating algorithms and theorems for advanced automata applications

were not readily available. Nario Baba’s work [Baba85] on learning behaviors of stochastic

automata under a nonstationary multi-teacher environment, and general linear reinforcement

schemes [Bush58] are taken as a starting point for the research presented in this dissertation.

Recent applications of learning automata to real life problems include control of

absorption columns [Najim91], bioreactors [Gilbert92], control of manufacturing plants

[Sequeria91], pattern recognition [Oommen94a], graph partitioning [Oommen94b], active vehicle

suspension [Marsh93, 95], path planning for manipulators [Naruse93], distributed fuzzy logic

processor training [Ikonen97], and path planning [Tsoularis93] and action selection [Aoki95] for

autonomous mobile robots.

Recent theoretical results on learning algorithms and techniques can be found in

[Najim91b], [Najim94], [Sastry93], [Papadimitriou94], [Rajaraman96], [Najim96], and

[Poznyak96].

3.2 The Environment and the Automaton

The learning paradigm the learning automaton presents may be stated as follows: a finite number

of actions can be performed in a random environment. When a specific action is performed the

environment provides a random response which is either favorable or unfavorable. The objective

in the design of the automaton is to determine how the choice of the action at any stage should be

guided by past actions and responses. The important point to note is that the decisions must be

made with very little knowledge concerning the “nature” of the environment. It may have time-

varying characteristics, or the decision maker may be a part of a hierarchical decision structure

but unaware of its role in the hierarchy. Furthermore, the uncertainty may be due to the fact that

the output of the environment is influenced by the actions of other agents unknown to the

decision maker.

The environment in which the automaton “lives” responds to the action of the automaton

by producing a response, belonging to a set of allowable responses, which is probabilistically

related to the automaton action. The term environment

2

is not easy to define in the context of

learning automata. The definition encompasses a large class of unknown random media in which

an automaton can operate. Mathematically, an environment is represented by a triple {α, c, β}

where α represents a finite action/output set, β represents a (binary) input/response set, and c is

a set of penalty probabilities, where each element c

i

corresponds to one action α

i

of the set α.

The output (action) α(n) of the automaton belongs to the set α, and is applied to the

environment at time t = n. The input β(n) from the environment is an element of the set β and can

2

Commonly refers to the aggregate of all the external conditions and influences affecting the life and development of

an organism.

Cem Ünsal Chapter 3. Stochastic Learning Automata 32

take on one of the values β

1

and β

2

. In the simplest case, the values β

i

are chosen to be 0 and 1,

where 1 is associated with failure/penalty response. The elements of c are defined as:

{ } Prob β α α ( ) ( ) ( , ,...) n n c i

i i

· · · · 1 12 (3.1)

Therefore c

i

is the probability that the action α

i

will result in a penalty input from the

environment. When the penalty probabilities c

i

are constant, the environment is called a

stationary environment.

There are several models defined by the response set of the environment. Models in

which the input from the environment can take only one of two values, 0 or 1, are referred to as

P-models

3

. In this simplest case, the response value of 1 corresponds to an “unfavorable”

(failure, penalty) response, while output of 0 means the action is “favorable.” A further

generalization of the environment allows finite response sets with more than two elements that

may take finite number of values in an interval [a, b]. Such models are called Q-models. When the

input from the environment is a continuous random variable with possible values in an interval [a,

b], the model is named S-model.

Input Output

State

Transition Function

Output Function α

F

G β

φ

Automaton

Environment

Penalty probabilities c

Figure 3.1. The automaton and the environment.

The automaton can be represented by a quintuple {Φ, α, β, F(•,•), H(•,•)} where:

w Φ is a set of internal states. At any instant n, the state φ(n) is an element of the finite

set Φ = {φ

1

, φ

2

,..., φ

s

}

w α is a set of actions (or outputs of the automaton). The output or action of an

automaton an the instant n, denoted by α(n), is an element of the finite set α = {α

1,

α

2

,...,

α

r

}

w β is a set of responses (or inputs from the environment). The input from the

environment β(n) is an element of the set β which could be either a finite set or an infinite

set, such as an interval on the real line:

β β β β β · · { , ,..., } {( , )}

1 2 m

a b or (3.2)

3

We will follow the notation used in [Narendra89].

Cem Ünsal Chapter 3. Stochastic Learning Automata 33

w F(•,•): Φ × β → Φ is a function that maps the current state and input into the next

state. F can be deterministic or stochastic:

φ φ β ( ) [ ( ), ( )] n F n n + · 1 (3.3)

w H(•,•): Φ × β → α is a function that maps the current state and input into the current

output. If the current output depends on only the current state, the automaton is referred

to as state-output automaton. In this case, the function H(•,•) is replaced by an output

function G(•): Φ → α, which can be either deterministic or stochastic:

α φ ( ) [ ( )] n G n · (3.4)

For our applications, we choose the function G(•) as the identity function, i.e., the states

of the automata are also the actions.

3.3 The Stochastic Automaton

We now introduce the stochastic automaton in which at least one of the two mappings F and G is

stochastic. If the transition function F is stochastic, the elements f

ij

β

of F represent the

probability that the automaton moves from state φ

i

to state φ

j

following an input β

4

:

f n n n i j s

ij j i m

β

φ φ φ φ β β β β β β · + · · · · · Pr{ ( ) ( ) , ( ) } , , ,..., , ,..., 1 12

1 2

(3.5)

For the mapping G, the definition is similar:

g n n i j r

ij j i

· · · · Pr{ ( ) ( ) } , , ,..., α α φ φ 12 (3.6)

Since f

ij

β

are probabilities, they lie in the closed interval [a, b]

5

; and to conserve probability

measure we must have:

f

ij

j

s

β

·

∑

·

1

1 for each β and i. (3.7)

Example

States: φ

1

, φ

2

Inputs: β

1

,β

2

Transition matrices:

F

F

( )

. .

. .

( )

. .

. .

β

β

1

2

050 050

025 075

01 09

075 025

·

¸

1

]

1

·

¸

1

]

1

4

In Chapter 4, where we define our initial algorithm, the probabilities fij

β

will be the same for all inputs β and all

states i given a state φj. Furthermore, the output mapping G is chosen as identity mapping.

5

[a, b]=[0, 1] in most cases.

Cem Ünsal Chapter 3. Stochastic Learning Automata 34

Transition Graphs:

β · β

β · β

Φ Φ

1 2

Φ

1

Φ

2

1

2

0.90

0.10

0.75

0.25

0.50

0.50

0.75

0.25

In the previous example, the conditional probabilities f

ij

β

were assumed to be constant,

i.e., independent of the time step n, and the input sequence. Such a stochastic automaton is

referred to as a fixed-structure automaton. As we will discuss later, it is useful to update the

transition probabilities at each step n on the basis of the environment response at that instant.

Such an automaton is called variable-structure automaton.

Furthermore, in the case of variable-structure automaton, the above definitions of the

transition functions F and G are not used explicitly. Instead of transition matrices, a vector of

action probabilities p(n) is defined to describe the reinforcement schemes — as we introduce in

the next sections.

If the variable-structure automaton is a state-output automaton, and the probabilities of

transition from one state to another f

ij

β

do not depend on current state and environment input,

then the relation between the action probability vector and the transition matrices is as follows:

• Since the automaton is a state-output automaton, we omit the transition matrix G, and

only consider the transition matrix F.

• The transition does not depend on the environment response, therefore there is only

one state transition matrix F where the elements are f

ij

. For example, we must have:

F F F

f f

f f

( ) ( ) β β

1 2

11 12

21 22

· ≡ ·

¸

1

]

1

(3.8)

• Furthermore, the probability of being in one state (or generating one action) does not

depend on the initial/previous state of the automaton. Therefore, the transition matrix reduces to:

[ ] F f f f f f f p · · ≡ · ≡ ≡

11 21 1 12 22 2

(3.9)

where vector p consists of probabilities p

i

of the automaton being in the i

th

state (or choosing the

i

th

output/action).

3.4 Variable Structure Automaton and Its Performance Evaluation

Before introducing the variable structure automata, we will introduce the definitions necessary to

evaluate the performance of a learning variable-structure automaton. A learning automaton

generates a sequence of actions on the basis of its interaction with the environment. If the

automaton is “learning” in the process, its performance must be superior to “intuitive” methods.

Cem Ünsal Chapter 3. Stochastic Learning Automata 35

To judge the performance of the automaton, we need to set up quantitative norms of behavior.

The quantitative basis for assessing the learning behavior is quite complex, even in the simplest

P-model and stationary random environments. To introduce the definitions for “norms of

behavior”, we will consider this simplest case. Further definitions for other models and non-

stationary environments will be given whenever necessary.

3.4.1 Norms of Behavior

If no prior information is available, there is no basis in which the different actions α

i

can be

distinguished. In such a case, all action probabilities would be equal — a “pure chance” situation.

For an r-action automaton, the action probability vector p(n) = Pr {α(n) = α

i

} is given by:

p n

r

i r

i

( ) , ,..., · ·

1

12 (3.10)

Such an automaton is called “pure chance automaton,” and will be used as the standard for

comparison. Any learning automaton must at least do better than a pure-chance automaton.

Consider a stationary random environment with penalty probabilities

{ } c c c

r 1 2

, ,..., where { } c n n

i i

· · · Pr ( ) ( ) β α α 1 . We define a quantity M(n) as the average

penalty for a given action probability vector:

M n E n p n n p n

n n n

c p n

i

i

r

i

i i

i

r

( ) [ ( ) ( )] Pr{ ( ) ( )}

Pr{ ( ) ( ) } Pr{ ( ) }

( )

· · ·

· · · ⋅ ·

·

·

·

∑

∑

β β

β α α α α

1

1

1

1

(3.11)

For the pure-chance automaton, M(n) is a constant denoted by M

o

:

M

r

c

o i

i

r

·

·

∑

1

1

(3.12)

Also note that:

E M n E E n p n

E n

[ ( )] { [ ( ) ( )]}

[ ( )]

·

·

β

β

(3.13)

i.e., E[M(n)] is the average input to the automaton. Using above definitions, we have the

following:

Definition 3.1: A learning automaton is expedient if lim [ ( )]

n

o

E M n M

→∞

< .

Since p n

i

i

r

( ) ·

·

∑

1

1

, we can write inf ( ) inf { ( )} min { }

( )

M n c p n c c

p n i i i i

i

r

l

· · ≡

·

∑

1

.

Definition 3.2: A learning automaton is said to be optimal if:

lim [ ( )]

n

l

E M n c

→∞

· (3.14)

Cem Ünsal Chapter 3. Stochastic Learning Automata 36

Optimality implies that the action α

l

is associated with the minimum penalty probability c

l

is

chosen asymptotically with probability one. In spite of efforts of many researchers, the general

algorithm which ensures optimality has not been found [Baba83, Kushner72, Narendra89]. It

may not be possible to achieve optimality in every given situation. In this case, a suboptimal

behavior is defined.

Definition 3.3: A learning automaton is e-optimal if:

lim [ ( )]

n

l

E M n c

→∞

· + ε (3.15)

where e is arbitrarily small positive number.

Some automata satisfy the conditions stated in the definitions for specified initial

conditions and for certain sets of penalty probabilities. However, we may be interested in

automata that exhibit a desired behavior in arbitrary environments and for arbitrary initial

conditions. These requirements are partially met by an absolutely expedient automaton, which is

defined as:

Definition 3.4: A learning automaton is absolutely expedient if:

E[M(n+1) | p(n)] < M(n) (3.16)

for all n, all p

i

(n) ∈ (0, 1) and for all possible sets { } c c c

r 1 2

, ,..., . Taking the expectations of both

sides, we can further see that:

E M n E M n [ ( )] [ ( )] + < 1 (3.17)

In the application of learning automata techniques to intelligent control, we are mainly

interested in the case where there is one “optimal” action α

j

with c

j

= 0

6

. In such an environment,

a learning automaton is expected to reach a “pure optimal” strategy [Najim94]:

p n e

j

r

( ) → (3.18)

where e

j

r

is the unit vector with j

th

component equal to 1. In other words, the automaton will be

optimal since [ ] E M n c p n c

i i

i

r

l

( ) ( ) · ∑ →

·1

.

3.4.2 Variable Structure Automata

A more flexible learning automaton model can be created by considering more general stochastic

systems in which the action probabilities (or the state transitions) are updated at every stage

using a reinforcement scheme. In this section, we will introduce the general description of a

reinforcement scheme, linear and nonlinear. For simplicity, we assume that each state

6

We will discuss the feasibility of the case later.

Cem Ünsal Chapter 3. Stochastic Learning Automata 37

corresponds to one action, i.e., the automaton is a state-output automaton. The simulation

examples in later chapters also use the same identity state-output mapping.

In general terms, a reinforcement scheme can be represented as follows:

p n T p n n n ( ) [ ( ), ( ), ( )] + · 1

1

α β (3.19a)

or

f n T f n n n n

ij ij

β β

φ φ β ( ) [ ( ), ( ), ( ) ( )] + · + 1 1

2

(3.19b)

where T

1

and T

2

are mappings. Again, α is the action, β is the input from environment, and φ is

the state of the automaton. From now on, we will use the first mathematical description (3.19a)

for reinforcement schemes. If p(n+1) is a linear function of p(n), the reinforcement scheme is said

to be linear; otherwise it is termed nonlinear.

The simplest case

Consider a variable-structure automaton with r actions in a stationary environment with

{ } β · 0 1 , . The general scheme for updating action probabilities is as follows:

If α(n) = α

i

when β = 0

p n p n g p n j i

p n p n g p n

j j i

i i k

k

k i

r

( + ) = ( ) ( ( )) for all

( + ) = ( ) ( ( ))

1

1

1

− ≠

+

·

≠

∑

(3.20)

when β = 1

p n p n h p n j i

p n p n h p n

j j i

i i k

k

k i

r

( + ) = ( ) ( ( )) for all

( + ) = ( ) ( ( ))

1

1

1

+ ≠

−

·

≠

∑

where g

k

and h

k

(k=1,2,...,r) are continuous, nonnegative functions with the following

assumptions:

0

0 1

1

< <

< + <

·

≠

∑

g p n p n

p n h p n

k k

k k

k

k i

r

( ( )) ( )

[ ( ) ( ( ))]

(3.21)

for all i=1,2,...,r and all probabilities p

k

in the open interval (0,1). The two constraints on update

functions guarantees that the sum of all action probabilities is 1 at every time step.

3.5 Reinforcement Schemes

The reinforcement scheme is the basis of the learning process for learning automata. These

schemes are categorized based on their linearity. A variety of linear, nonlinear and hybrid

schemes exists for variable structure learning automata [Narendra89]. The linearity characteristic

of a reinforcement scheme is defined by the linearity of the update functions g

k

and h

k

(see the

Cem Ünsal Chapter 3. Stochastic Learning Automata 38

examples in the next section). It is also possible to divide all possible algorithms for stochastic

learning automata into the following two classes [Najim94]:

• nonprojectional algorithms, where individual probabilities are updated based on their previous

values:

( ) p n p n k p n n n

i i i i

( ) ( ) ( ), ( ), ( ) + · + ⋅ 1 δ α β (3.22)

where δ is a small real number and k

i

are the mapping functions for the update term.

• projectional algorithms, where the probabilities are updated by a function which maps the

probability vector to a specific value for each element:

( ) p n k p n n n

i i

( ) ( ), ( ), ( ) + · 1 α β (3.23)

where the functions k

i

are the mapping functions.

The main difference between these two subclasses is that nonprojectional algorithms can

only be used with binary environment response (i.e., in a P-model environment). Projectional

algorithms may be used in all environment types and are more complex. Furthermore,

implementation of projectional algorithms are computationally more tasking.

Early studies of reinforcement schemes were centered around linear schemes for reasons

of analytical simplicity. The need for more complex and efficient reinforcement schemes

eventually lead researchers to nonlinear (and hybrid) schemes. We will first introduce several

well-known linear schemes, and then close this section with absolutely expedient nonlinear

schemes.

3.5.1 Linear Reinforcement Schemes

For an r-action learning automaton, the general definition of linear reinforcement schemes can be

obtained by the substitution of the functions in Equation 3.20 as follows:

g p n a p n

h p n

b

r

b p n

a b

k k

k k

( ( )) ( )

( ( )) ( )

,

· ⋅

·

−

− ⋅

< <

1

0 1

(3.24)

Therefore, the scheme corresponding to general linear schemes is given as:

If α(n) = α

i

,

when β = 0

p n a p n j i

p n p n a p n

j j

i i

( + ) = ( ) ( ) for all

( + ) = ( )

i

1 1

1 1

− ⋅ ≠

+ ⋅ −

¹

'

¹

[ ( )]

when β = 1

p n

b

r

b p n j i

p n b p n

j j

i i

( + ) = +( ) ( ) for all

( + ) = ( ) ( )

1

1

1

1 1

−

− ⋅ ≠

− ⋅

¹

'

¹

¹

¹

(3.25)

As seen from the definition, the parameter a is associated with reward response, and the

parameter b with penalty response. If the learning parameters a and b are equal, the scheme is

Cem Ünsal Chapter 3. Stochastic Learning Automata 39

called the linear reward-penalty scheme L

R-P

[Bush58]. In this case, the update rate of the

probability vector is the same at every time step, regardless of the environment response. This

scheme is the earliest scheme considered in mathematical psychology.

For the linear reward-penalty scheme L

R-P

, E[p(n)], the expected value of the probability

vector at time step n, can be easily evaluated, and, by analyzing eigenvalues of the resulting

difference equation, it can be shown that asymptotic solution of the set of difference equations

enables us to conclude [Narendra89]:

lim [ ( )]

n

k

k

r

k

k

r

o

E M n

r

c

c

r

M

→∞

·

·

· < ·

∑

∑

1

1

1

(3.26)

Therefore, from Definition 3.1, the multi-action automaton using the L

R-P

scheme is expedient for

all initial action probabilities and in all stationary random environments.

Expediency is a relatively weak condition on the learning behavior of a variable-structure

automaton. An expedient automaton will do better than a pure chance automaton, but it is not

guaranteed to reach the optimal solution. In order to obtain a better learning mechanism, the

parameters of the linear reinforcement scheme are changed as follows: if the learning parameter b

is set to 0, then the scheme is named the linear reward-inaction scheme L

R-I

. This means that the

action probabilities are updated in the case of a reward response from the environment, but no

penalties are assessed.

For this scheme, it is possible to show that p

l

(n), the probability of the action α

l

with

minimum penalty probability c

l

, monotonically approaches 1. By making the parameter a

arbitrarily small, we have Pr lim

n→∞

p

l

(n) · 1

{ }

as close to unity as desired. This makes the learning

automata ε -optimal [Narendra89].

3.5.2 Nonlinear Learning Algorithms: Absolutely Expedient Schemes

Although early studies of automata learning were done mostly on linear schemes, a few attempts

were made to study nonlinear schemes [Vor65, Chand68, Shapiro69, Vis72]. Evolution of early

reinforcement schemes occurred mostly in a heuristic manner. For two-action automata, design

and evaluation of schemes was almost straight forward, since actually only one action was being

updated. Generalization of such schemes to multi-action case was not straightforward

[Narendra89]. This problem led to a synthesis approach toward reinforcement schemes.

Researchers started asking the question: “What are the conditions on the updating functions that

guarantee a desired behavior?” The problem viewed in the light of this approach resulted in a new

concept of absolute expediency. While searching for a reinforcement scheme that guarantees

optimality and/or absolute expediency, researchers eventually started considering schemes that

can be characterized as nonlinear. The update functions g

k

and h

k

are nonlinear functions of

Cem Ünsal Chapter 3. Stochastic Learning Automata 40

penalty probabilities. For example, the following update functions are defined by previous

researchers for nonlinear reinforcement schemes

7

:

• g p n h p n

a

r

p n p n

j j i i

( ( )) ( ( )) ( )( ( )) · ·

−

−

1

1

This scheme is absolutely expedient for restricted initial conditions and for restricted

environments [Shapiro69].

•

g p n p n p n

h p n

p n p n

r

j j

j

i i

( ( )) ( ) ( ( ))

( ( ))

( ) ( ( ))

· −

·

−

−

¹

'

¹

¹

¹

φ

φ

1

whereφ( ) x ax

m

· (m = 2,3,…). Again, for expediency several conditions on the penalty

probabilities must be satisfied (e.g., c

m

l

<

1

and c

m

j

>

1

for j ≠ l). [Chand68]

•

g p n a p n p n p p n

h p n b p n p n p p n

j j j

j j j

( ( )) ( ( ), ( )) ( ( ))

( ( )) ( ( ), ( )) ( ( ))

· − −

· − −

¹

'

¹

¹

¹

+

+

φ

φ

θ θ

θ θ

1 1

1

1 1

1

1 1

1 1

where φ φ ( , ) ( , ) p p p p

1 1 1 1

1 1 − · − is a nonlinear function which can be suitably selected, and

θ ≥ 1. Parameters a and b must be chosen properly to satisfy the conditions in Equation

3.21. Sufficient conditions for ensuring optimality is given in [Vor65], but they are valid only

for two-action automata.

The general solution for absolutely expedient schemes is found in early 1970’s by

Lakshmivarahan and Thathachar [Lakshmivarahan73]. Absolutely expedient learning schemes are

presently the only class of schemes for which necessary and sufficient conditions of design are

available. This class of schemes can be considered as the generalization of the L

R-I

scheme.

Consider the general reinforcement scheme in Equation 3.20. A learning automaton using this

scheme is absolutely expedient if and only if the functions g

k

(p) and h

k

(p) satisfy the following

conditions:

g p

p

g p

p

g p

p

p

h p

p

h p

p

h p

p

p

r

r

r

r

1

1

2

2

1

1

2

2

( ) ( ) ( )

( )

( ) ( ) ( )

( )

· ·⋅ ⋅ ⋅ · ·

· ·⋅ ⋅ ⋅ · ·

λ

µ

(3.27)

where λ(p) and µ(p) are arbitrary functions satisfying

8

:

7

Again, α α ( ) n

i

· is assumed. We give only the updating functions for action probabilities other than the current

action; the function for current action can be obtained by using the fact that the sum of probabilities is equal to 1.

8

The reason for the conditions on the update functions will be explained in detail in Chapter 6.

Cem Ünsal Chapter 3. Stochastic Learning Automata 41

0 1

0

1 1

< <

< <

−

¹

'

¹

¹

¹

¹

;

¹

¹

¹ ·

λ

µ

( )

( ) min

,...,

p

p

p

p j r

j

j

(3.28)

Detailed proof of this theorem is given in both [Baba84] and [Narendra89]. A similar proof (for a

new nonlinear absolutely expedient scheme) is given in Chapter 6.

3.6 Extensions of the Basic Automata-Environment Model

3.6.1 S-Model Environments

Up to this point we based our discussion of learning automata only on the P-model environment,

where the input to the automaton is either 0 or 1. We now consider the possibility of input

values over the unit interval [a, b]. Narendra and Thathachar state that such a model is very

relevant in control systems with a continuous valued performance index. In previous sections, the

algorithms (for P-model) were stated in terms of the functions g

k

and h

k

which are the reward and

penalty functions respectively. Since in S- (and Q-) models the outputs lie in the interval [0,1],

and therefore are neither totally favorable or totally unfavorable, the main problem is to

determine how the probabilities of all actions are to be updated. Narendra and Thathachar

[Narendra89] show that all the principal results derived for the P-model carry over the more

realistic S- and Q-models also. Here, we will not discuss the Q-model; all definitions for the Q-

model are similar to the P-model.

In the case where the environment outputs β

k

are not in the interval [0,1], but in [a, b] for

some a ,b ∈ IR, it is always possible to map the output into the unit interval using:

β

β

k

k

a

b a

·

−

−

(3.29)

where a = min

i

{β

i

} and b = max

i

{β

i

}

9

. Therefore, only the normalized S-model will be

considered.

The average penalty M(n) is still defined as earlier, but instead of c

i

∈{0,1}, s

i

∈[0,1] are

used in the definition:

M n E n p n

E n p n n n

s p

k

k

r

k

k k

k

r

( ) [ ( ) ( )]

[ ( ) ( ), ( ) ] Pr[ ( ) ]

·

· · ⋅ ·

·

·

·

∑

∑

β

β α α α α

1

1

(3.30)

where:

s E n n

k k

· · [ ( ) ( ) ] β α α (3.31)

9

Note that this normalization procedure requires the knowledge of a and b.

Cem Ünsal Chapter 3. Stochastic Learning Automata 42

In the S-model, s

k

plays the same role as the penalty probabilities in the P-model, and are

called penalty strengths. Again, for a pure-chance automaton with all the action probabilities

equal, the average penalty M

o

is given by:

M

r

s

o i

i

r

·

·

∑

1

1

(3.32)

The definitions of expediency, optimality, ε -optimality, and absolute expediency are the

same, except the fact that c

i

’s are replaced by s

i

’s. Now, consider the general reinforcement

scheme (Equation 3.20). The direct generalization of this scheme to S-model gives:

p n p n n g p n n h p n if n

p n p n n g p n n p h p n if n

i i i i i

i i j

j i

i j

j i

i

( ) ( ) ( ( )) ( ( )) ( ) ( ( )) ( )

( ) ( ) ( ( )) ( ( )) ( ) ( ( )) ( )

+ · − − + ≠

+ · + − − ·

≠ ≠

∑ ∑

1 1

1 1

β β α α

β β α α

(3.33)

The previous algorithm (Equation 3.20) can easily be obtained from the above definition

by substituting β( ) n · 0 and 1. Again, the functions g

i

and h

i

can be associated with reward and

penalty respectively. The value 1− β( ) n is an indication of how far β( ) n is away from the

maximum possible value of unity, and is maximum when β( ) n is 0. The functions g

i

and h

i

have to

satisfy the conditions for Equation 3.21 for p

i

(n) to remain in the interval [0,1] at each step. The

non-negativity condition on g

i

and h

i

maintains the reward-penalty character of these functions,

though it is not necessary.

For example the L

R-P

reinforcement scheme (Equation 3.25) can be rewritten for S-model

(SL

R-P

) as:

p n p n n a p n n

a

r

a p n if n

p n p n n a p n n a p n if n

i i i i

i i i i i

( ) ( ) [ ( )] ( ) ( ) [ ( )] ( )

( ) ( ) [ ( )] [ ( )] ( ) ( )

+ · − − ⋅ ⋅ + ⋅

−

− ⋅ ≠

+ · + − ⋅ ⋅ − − ⋅ ⋅ ·

1 1

1

1 1 1

β β α α

β β α α

( )

(3.35)

Similarly, the asymptotic value of the average penalty is:

lim [ ( )]

n

i

k

r

E M n

r

s

→∞

·

·

∑

1

1

(3.36)

Thus, the SL

R-P

scheme has analogous properties to the L

R-P

scheme. It is expedient in all

stationary random environments and the limiting expectations of the action probabilities are

inversely proportional to the corresponding s

i

. The conditions of Equations 3.27 and 3.28 for

absolute expediency also applies to updating algorithms for S-model environment.

3.6.2 Nonstationary Environments

Thus far we have considered only the automaton in a stationary environment. However, the need

for learning and adaptation in systems is mainly due to the fact that the environment changes

Cem Ünsal Chapter 3. Stochastic Learning Automata 43

with time. The performance of a learning automaton should be judged in such a context. If a

learning automaton with a fixed strategy is used in a nonstationary environment, it may become

less expedient, and even non-expedient. The learning scheme must have sufficient flexibility to

track the better actions. The aim in these cases is not to evolve to a single action which is

optimal, but to choose the actions to minimize the expected penalty. As in adaptive control

theory, the practical justification for the study of stationary environments is based on the

assumption that if the convergence of a scheme is sufficiently fast, then acceptable performance

can be achieved in slowly changing environments.

An environment is nonstationary if the penalty probabilities c

i

(or s

i

) vary with time. The

success of the learning procedure then depends on the environment changes as well as the prior

information we have about the environment. From an analytic standpoint, the simplest case is the

one in which penalty probability vector c(n) can have a finite number of values. We can visualize

this situation as an automaton operating in a finite set of stationary environments, and assign a

different automaton A

i

to each environment E

i

. If each automaton uses an absolutely expedient

scheme, then ε -optimality can be achieved asymptotically for the overall system. However, this

method requires that the system be aware of the “environmental changes” in order to assign the

“correct” automaton, and the number of environments must not be large. Furthermore, since the

automaton A

i

is updated only when the environment is E

i

, updating strategies may happen

infrequently so that the corresponding action probabilities do not converge [Narendra89]. In our

application, a similar approach is employed as explained in Chapter 4.

There are two classes of environments that have been analytically investigated:

Markovian switching environments, and state-dependent nonstationary environments. If we

assume that the environments E

i

are states of a Markov chain described by a state transition

matrix, then it is possible to view the variable-structure automaton in the Markovian switching

environment as a homogeneous Markov chain. Therefore, analytical investigations are possible.

Secondly, there are four different situations wherein the states of the nonstationary environment

vary with the step n, and analytical tools developed earlier are adequate to obtain a fairly

complete descriptions of the learning process. These are:

w Fixed-structure automata in Markovian environments

w Variable-structure automata in state-dependent nonstationary environments

w Hierarchy of automata

w Nonstationary environments with fixed optimal actions

The fundamental concepts such as expediency, optimality, and absolute expediency have

to be re-examined for nonstationary environments, since, for example, the optimal action can

change with time. Here, we again consider the P-model for purposes of stating the definitions.

The average penalty M(n) becomes:

M n c n p n

i i

i

r

( ) ( ) ( ) ·

·

∑

1

(3.37)

Cem Ünsal Chapter 3. Stochastic Learning Automata 44

The pure-chance automaton has an average penalty M

o

as a function of time:

M n

r

c n

o i

i

r

( ) ( ) ·

·

∑

1

1

(3.38)

Using these descriptions, we now have the following:

Definition 3.5: A learning automaton in nonstationary environment is expedient if there exists a

n

o

such that for all n > n

o

, we have:

E M n M n

o

[ ( )] ( ) − < 0 (3.39)

The previous concept of optimality can be applied to some specific nonstationary

environments. For example, if there exists a fixed l such that c

l

(n) < c

i

(n) for all i = 1,2,...,r, i ≠ l,

and for all n (or at least for all n > n

1

for some n

1

), then α

l

is the optimal action; the original

definitions of optimality andε -optimality still hold. In general, we have:

Definition 3.6: In a nonstationary environment where there is no fixed optimal action, an

automaton can be defined as optimal if it minimizes:

lim [ ( )]

T

n

T

T

E n

→∞

·

∑

1

1

β (3.40)

The definition of absolute expediency can be retained as before, with time-varying

penalty probabilities and the new description in Equation 3.37.

3.6.3 Multi-Teacher Environments

All the learning schemes and behavior norms discussed up to this point have dealt with a single

automaton in a single environment. However, it is possible to have an environment in which the

actions of an automaton evoke a vector of responses due to multiple performance criteria. We

will describe such environments as multi-teacher environments. Then, the automata have to

“find” an action that satisfies all the teachers.

In a multi-teacher environment, the automaton is connected to N teachers (or N single-

teacher environments). Then, a single-teacher environment is described by the action set α, the

output set β

i

={0,1} (for a P-model), and a set of penalty probabilities { } c c c

i i

r

i

1 2

, ,..., where

{ }

c n n

j

i i

j

· · · Pr ( ) ( ) β α α 1 . The action set of the automaton is of course the same for all

teachers/environments. Baba discussed the problem of a variable-structure automaton operating

in a multi-teacher environment [Baba85]. Conditions for absolute expediency are given in his

works (see also Chapter 6 for new nonlinear reinforcement schemes).

Some difficulty may arise while formulating a mathematical model of the learning

automaton in a multi-teacher environment. Since we have multiple responses, the task of

“interpreting” the output vector is important. Are the outputs from different teachers to be

summed after normalization? Can we introduce weight factors associated with specific teachers?

Cem Ünsal Chapter 3. Stochastic Learning Automata 45

If the environment is of a P-model, how should we combine the outputs? In the simplest case

that all teachers agree on the ordering of the actions, i.e., one action is optimal for all

environments, the updating schemes are easily defined. However, this will not be the case in our

application.

The elements of the environment output vector must be combined in some fashion to

form the input to the automaton. One possibility is to use to use an AND-gate as described in

[Thathachar77]. Another method is taking the average, or weighted average of the all responses

[Narendra89, Baba85]. This is valid for Q- or S-model environments. Our application uses logic

gates (OR) and additional if-then conditions as described in Chapters 4 and 5.

3.6.4 Interconnected Automata

Although we based our discussion to a single automaton, it is possible that there are more than

one automata in an environment. If the interaction between different automata is provided by the

environment, the case of multi-automata is not different than a single automaton case. The

environment reacts to the actions of multiple automata, and the environment output is a result of

the combined effect of actions chosen by all automata. If there is direct interaction between the

automata, such as the hierarchical (or sequential) automata models, the actions of some automata

directly depend on the actions of others.

It is generally recognized that the potential of learning automata can be increased if

specific rules for interconnections can be established. There are two main interconnected

automata models: synchronous and sequential. An important consequence of the synchronous

model is that the resulting configurations can be viewed as games of automata with particular

payoff structures. In the sequential model, only one automaton acts at each time step. The action

chosen determines which automaton acts at the next time step. Such a sequential model can be

viewed as a network of automata in which control passes from one automaton to another. At

each step, one decision maker controls the state transitions of the decision process, and the goal

of the whole group is to optimize some overall performance criterion.

In Chapters 4 and 5, the interconnection between automata is considered. Since each

vehicle’s planning layer will include two automata — one for lateral, the other for longitudinal

actions — the interdependence of these two sets of actions automatically results in an

interconnected automata network. Furthermore, multiple autonomous vehicles with multiple

automata as path controllers constitute a more complex automata system as described in Chapter

7.

Their book Learning Automata [Narendra89] is an introduction to learning automata theory which surveys all the research done on the subject until the end of the 1980s. 3. Tsypkin [Tsypkin71] introduced a method to reduce the problem to the determination of an optimal set of parameters and then apply stochastic hill climbing techniques. Fu71] were the first researchers to introduce stochastic automata into the control literature. or the reinforcement learning must be combined with an adaptive forward model that anticipates the changes in the environment [Peng93]. toward a final goal. the goal of a learning system is the optimization of a functional not known explicitly [Narendra74]. A stochastic automaton acting as described to improve its performance is called a learning automaton. Tsetlin [Tsetlin73] introduced deterministic automaton operating in random environments as a model of learning. instruction. action probabilities are updated based on that response. Most of the works done on learning automata theory has followed the trend set by Tsetlin. Fu69b. In a purely mathematical context. Applications to parameter estimation. Learning1 is defined as any permanent change in behavior as a result of past experience. The difference between the two approaches is that the former updates the parameter space at each iteration while the later updates the probability space.’ . Narendra and Thathachar have studied the theory and applications of learning autoamta and carried out simulation studies in the area. is to regard the problem as one of finding an optimal action out of a set of allowable actions and to achieve this using stochastic automata. Stochastic Learning Automata 30 application). nonstationary environments and games of automata. One action is selected at random. and a learning system should therefore have the ability to improve its behavior with time. M. or experience. Properties of linear updating schemes and the concept of a ‘growing’ automaton are defined by McLaren [McLaren66]. [Atkinson65]. Fu and colleagues [Fu65a. Fu69a. Y.1 Earlier Works The first learning automata models were developed in mathematical psychology.Chand69] studied nonlinear updating schemes. introduced by Narendra and Viswanathan [Narendra72]. Varshavski and Vorontsova [Varshavski63] described the use of stochastic automata with updating of action probabilities which results in reduction in the number of states in comparison with deterministic automata. In the 1960’s. and the procedure is repeated. Early research in this area is surveyed by Bush and Mosteller [Bush58] and Atkinson et al. Chandrasekaran and Shen [Chand68. pattern recognition and game theory were initially considered by this school. Fu67. Z. equal probabilities are attached to all the actions). 1 Webster defines ‘learning’ as ‘to gain knowledge or understanding of a skill by study. the response from the environment is observed.Cem Ünsal Chapter 3. The stochastic automaton attempts a solution of the problem without any information on the optimal action (initially.L. Fu65b. Tsetlin and colleagues [Tsetlin73] started the work on learning automata during the same period. An alternative approach to applying stochastic hill-climbing techniques.

or the decision maker may be a part of a hierarchical decision structure but unaware of its role in the hierarchy.2 The Environment and the Automaton The learning paradigm the learning automaton presents may be stated as follows: a finite number of actions can be performed in a random environment. [Najim94]. The input β(n) from the environment is an element of the set β and can 2 Commonly refers to the aggregate of all the external conditions and influences affecting the life and development of an organism. This book also includes several applications and examples of learning automata. and path planning [Tsoularis93] and action selection [Aoki95] for autonomous mobile robots. [Papadimitriou94]. Consequently. [Sastry93]. the uncertainty may be due to the fact that the output of the environment is influenced by the actions of other agents unknown to the decision maker. Stochastic Learning Automata 31 The most recent (and second most comprehensive) book since the Narendra-Thathachar collaboration on learning automata theory and applications is published by Najim and Pznyak in 1994 [Najim94]. and is applied to the environment at time t = n. The output (action) α(n) of the automaton belongs to the set α. distributed fuzzy logic processor training [Ikonen97]. When a specific action is performed the environment provides a random response which is either favorable or unfavorable. Nario Baba’s work [Baba85] on learning behaviors of stochastic automata under a nonstationary multi-teacher environment. which is probabilistically related to the automaton action. Mathematically. successful updating algorithms and theorems for advanced automata applications were not readily available. [Rajaraman96]. [Najim96]. where each element ci corresponds to one action α i of the set α. The important point to note is that the decisions must be made with very little knowledge concerning the “nature” of the environment. Recent theoretical results on learning algorithms and techniques can be found in [Najim91b]. an environment is represented by a triple {α. The term environment2 is not easy to define in the context of learning automata. path planning for manipulators [Naruse93]. bioreactors [Gilbert92]. β} where α represents a finite action/output set. .Cem Ünsal Chapter 3. c. 95]. graph partitioning [Oommen94b]. The environment in which the automaton “lives” responds to the action of the automaton by producing a response. and [Poznyak96]. Until recently. and general linear reinforcement schemes [Bush58] are taken as a starting point for the research presented in this dissertation. It may have timevarying characteristics. control of manufacturing plants [Sequeria91]. belonging to a set of allowable responses. Furthermore. The definition encompasses a large class of unknown random media in which an automaton can operate. 3. pattern recognition [Oommen94a]. The objective in the design of the automaton is to determine how the choice of the action at any stage should be guided by past actions and responses. active vehicle suspension [Marsh93. and c is a set of penalty probabilities. β represents a (binary) input/response set. the applications of learning automata to control problems were rare. Recent applications of learning automata to real life problems include control of absorption columns [Najim91].

the state φ(n) is an element of the finite set Φ = {φ1. while output of 0 means the action is “favorable. the values β i are chosen to be 0 and 1.•)} where: w Φ is a set of internal states. the response value of 1 corresponds to an “unfavorable” (failure. is an element of the finite set α = {α 1.. In the simplest case..1) Environment Penalty probabilities c Figure 3. There are several models defined by the response set of the environment. β.) . The automaton and the environment.1.. αr} w β is a set of responses (or inputs from the environment). Stochastic Learning Automata 32 take on one of the values β 1 and β 2. In this simplest case.2) 3 We will follow the notation used in [Narendra89]. the model is named S-model. are referred to as P-models3. The output or action of an automaton an the instant n. penalty) response. Such models are called Q-models..” A further generalization of the environment allows finite response sets with more than two elements that may take finite number of values in an interval [a. φs} w α is a set of actions (or outputs of the automaton).. such as an interval on the real line: β = {β 1 . H(•. the environment is called a stationary environment... φ2. . b].Cem Ünsal Chapter 3... F(•. The input from the environment β(n) is an element of the set β which could be either a finite set or an infinite set.•)... denoted by α(n)... At any instant n. b]. State φ Transition Function F Input β Output Function G Automaton Output α { }= c i (i = 12 . The automaton can be represented by a quintuple {Φ. (3. Models in which the input from the environment can take only one of two values. The elements of c are defined as: Prob β (n) = 1 α (n) = α i Therefore ci is the probability that the action α i will result in a penalty input from the environment. β m } or β = {(a. 0 or 1.. α.. where 1 is associated with failure/penalty response. When the penalty probabilities ci are constant. b)} (3. β 2 . α 2. When the input from the environment is a continuous random variable with possible values in an interval [a.

3 The Stochastic Automaton We now introduce the stochastic automaton in which at least one of the two mappings F and G is stochastic. 1] in most cases. β 2 . the elements fijβ of F represent the probability that the automaton moves from state φi to state φj following an input β 4: f β ij = Pr{φ (n + 1) = φ j φ (n) = φ i .. j = 12 . b]5. In this case. 3. β = β 1 . 4 .4) For our applications.β 2 . we choose the function G(•) as the identity function. β (n)] (3. (3..•): Φ × β → Φ is a function that maps the current state and input into the next state. and to conserve probability measure we must have: β ∑ f ij = 1 j =1 s for each β and i.. the automaton is referred to as state-output automaton.7) Example States: φ1. b]=[0.. 075 025 Transition matrices: In Chapter 4..e. φ2 Inputs: β 1 . which can be either deterministic or stochastic: α (n) = G[φ ( n)] (3. If the transition function F is stochastic. If the current output depends on only the current state.6) Since fijβ are probabilities.5) For the mapping G.•): Φ × β → α is a function that maps the current state and input into the current output. β (n) = β } i . s . (3. 09 .. 050 050 F( β 1 ) = . 5 [a.•) is replaced by an output function G(•): Φ → α. the definition is similar: g ij = Pr{α (n) = α j φ (n) = φ i } i . the function H(•. the output mapping G is chosen as identity mapping.. 01 F( β 2 ) = . i. F can be deterministic or stochastic: φ (n + 1) = F[φ (n). the probabilities fij β will be the same for all inputs β and all states i given a state φj ..Cem Ünsal Chapter 3. .. 025 075 .. they lie in the closed interval [a.. . .. j = 12 . where we define our initial algorithm.3) w H(•.. the states of the automata are also the actions. β m (3. r . Stochastic Learning Automata 33 w F(•. Furthermore.

it is useful to update the transition probabilities at each step n on the basis of the environment response at that instant. and only consider the transition matrix F. Therefore. a vector of action probabilities p(n) is defined to describe the reinforcement schemes — as we introduce in the next sections. [ ] 3. its performance must be superior to “intuitive” methods. If the automaton is “learning” in the process. Stochastic Learning Automata 34 Transition Graphs: 0. we will introduce the definitions necessary to evaluate the performance of a learning variable-structure automaton.8) f 21 f 22 • Furthermore. independent of the time step n. and the input sequence. Furthermore.10 Φ1 0.25 Φ2 0. the transition matrix reduces to: F = f 11 = f 21 ≡ f 1 f 12 = f 22 ≡ f 2 ≡ p (3.75 Φ2 0. i.75 In the previous example.50 Φ1 0. As we will discuss later. For example. . we must have: f 11 f 12 F (β1 ) = F ( β2 ) ≡ F = (3. Such a stochastic automaton is referred to as a fixed-structure automaton. Such an automaton is called variable-structure automaton. A learning automaton generates a sequence of actions on the basis of its interaction with the environment. If the variable-structure automaton is a state-output automaton.Cem Ünsal Chapter 3. and the probabilities of transition from one state to another fijβ do not depend on current state and environment input. then the relation between the action probability vector and the transition matrices is as follows: • Since the automaton is a state-output automaton. therefore there is only one state transition matrix F where the elements are fij.4 Variable Structure Automaton and Its Performance Evaluation Before introducing the variable structure automata. we omit the transition matrix G.. the above definitions of the transition functions F and G are not used explicitly. the conditional probabilities fijβ were assumed to be constant.e.9) th where vector p consists of probabilities pi of the automaton being in the i state (or choosing the ith output/action). Instead of transition matrices. • The transition does not depend on the environment response. the probability of being in one state (or generating one action) does not depend on the initial/previous state of the automaton.90 β = β2 0.50 β = β1 0.25 0. in the case of variable-structure automaton.

r .4. c . (3. Stochastic Learning Automata 35 To judge the performance of the automaton. 3. Any learning automaton must at least do better than a pure-chance automaton. there is no basis in which the different actions α i can be distinguished. i =1 i =1 r r Definition 3. E[M(n)] is the average input to the automaton. We define a quantity M(n) as the average { } penalty for a given action probability vector: M ( n) = E[ β (n ) p(n)] = Pr{β (n) = 1 p( n)} = ∑ Pr{β (n) = 1α (n ) = α i } ⋅ Pr{α (n) = α i } i =1 r r (3.e...14) . c } where 1 2 r ci = Pr β ( n) = 1 α (n) = α i .11) = ∑ ci pi ( n) i =1 For the pure-chance automaton. the action probability vector p(n) = Pr {α(n) = α i } is given by: 1 pi (n ) = i = 12 .12) E [ M (n)] = E {E[ β (n ) p(n)]} (3...Cem Ünsal Chapter 3.... even in the simplest P-model and stationary random environments.1: A learning automaton is expedient if lim E [ M (n)] < M o . all action probabilities would be equal — a “pure chance” situation. Consider a stationary random environment with penalty probabilities {c .2: A learning automaton is said to be optimal if: lim E [ M (n)] = cl n→ ∞ (3.13) = E [β (n)] i. we will consider this simplest case. Using above definitions. To introduce the definitions for “norms of behavior”. In such a case. Further definitions for other models and nonstationary environments will be given whenever necessary.10) r Such an automaton is called “pure chance automaton.. we have the following: Definition 3. n→ ∞ Since ∑ pi ( n) = 1. For an r-action automaton.” and will be used as the standard for comparison. M(n) is a constant denoted by Mo: Mo = Also note that: 1 r ∑ ci r i =1 (3. The quantitative basis for assessing the learning behavior is quite complex.. we need to set up quantitative norms of behavior. we can write inf M (n) = inf p (n ) {∑ ci pi (n )} = mini {ci } ≡ c l .1 Norms of Behavior If no prior information is available.

In such an environment.. we assume that each state 6 We will discuss the feasibility of the case later.17) In the application of learning automata techniques to intelligent control. we may be interested in automata that exhibit a desired behavior in arbitrary environments and for arbitrary initial conditions. which is defined as: Definition 3. In spite of efforts of many researchers. all pi(n) ∈ (0. It may not be possible to achieve optimality in every given situation..4: A learning automaton is absolutely expedient if: E[M(n+1) | p(n)] < M(n) (3..2 Variable Structure Automata A more flexible learning automaton model can be created by considering more general stochastic systems in which the action probabilities (or the state transitions) are updated at every stage using a reinforcement scheme.3: A learning automaton is e-optimal if: lim E [ M ( n)] = cl + ε n →∞ (3. These requirements are partially met by an absolutely expedient automaton. For simplicity. we can further see that: E [ M (n + 1)] < E[ M (n )] (3. Stochastic Learning Automata 36 Optimality implies that the action α l is associated with the minimum penalty probability cl is chosen asymptotically with probability one. Kushner72. a suboptimal behavior is defined. a learning automaton is expected to reach a “pure optimal” strategy [Najim94]: p( n) → e r (3. Taking the expectations of both c sides. r i =1 3. Definition 3. . Some automata satisfy the conditions stated in the definitions for specified initial conditions and for certain sets of penalty probabilities. In other words. Narendra89]. we will introduce the general description of a reinforcement scheme. In this case. we are mainly interested in the case where there is one “optimal” action α j with cj = 06. the automaton will be optimal since E [M (n) ]= ∑ ci pi (n) → cl . However. linear and nonlinear.4. 1) and for all possible sets { 1 . In this section. the general algorithm which ensures optimality has not been found [Baba83. c 2 .Cem Ünsal Chapter 3.15) where e is arbitrarily small positive number. cr } .16) for all n.18) j where erj is the unit vector with jth component equal to 1..

The simulation examples in later chapters also use the same identity state-output mapping. the automaton is a state-output automaton.1).e.2. If p(n+1) is a linear function of p(n). The two constraints on update functions guarantees that the sum of all action probabilities is 1 at every time step. a reinforcement scheme can be represented as follows: p( n + 1) = T1 [ p (n)..21) for all i=1. i. φ (n).. 3. otherwise it is termed nonlinear.. The simplest case Consider a variable-structure automaton with r actions in a stationary environment with β = { .2. From now on.19a) for reinforcement schemes. The linearity characteristic of a reinforcement scheme is defined by the linearity of the update functions gk and hk (see the .. A variety of linear.5 Reinforcement Schemes The reinforcement scheme is the basis of the learning process for learning automata. In general terms.α (n).19a) or β β f ij (n + 1) = T2 [ f ij (n). The general scheme for updating action probabilities is as follows: 0 If α(n) = α i when β = 0 p j (n + 1) = p j (n) − gi ( p( n)) pi ( n + 1) = pi (n) + ∑ g k ( p(n)) k =1 k ≠i r for all j ≠ i (3.r and all probabilities pk in the open interval (0. β is the input from environment.r) are continuous. These schemes are categorized based on their linearity.Cem Ünsal Chapter 3.. nonnegative functions with the following assumptions: 0 < gk ( p(n )) < pk (n) 0 < ∑ [ pk ( n) + hk ( p( n))] < 1 k =1 k ≠i r (3. φ (n + 1) β ( n)] (3. and φ is the state of the automaton. 1} ... Again. nonlinear and hybrid schemes exists for variable structure learning automata [Narendra89]. the reinforcement scheme is said to be linear. Stochastic Learning Automata 37 corresponds to one action..19b) where T1 and T2 are mappings. β (n)] (3.20) p j (n + 1) = p j (n) + hi ( p (n)) when β = 1 pi (n + 1) = pi (n) − ∑ hk ( p(n)) k =1 k ≠i r for all j ≠ i where gk and hk (k=1. we will use the first mathematical description (3. α is the action..

the scheme corresponding to general linear schemes is given as: If α(n) = α i .1 Linear Reinforcement Schemes For an r-action learning automaton.α (n). where the probabilities are updated by a function which maps the probability vector to a specific value for each element: pi (n + 1) = k i (p( n).5. The main difference between these two subclasses is that nonprojectional algorithms can only be used with binary environment response (i.25) when β = 1 As seen from the definition. the general definition of linear reinforcement schemes can be obtained by the substitution of the functions in Equation 3. • projectional algorithms. β ( n)) (3.23) where the functions k i are the mapping functions. implementation of projectional algorithms are computationally more tasking. when β = 0 p j (n + 1) = (1 − a) ⋅ p j (n) for all j ≠ i pi (n + 1) = pi (n) + a ⋅ [1 − pi (n)] b p j (n + 1) = +(1 − b ) ⋅ p j (n ) r −1 pi (n + 1) = (1 − b ) ⋅ pi (n) for all j ≠ i (3. β (n)) (3. where individual probabilities are updated based on their previous values: pi (n + 1) = pi (n) + δ ⋅ k i (pi (n ). and then close this section with absolutely expedient nonlinear schemes. If the learning parameters a and b are equal. the parameter a is associated with reward response.24) r −1 0 < a ..e.α ( n).b < 1 Therefore. Early studies of reinforcement schemes were centered around linear schemes for reasons of analytical simplicity. the scheme is . 3. and the parameter b with penalty response. We will first introduce several well-known linear schemes. Projectional algorithms may be used in all environment types and are more complex. Stochastic Learning Automata 38 examples in the next section).Cem Ünsal Chapter 3. Furthermore. It is also possible to divide all possible algorithms for stochastic learning automata into the following two classes [Najim94]: • nonprojectional algorithms. The need for more complex and efficient reinforcement schemes eventually lead researchers to nonlinear (and hybrid) schemes. in a P-model environment).22) where δ is a small real number and k i are the mapping functions for the update term.20 as follows: g k ( p( n)) = a ⋅ p k (n) b hk ( p(n)) = − b ⋅ p k (n ) (3.

the multi-action automaton using the LR-P scheme is expedient for all initial action probabilities and in all stationary random environments. a few attempts were made to study nonlinear schemes [Vor65. Stochastic Learning Automata 39 called the linear reward-penalty scheme LR-P [Bush58]. Researchers started asking the question: “What are the conditions on the updating functions that guarantee a desired behavior?” The problem viewed in the light of this approach resulted in a new concept of absolute expediency. While searching for a reinforcement scheme that guarantees optimality and/or absolute expediency. Generalization of such schemes to multi-action case was not straightforward [Narendra89].5.Cem Ünsal Chapter 3. the expected value of the probability vector at time step n. the probability of the action α l with minimum penalty probability cl. and. In this case. can be easily evaluated. For two-action automata. The update functions gk and h k are nonlinear functions of .1. Chand68. design and evaluation of schemes was almost straight forward. This means that the action probabilities are updated in the case of a reward response from the environment. For this scheme. { } 3. we have Pr lim pl (n) = 1 as close to unity as desired. the update rate of the probability vector is the same at every time step. This problem led to a synthesis approach toward reinforcement schemes. then the scheme is named the linear reward-inaction scheme LR-I . but it is not guaranteed to reach the optimal solution. it can be shown that asymptotic solution of the set of difference equations enables us to conclude [Narendra89]: lim E[ M(n)] = r k=1 n →∞ ∑ r < ∑ ck k =1 r 1 ck r = Mo (3. For the linear reward-penalty scheme LR-P . This scheme is the earliest scheme considered in mathematical psychology. Expediency is a relatively weak condition on the learning behavior of a variable-structure automaton. since actually only one action was being updated. it is possible to show that pl(n). This makes the learning n →∞ automata ε -optimal [Narendra89]. by analyzing eigenvalues of the resulting difference equation.26) Therefore. but no penalties are assessed. Evolution of early reinforcement schemes occurred mostly in a heuristic manner. Shapiro69. By making the parameter a arbitrarily small. regardless of the environment response.2 Nonlinear Learning Algorithms: Absolutely Expedient Schemes Although early studies of automata learning were done mostly on linear schemes. the parameters of the linear reinforcement scheme are changed as follows: if the learning parameter b is set to 0. researchers eventually started considering schemes that can be characterized as nonlinear. In order to obtain a better learning mechanism. monotonically approaches 1. Vis72]. E[p(n)]. An expedient automaton will do better than a pure chance automaton. from Definition 3.

The general solution for absolutely expedient schemes is found in early 1970’s by Lakshmivarahan and Thathachar [Lakshmivarahan73].Cem Ünsal Chapter 3. This class of schemes can be considered as the generalization of the LR-I scheme. [Chand68] m m g j ( p( n)) = aφ ( p1 ( n). Again. Parameters a and b must be chosen properly to satisfy the conditions in Equation 3. Stochastic Learning Automata 40 penalty probabilities.1 − p1 (n )) p j (1 − p j (n)) θ • • where φ ( p1 . For example. Absolutely expedient learning schemes are presently the only class of schemes for which necessary and sufficient conditions of design are available. Sufficient conditions for ensuring optimality is given in [Vor65]. 7 . the function for current action can be obtained by using the fact that the sum of probabilities is equal to 1.3. A learning automaton using this scheme is absolutely expedient if and only if the functions gk (p) and hk (p) satisfy the following conditions: g1 ( p) g2 ( p) g ( p) = =⋅⋅⋅= r = λ ( p) p1 p2 pr (3. cl < and c j > for j ≠ l). α (n) = α i is assumed.27) h1 ( p) h2 ( p) hr ( p) = =⋅⋅⋅ = = µ ( p) p1 p2 pr where λ(p) and µ(p) are arbitrary functions satisfying 8: Again. for expediency several conditions on the penalty 1 1 probabilities must be satisfied (e.g..21. and θ ≥ 1. 8 The reason for the conditions on the update functions will be explained in detail in Chapter 6. but they are valid only for two-action automata. the following update functions are defined by previous researchers for nonlinear reinforcement schemes7: a • g j ( p( n)) = h j ( p(n)) = p ( n)(1 − pi ( n)) r−1 i This scheme is absolutely expedient for restricted initial conditions and for restricted environments [Shapiro69].1 − p1 (n )) p θ+1 (1 − p j (n)) θ j θ +1 h j ( p(n)) = bφ ( p1 ( n).1 − p1 ) = φ (1 − p1 .…). p1 ) is a nonlinear function which can be suitably selected. g j ( p (n)) = p j ( n) − φ ( p(n )) p (n) − φ ( pi ( n)) h j ( p(n)) = i r −1 m whereφ( x ) = ax (m = 2.20. Consider the general reinforcement scheme in Equation 3. We give only the updating functions for action probabilities other than the current action.

b ∈ IR. only the normalized S-model will be considered. The average penalty M(n) is still defined as earlier. it is always possible to map the output into the unit interval using: β −a βk = k (3. Stochastic Learning Automata 41 0 < λ ( p) < 1 pj (3. all definitions for the Qmodel are similar to the P-model. b]. and therefore are neither totally favorable or totally unfavorable. . In previous sections.30) = ∑ sk pk k =1 where: sk = E[β (n) α (n) = α k ] (3. but in [a. b] for some a .1 S-Model Environments Up to this point we based our discussion of learning automata only on the P-model environment. where the input to the automaton is either 0 or 1.. Therefore. Here. A similar proof (for a new nonlinear absolutely expedient scheme) is given in Chapter 6.and Q-models also.Cem Ünsal Chapter 3.31) 9 Note that this normalization procedure requires the knowledge of a and b.r 1 − pj Detailed proof of this theorem is given in both [Baba84] and [Narendra89].6 Extensions of the Basic Automata-Environment Model 3.. Narendra and Thathachar [Narendra89] show that all the principal results derived for the P-model carry over the more realistic S. In the case where the environment outputs β k are not in the interval [0.1].1] are used in the definition: M ( n) = E[ β (n) p (n)] = ∑ E [β (n) p(n ).6. We now consider the possibility of input values over the unit interval [a. Narendra and Thathachar state that such a model is very relevant in control systems with a continuous valued performance index.1}. the main problem is to determine how the probabilities of all actions are to be updated.α (n) = α k ] ⋅ Pr[α (n) = α k ] k =1 r r (3. we will not discuss the Q-model. but instead of ci ∈{0.(and Q-) models the outputs lie in the interval [0.28) 0 < µ ( p) < min j=1.. the algorithms (for P-model) were stated in terms of the functions gk and hk which are the reward and penalty functions respectively. 3. si∈[0. Since in S.29) b−a where a = mini{β i} and b = maxi{β i}9.1]..

The direct generalization of this scheme to S-model gives: pi (n + 1) = pi ( n) − (1 − β ( n)) gi ( p(n)) + β ( n)hi ( p (n)) if α ( n) ≠ α i (3. Again.20).Cem Ünsal Chapter 3.32) The previous algorithm (Equation 3.27 and 3. ε -optimality. the SLR-P scheme has analogous properties to the LR-P scheme. Stochastic Learning Automata 42 In the S-model. The non-negativity condition on gi and hi maintains the reward-penalty character of these functions. sk plays the same role as the penalty probabilities in the P-model. the asymptotic value of the average penalty is: r lim E[ M( n)] = r n →∞ ∑ 1si k= 1 (3.21 for pi(n) to remain in the interval [0. though it is not necessary.36) Thus. optimality. for a pure-chance automaton with all the action probabilities equal.33) pi (n + 1) = pi ( n) + (1 − β ( n))∑ g j ( p(n)) − β ( n) pi ∑ h j ( p( n)) if α (n ) = α i j≠ i j ≠i 1 r ∑s r i =1 i (3. 3.2 Nonstationary Environments Thus far we have considered only the automaton in a stationary environment. except the fact that ci’s are replaced by s i’s.28 for absolute expediency also applies to updating algorithms for S-model environment.25) can be rewritten for S-model (SLR-P) as: a pi (n + 1) = pi (n) − [1 − β ( n)] ⋅ a ⋅ pi (n) + β (n) ⋅ [ − a ⋅ p( n)] if α ( n) ≠ α i r −1 (3. the average penalty Mo is given by: Mo = The definitions of expediency. consider the general reinforcement scheme (Equation 3. For example the LR-P reinforcement scheme (Equation 3.6. However.20) can easily be obtained from the above definition by substituting β( n) = 0 and 1. The value 1− β (n ) is an indication of how far β( n) is away from the maximum possible value of unity.35) pi (n + 1) = pi (n) + [1 − β ( n)] ⋅ a ⋅ [1 − pi (n )] − β ( n) ⋅ a ⋅ pi (n) if α (n) = αi Similarly. the functions gi and hi can be associated with reward and penalty respectively. and is maximum when β( n) is 0. The conditions of Equations 3. and are called penalty strengths. Again. The functions gi and hi have to satisfy the conditions for Equation 3. the need for learning and adaptation in systems is mainly due to the fact that the environment changes . It is expedient in all stationary random environments and the limiting expectations of the action probabilities are inversely proportional to the corresponding si.1] at each step. and absolute expediency are the same. Now.

analytical investigations are possible. Stochastic Learning Automata 43 with time. As in adaptive control theory. From an analytic standpoint. this method requires that the system be aware of the “environmental changes” in order to assign the “correct” automaton. If each automaton uses an absolutely expedient scheme. The aim in these cases is not to evolve to a single action which is optimal. and even non-expedient. However. optimality. and assign a different automaton Ai to each environment Ei. and absolute expediency have to be re-examined for nonstationary environments. and the number of environments must not be large. the optimal action can change with time. it may become less expedient.Cem Ünsal Chapter 3. then ε -optimality can be achieved asymptotically for the overall system. We can visualize this situation as an automaton operating in a finite set of stationary environments. The average penalty M(n) becomes: M(n) = ∑ ci (n) pi (n ) i =1 r (3. These are: w Fixed-structure automata in Markovian environments w Variable-structure automata in state-dependent nonstationary environments w Hierarchy of automata w Nonstationary environments with fixed optimal actions The fundamental concepts such as expediency.37) . The success of the learning procedure then depends on the environment changes as well as the prior information we have about the environment. then acceptable performance can be achieved in slowly changing environments. In our application. a similar approach is employed as explained in Chapter 4. the simplest case is the one in which penalty probability vector c(n) can have a finite number of values. the practical justification for the study of stationary environments is based on the assumption that if the convergence of a scheme is sufficiently fast. updating strategies may happen infrequently so that the corresponding action probabilities do not converge [Narendra89]. Secondly. we again consider the P-model for purposes of stating the definitions. If a learning automaton with a fixed strategy is used in a nonstationary environment. since the automaton Ai is updated only when the environment is Ei. Therefore. Furthermore. If we assume that the environments E i are states of a Markov chain described by a state transition matrix. There are two classes of environments that have been analytically investigated: Markovian switching environments. and state-dependent nonstationary environments. for example. An environment is nonstationary if the penalty probabilities ci (or si) vary with time. and analytical tools developed earlier are adequate to obtain a fairly complete descriptions of the learning process. since. Here. but to choose the actions to minimize the expected penalty. The performance of a learning automaton should be judged in such a context. there are four different situations wherein the states of the nonstationary environment vary with the step n. then it is possible to view the variable-structure automaton in the Markovian switching environment as a homogeneous Markov chain. The learning scheme must have sufficient flexibility to track the better actions.

then α l is the optimal action. with time-varying penalty probabilities and the new description in Equation 3. Baba discussed the problem of a variable-structure automaton operating in a multi-teacher environment [Baba85].1} (for a P-model). The action set of the automaton is of course the same for all teachers/environments..Cem Ünsal Chapter 3. it is possible to have an environment in which the actions of an automaton evoke a vector of responses due to multiple performance criteria...39) The previous concept of optimality can be applied to some specific nonstationary environments. we now have the following: (3. Then. Some difficulty may arise while formulating a mathematical model of the learning automaton in a multi-teacher environment.. the automaton is connected to N teachers (or N singleteacher environments).5: A learning automaton in nonstationary environment is expedient if there exists a no such that for all n > no. Since we have multiple responses. an automaton can be defined as optimal if it minimizes: 1 T lim ∑ E[ β ( n)] (3. the original definitions of optimality andε -optimality still hold..40) T →∞ T n =1 The definition of absolute expediency can be retained as before. and for all n (or at least for all n > n 1 for some n1). In general.6: In a nonstationary environment where there is no fixed optimal action. Then. We will describe such environments as multi-teacher environments. the task of “interpreting” the output vector is important. and a set of penalty probabilities c ij = Pr β i ( n) = 1 α ( n) = α j .. the automata have to “find” an action that satisfies all the teachers. However. if there exists a fixed l such that cl(n) < ci(n) for all i = 1.38) Definition 3. c . 3. we have: E[ M(n)] − Mo (n) < 0 (3.6. Stochastic Learning Automata 44 The pure-chance automaton has an average penalty Mo as a function of time: 1 r Mo (n ) = ∑ ci (n) r i =1 Using these descriptions.. Are the outputs from different teachers to be summed after normalization? Can we introduce weight factors associated with specific teachers? { } {c . For example. we have: Definition 3.37.. In a multi-teacher environment.2. a single-teacher environment is described by the action set α. c }where i 1 i 2 i r .3 Multi-Teacher Environments All the learning schemes and behavior norms discussed up to this point have dealt with a single automaton in a single environment. Conditions for absolute expediency are given in his works (see also Chapter 6 for new nonlinear reinforcement schemes).r. the output set β i ={0. i ≠ l.

the actions of some automata directly depend on the actions of others. the interconnection between automata is considered.6. it is possible that there are more than one automata in an environment. This is valid for Q.4 Interconnected Automata Although we based our discussion to a single automaton. An important consequence of the synchronous model is that the resulting configurations can be viewed as games of automata with particular payoff structures.Cem Ünsal Chapter 3. 3. the updating schemes are easily defined. or weighted average of the all responses [Narendra89.or S-model environments. one action is optimal for all environments.e. one decision maker controls the state transitions of the decision process. Baba85]. the case of multi-automata is not different than a single automaton case. and the goal of the whole group is to optimize some overall performance criterion. Another method is taking the average. only one automaton acts at each time step. this will not be the case in our application. However. At each step. The action chosen determines which automaton acts at the next time step. The environment reacts to the actions of multiple automata. In the sequential model. Our application uses logic gates (OR) and additional if-then conditions as described in Chapters 4 and 5. Furthermore. One possibility is to use to use an AND-gate as described in [Thathachar77]. The elements of the environment output vector must be combined in some fashion to form the input to the automaton. the other for longitudinal actions — the interdependence of these two sets of actions automatically results in an interconnected automata network. Stochastic Learning Automata 45 If the environment is of a P-model. If there is direct interaction between the automata.. such as the hierarchical (or sequential) automata models. If the interaction between different automata is provided by the environment. It is generally recognized that the potential of learning automata can be increased if specific rules for interconnections can be established. how should we combine the outputs? In the simplest case that all teachers agree on the ordering of the actions. i. There are two main interconnected automata models: synchronous and sequential. and the environment output is a result of the combined effect of actions chosen by all automata. In Chapters 4 and 5. Such a sequential model can be viewed as a network of automata in which control passes from one automaton to another. . multiple autonomous vehicles with multiple automata as path controllers constitute a more complex automata system as described in Chapter 7. Since each vehicle’s planning layer will include two automata — one for lateral.

- An Implementation Methodology for Supervisory
- 1-s2.0-S089982561300136X-main
- Application Specific Integrated Circuits Design and Implementation
- Modelling Wind Speed Parameters for Computer Generation of Wind Speed
- Ise Vhdl State Machine Process NEXYS3 SOE
- IEEE control Petri nets and 61850 revised1
- DDCA_Ch3
- Therory of Computation
- 1. FLC_Problems_Solutions.pdf
- Sipser_1.1
- Microsoft Word - Uart
- BC0052 Assignment 2
- 2005 Railway
- markovProcess
- test_uml
- Knight-essays on the Prediction Process-81
- Detailed Syllabus CS 2009 10 Main v to VIII
- The Random Trip Model - Tr-2006-26
- engineering dict.pdf
- tmpD60F
- International Journal of Distributed and Parallel Systems (IJDPS)
- The Analysis of Performance for Priority Distinction Double-queue and Double-server Communication Network
- Basic HDL Coding Tech
- Vl Sics 040307
- Quantum.46
- File
- 1554_117089
- Language Death - Modelling Bilingualism & Social Structure

Read Free for 30 Days

Cancel anytime.

Close Dialog## Are you sure?

This action might not be possible to undo. Are you sure you want to continue?

Loading