Deep Reinforcement Learning Approaches For Process Control
1 Department of Chemical and Biological Engineering, University of British Columbia, spiel@chbe.ubc.ca
2 Department of Mathematics, University of British Columbia, loew@ubc.ca

Abstract— In this work, we have extended the current success of deep learning and reinforcement learning to process control problems. We have shown that if reward hypothesis functions are formulated properly, they can be used for industrial process control. The controller setup follows the typical reinforcement learning setup, whereby an agent (controller) interacts with an environment (process) through control actions and receives a reward in discrete time steps. Deep neural networks serve as function approximators and are used to learn the control policies. Once trained, the learned network acquires a policy that maps system outputs to control actions. Though the policies are not explicitly specified, the deep neural networks were able to learn policies that differ from those of traditional controllers. We evaluated our approach on Single Input Single Output (SISO) and Multi-Input Multi-Output (MIMO) systems and tested it under various scenarios.

I. INTRODUCTION

It is common in the process industry to use controllers ranging from proportional controllers to advanced predictive controllers such as Model Predictive Controllers (MPC). However, the classical controller design procedure involves careful analysis of the process dynamics, development of an abstract mathematical model, and finally, derivation of a control law that meets certain design criteria. In contrast to the classical design process, reinforcement learning is geared towards learning appropriate closed-loop controllers by simply interacting with the process and incrementally improving control behaviour. The promise of such an approach is appealing: instead of using a time-consuming design process, the controller learns the process behaviour by interacting directly with the process. Moreover, the same underlying learning principle can be applied to a wide range of process types: linear and nonlinear systems, deterministic and stochastic systems, and single input/output and multi input/output systems. In addition, from a system identification perspective, both model identification and controller learning are performed simultaneously. This controller differs from traditional controllers in that it assumes no explicit predefined control law, but rather learns the control law from experience. With the proposed approach, the reward function indirectly serves as an objective function. The drawbacks of standard control algorithms are: (i) a complex dynamic model of the process is required; (ii) model maintenance is often very difficult; and (iii) online adaptation is rarely achieved. The proposed reinforcement learning controller is an efficient alternative to standard algorithms and it automatically allows for continuous online tuning of the controller.

Reinforcement learning (RL) has been used in process control for more than a decade [10]. However, the available algorithms do not take into account the improvements made through recent progress in the field of artificial intelligence, especially deep learning [1]. Remarkably, human-level control has been attained in games [2] and physical tasks [3] by combining deep learning and reinforcement learning [2]. Recently, these controllers have even learnt the optimal control policy of the complex game of Go [15]. However, the current literature is primarily focused on games and physical tasks. The objective of these tasks differs slightly from that of process control. In process control, the task is to take the outputs close to the set point while meeting certain constraints, whereas in games and physical tasks the objective is rather generic. For instance, in games the goal is to take actions that eventually win the game; in physical tasks, to make a robot walk or stand. This paper discusses how the success of deep reinforcement learning can be applied to process control problems. In process control, action spaces are continuous, and reinforcement learning for continuous action spaces had not been studied until [3]. This work aims at extending the ideas in [3] to process control applications. The main advantages of the proposed algorithm are as follows: (i) it does not involve deriving an explicit control law; (ii) it does not involve deriving first-principles models, as they are learnt by simply interacting with the environment; and (iii) learning the policy with deep neural networks is rather fast.

The paper is organized as follows: Section II provides background information. Section III highlights the system configuration from a reinforcement learning perspective. Section IV outlines the technical approach of the learning controller. Section V empirically verifies the effectiveness of our approach. Section VI includes discussions and extensions of our approach, followed by the conclusion in Section VII.

II. BACKGROUND

Reinforcement learning (RL) is an approach to automating goal-directed learning and decision-making. It is meant to solve problems in which an agent interacts with an environment and receives a reward signal at each time step. RL algorithms aim to find a policy, which is a mapping from states to actions, that maximizes the expected cumulative reward (value function) under that policy. The two main approaches used to achieve this goal are (1) policy-based approaches, which search directly for the optimal policy achieving maximum future reward, and (2) value-based approaches, which estimate the optimal value function and derive the policy from it.
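To make the interaction loop concrete, the sketch below spells out the agent-environment cycle described above in Python. It is illustrative only: the `process` and `policy` objects, their methods, and the episode length are assumptions of this sketch, not an interface defined in the paper.

```python
import numpy as np

def run_episode(process, policy, setpoint, n_steps=200):
    """Agent-environment loop: observe the state, act, receive a reward at
    each discrete time step. `process` and `policy` are stand-ins for the
    plant and the controller; their interfaces are assumptions of this sketch."""
    y = process.reset()                      # current process output
    total_reward = 0.0
    for _ in range(n_steps):
        state = np.array([y, setpoint])      # state = <y_t, y_set>
        action = policy(state)               # control move u_t
        y, r = process.step(action)          # plant responds, reward observed
        total_reward += r
    return total_reward
```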
IV. LEARNING

Fig. 2: Learning Overview

The overview of control and learning of the control system is shown in Fig. 2. The algorithm for learning the control policy is given in Algorithm 1. The learning algorithm is inspired by [3], with modifications to account for set-point tracking and other recent advancements discussed in [4], [5].

Algorithm 1 Learning Algorithm
1: Wa, Wc ← initialize random weights
2: initialize replay memory RM with random policies
3: for episode = 1 to E do
4:   Reset the OU process noise N
5:   Specify setpoint yset at random
6:   for step = 1 to T do
7:     s ← <yt, yset>
8:     a ← action ut = π(s, Wa) + Nt
9:     Execute action ut on the plant
10:    s' ← <yt+1, yset>, observe state at next instant
11:    r ← reward
12:    Store the tuple <s, a, s', r> in RM
13:    Sample a minibatch of n tuples from RM
14:    Compute y^i = r^i + γ Q_t^i  ∀i ∈ minibatch
15:    Update critic:
16:      Wc ← Wc + α (1/n) Σ_i (y^i − Q(s^i, a^i, Wc)) ∂Q(s^i, a^i, Wc)/∂Wc
17:    Compute ∇_p^i = ∂Q(s^i, a^i, Wc)/∂a^i
18:    Invert ∇_p^i by (5)
19:    Update actor:
20:      Wa ← Wa + α (1/n) Σ_i ∇_p^i ∂π(s^i, Wa)/∂Wa
21:    Update target actor:
22:      Wat ← τ Wa + (1 − τ) Wat
23:    Update target critic:
24:      Wct ← τ Wc + (1 − τ) Wct
25:  end for
26: end for

In order to facilitate exploration, we add noise sampled from an Ornstein-Uhlenbeck (OU) process [14], as is done in [3]. Apart from the exploration noise, the random initialization of the system output at the start of each episode ensures that the policy does not get stuck in a local optimum. An episode is terminated when it has run for 200 time steps or when it has tracked the setpoint to within a factor of ε continuously for five time steps. For systems with a larger time constant τ, more time steps per episode are preferred. At each time step the actor π(s, Wa) is queried, and the tuple <s, a, s', r> is stored in the replay memory RM. The motivation for using a replay memory is to break the correlation among the samples used for training.
< si , s0i , ai , ri > is stored in the replay memory, RM at
is updated using gradient given by,
each iteration. The motivation for using replay memory is to
break the correlated samples obtained used for training. The ∂J(Wa ) ∂Q(s, a, Wc ) ∂π(s, Wa )
=E (4)
learning algorithm uses actor-critic framework. The critic ∂Wa ∂a ∂Wa
network serves as a value estimator and it estimates the
value of current policy by Q-learning. The critic provides where ∂J(W
∂Wa
a)
is the gradient of critic with respect to the
loss function for learning the actor. The loss function of the sampled actions of mini-batches from RM. Thus, the critic
critic is given by, L(Wc ) = E (r + γQt − Q(s, a, Wc ))2 .
and actor are updated in each iteration, resulting in policy
Hence, the critic network is updated using this loss with the improvement.
gradient given by,
Target Network: Similar to [3], we use seperate target
networks for actor and critic. We freeze the target networks
∂L(Wc ) ∂Q(s, a, Wc ) and replace it with existing networks once in 500 iterations.
= E (r + γQt − Q(s, a, Wc )) (3) This makes the learning an off-policy algorithm.
∂Wc ∂Wc
where Qt = Q(s0 , π(s0 , Wat ), Wct ), denotes target values and Prioritized Experience Replay: While drawing samples
Wat , Wct are weights of target actor and critic respectively. from replay buffer, we use prioritized experienced replay
The discount factor, γ ∈ [0, 1] is the present value of future as proposed in [4]. Emprically, we found an acceleration in
rewards. A value of γ = 1 considers all rewards from future learning compared to uniform random sampling.
states whereas a value of 0 considers reward from current
state. The γ discounts rewards in future states by value of γ. Inverting Gradients: Sometimes the actor neural network,
This is a tunable parameter that makes our learning controller π(s, Wa ) produces output that exceeds the action bounds of
to check how far to look in the future. We chose γ = 0.99 the specific system. Hence, the bound on the output layer of
in our simulations. The expectation in (3) is over the mini- the actor neural network is specified by controlling gradient
batches sampled from RM . Thus the learnt value function required for actor from critic. To avoid making the actor
provides a loss function to actor and the actor updates its returning outputs that are outside action bounds, we inverted
policy in a direction that improves Q. Thus, the actor network the gradients of ∂Q(s,a,W
∂a
c)
given in [4] using the following
203
transformation,
\[
\nabla_p \leftarrow \nabla_p \cdot
\begin{cases}
(p_{\max} - p)/(p_{\max} - p_{\min}), & \text{if } \nabla_p \text{ suggests increasing } p,\\
(p - p_{\min})/(p_{\max} - p_{\min}), & \text{otherwise}
\end{cases} \tag{5}
\]

where ∇p refers to the parameterized gradient of the critic.
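As a small illustration of transformation (5), the function below rescales the critic's action gradient so that the actor is steered back towards the admissible range rather than past its bounds. The function name and the convention that a positive gradient (ascent on Q) means "increase p" are assumptions of this sketch.

```python
import numpy as np

def invert_gradients(grad_p, p, p_min, p_max):
    """Rescale the critic's action gradient per transformation (5) so the
    actor is pushed back inside [p_min, p_max] instead of saturating.
    Convention assumed here: a positive gradient (ascent on Q) increases p."""
    grad_p = np.asarray(grad_p, dtype=float)
    p = np.asarray(p, dtype=float)
    width = p_max - p_min
    scale = np.where(grad_p > 0,
                     (p_max - p) / width,   # gradient suggests increasing p
                     (p - p_min) / width)   # gradient suggests decreasing p
    return grad_p * scale
```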
Fig. 3: Learning Curve

Fig. 4: Output response of the learned policy

The output and input responses from the learned policies are illustrated in Figs. 4-7. The final policies for the learned SISO system, given in (6), are the result of 7,000 iterations; for the MIMO system, given in (7), the compute time on the same GPU was about 30 hours and required 50K tuples. The deep neural networks are built using Tensorflow [6]. The learning curve is shown in Fig. 3.

We choose the number of steps per episode to be 200 for learning the control policies. When systems with larger time constants are encountered, we usually require more steps per episode. The reward hypothesis formulation serves as a tuner to adjust the performance of the controller, how setpoints should be tracked, and so on. When implementing this approach on a real plant, effort can often be saved by warm-starting the deep neural network through prior training on a simulated model. Experience also suggests using exploration noise with higher variance in the early phase of learning, to encourage exploration.

The learned controller is also tested on various other cases, such as the response to setpoint changes and the effect of input and output noise.
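The reward formulation itself is not reproduced in this excerpt. Purely as an illustration of the kind of setpoint-tracking reward the text alludes to, one might penalize the tracking error and add a bonus inside a tolerance band; the tolerance eps and bonus c below are invented tuning knobs, not values from the paper.

```python
import numpy as np

def reward(y, y_set, eps=0.1, c=5.0):
    """Illustrative setpoint-tracking reward: penalize the absolute error and
    give a bonus once the output is within a tolerance eps of the setpoint.
    eps and c are invented tuning knobs, not values from the paper."""
    err = float(np.abs(y - y_set))
    return c if err < eps else -err
```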
Fig. 6: (a) Output response with output noise of σ² = 0.1; (b) output response with output noise of σ² = 0.3; (c) output response with input noise of σ² = 0.1; (d) input response of Fig. 6(a) with output noise of σ² = 0.1; (e) input response of Fig. 6(b) with output noise of σ² = 0.3; (f) input response of Fig. 6(c) with input noise of σ² = 0.1; (g) output response to a setpoint change with output noise of σ² = 0.1; (h) output response to a setpoint change with input and output noise of σ² = 0.1; (i) output response for setpoints unseen during training.
MIMO system responses: (a) response of output y1 with output noise of σ² = 0.01; (b) response of output y2 with output noise of σ² = 0.01.