
Engineering Applications of Artificial Intelligence 77 (2019) 98–116


A reinforcement learning approach for waterflooding optimization in petroleum reservoirs
Farzad Hourfar a,∗, Hamed Jalaly Bidgoly a, Behzad Moshiri a, Karim Salahshoor b, Ali Elkamel c,d,∗∗
a Control & Intelligent Processing Center of Excellence (CIPCE), School of Electrical & Computer Engineering, University of Tehran, Tehran, Iran
b Department of Automation & Instrumentation Engineering, Petroleum University of Technology, Ahwaz, Iran
c Department of Chemical Engineering, University of Waterloo, Ontario, Canada
d Department of Chemical Engineering, Khalifa University, The Petroleum Institute, Abu-Dhabi, United Arab Emirates

ARTICLE INFO

Keywords: Waterflooding process; Reinforcement learning; Production optimization; Closed-loop reservoir management; Derivative-free optimization

ABSTRACT

Waterflooding optimization in the closed-loop management of oil reservoirs is considered a challenging issue due to the complicated and unpredictable dynamics of the process. The main goal in waterflooding is to adjust the manipulated variables such that the total oil production, or a defined objective function strongly correlated with the financial profit, is maximized. Fortunately, owing to recent progress in computational tools and the expansion of computing facilities, the use of non-conventional optimization methods has become feasible for achieving the desired goals. In this paper, the waterflooding optimization problem is defined and formulated in the framework of Reinforcement Learning (RL), which is a derivative-free and also model-free optimization approach. This technique avoids the challenges associated with complex gradient calculations for handling the objective functions, so explicit dynamic models of the reservoir for gradient computations are not required to apply the proposed method. The developed algorithm makes it possible to achieve the desired operational targets by appropriately defining the learning problem and the necessary variables. The fundamental learning elements such as actions, states, and rewards have been delineated in both the discrete and the continuous domain. The proposed methodology has been implemented and assessed on the Egg model, which is a popular and well-known reservoir case study. Different configurations of active injection and production wells have been taken into account to simulate Single-Input-Multi-Output (SIMO) as well as Multi-Input-Multi-Output (MIMO) optimization scenarios. The results demonstrate that the ''agent'' is able to gradually, but successfully, learn the most appropriate sequence of actions tailored to each practical scenario. Consequently, the manipulated variables (actions) are set optimally to satisfy the defined production objectives, which are generally dictated by the management level or even by contractual obligations. Moreover, it has been shown that by properly adjusting the rewarding policies in the learning process, diverse forms of multi-objective optimization problems can be formulated, analyzed and solved.

1. Introduction

It is estimated that the world energy demand will reach about 300 million barrels per day of petroleum equivalent in 2030, which implies a 50% increase compared to the current demand (Colton, 2011; World energy demand and economic outlook, 2016). Since hydrocarbon-based resources will satisfy around 60% of the mentioned global energy needs, the importance of research on different issues related to hydrocarbon production enhancement is inevitable.

Furthermore, the available oil and gas reservoirs are mostly in the maturity phase, and the development of new fields is considered a troublesome task. As a result, applying reservoir management techniques, including Enhanced/Improved Oil Recovery (EOR/IOR) methods, to raise the production efficiency of reservoirs under operation is an appropriate response to the challenges of a shortage of adequate energy resources (Srivastava et al., 1994; Sarma, 2006).

Hydrocarbon assets are discovered in subsurface reservoirs with appropriate geological structure. After performing the necessary assessments regarding the economic profitability of a specific reservoir, the existing oil and gas can be drained through drilled wells which connect the reservoir to the surface. During the production phase, the existing fluids in the reservoir, including gas, oil and water, enter the wells

∗ Corresponding author at: School of Electrical & Computer Engineering, University of Tehran, Tehran, Iran.
∗∗ Corresponding author at: Department of Chemical Engineering, University of Waterloo, Ontario, Canada.
E-mail addresses: f.hourfar@ut.ac.ir (F. Hourfar), aelkamel@uwaterloo.ca (A. Elkamel).

https://doi.org/10.1016/j.engappai.2018.09.019
Received 24 May 2017; Received in revised form 7 February 2018; Accepted 28 September 2018
Available online xxxx
0952-1976/© 2018 Published by Elsevier Ltd.

and afterwards flow to the downstream facilities. Generally, the production schedule of a reservoir can be divided into different horizons, varying from daily to life-cycle production planning, both upstream and downstream (Foss et al., 2015).

Unfortunately, at the end of the natural depletion phase, in which oil is produced by the natural reservoir drive mechanisms, more than 60% of the hydrocarbon still remains intact inside the reservoir. The main goal in reservoir management is to increase the amount of recovered hydrocarbon. To this aim, upstream production optimization techniques have attracted much attention in recent years. During production, EOR/IOR techniques are therefore generally implemented to increase the recovery factor at the end of the primary recovery period (Bai et al., 2008; Zendehboudi et al., 2011a,b; Guo et al., 2012; Shafiei et al., 2013; Asadollahi et al., 2014).

Based on recent technological advancements in upstream field instruments and measuring devices, Closed-Loop Reservoir Management (CLRM) approaches have become popular and attractive solutions to increase the oil recovery factor or a defined financial objective function such as the Net Present Value (NPV) of a specific hydrocarbon reservoir (Capolei et al., 2013; Jansen et al., 2008; Foss et al., 2011; Van den Hof et al., 2012; Foss, 2012).

In other words, production optimization is the ultimate goal in closed-loop reservoir management. In production optimization, using updated reservoir model(s), the optimal well controls are calculated such that the NPV of the production, or the hydrocarbon recovery over the lifecycle of the reservoir, is maximized (Forouzanfar et al., 2013; Nævdal et al., 2006; Wang et al., 2009; Foss et al., 2015). Dynamic production optimization in hydrocarbon reservoirs via waterflooding is one of the most popular topics in reservoir engineering. The main objective in the waterflooding process is to maximize the economic return of the field by optimally controlling the injection/production rates (Horowitz et al., 2013).

Waterflooding optimization is a demanding and challenging task in most reservoirs (Sincock et al., 1988). In hydrocarbon fields, a higher production rate is a trivial solution to decrease the payback period. However, uncontrolled and high injection/production rates may cause early water breakthrough and will increase the watercut values of the producing wells, while reducing the total oil recovery. In other words, there may exist specific scenarios for which a full-rate injection policy is the optimal solution, but this strategy is not always the best one for all types of hydrocarbon reservoirs or all kinds of production plans. Therefore, optimizing the well injection/production rates is among the most effective methods to improve the waterflooding performance and to increase the recovery factor of a reservoir (Xu et al., 2013). In the waterflooding process, water is injected into a reservoir to sweep the existing hydrocarbon towards the producing wells, increasing the oil recovery at the end of the natural depletion stage. Clearly, the geological heterogeneity of the reservoir can decrease the efficiency of the waterflooding process. However, recent advances in downhole equipment and instruments have increased the sweep efficiency in the geological layers of a reservoir through appropriate adjustment of the production/injection rates. In other words, by applying a properly designed waterflooding technique, the oil production of a field can be increased with the available infrastructure, just by modifying the injection/production rates of the wells, which are known as the well controls. In addition, simulation-based reservoir management is the first step to improve the performance of a reservoir. In this phase, the waterflooding process is simulated in a valid simulation environment and the optimization objectives are pursued. Generally, the production optimization problem is specified by considering the geological model of the reservoir in the simulator and by defining an appropriate recovery mechanism, which can be the ''waterflooding process'' (Wen et al., 2014). Furthermore, computational progress facilitates modeling, simulation and optimization of large-scale systems – such as oil fields – within an acceptable time cycle (Chen et al., 2009). As a result, by using valid updated reservoir model(s), efficient closed-loop reservoir management strategies are applicable for production optimization objectives. The ultimate goal of the process is to maximize an objective function such as the NPV, subject to operational and economic constraints. To this aim, different approaches have been developed in the literature as production optimization solutions for flow rate adjustment of injection and production wells.

The general configuration of CLRM consists of two main parts: (1) model-based data assimilation, which acts as the reservoir parameter and state estimator, and (2) a model-based optimizer. As mentioned above, the task of the optimizer is to maximize the oil recovery factor or another desired economic criterion such as the NPV. The required inputs for the optimizer may be the injection and production data, the hydrocarbon price, the predicted interest rate, and the operating costs. Based on the mentioned information and the estimated reservoir model, the optimizer calculates the optimal values of the manipulated variables, which are normally selected as water injection and bottom hole pressure (bhp) trajectories. As updated measurements become available, the optimization process can be repeated. Data assimilation techniques provide the facilities to estimate the required parameters and states for the reservoir modeling phase. The obtained model is regularly used by model-based optimization algorithms for computation of the optimal trajectories at each time step (Capolei et al., 2013; Forouzanfar et al., 2013).

In the production optimization problem, handling the inherent nonlinearity of the recovery process is one of the most challenging parts. In general, nonlinear optimization techniques applied to oil reservoirs are categorized into gradient-based and derivative-free subdivisions. The former family utilizes derivative information of the defined objective functions and formulated constraints (Luenberger et al., 2008). For example, ensemble-based optimization methods (Dehdari et al., 2011) and optimization based on Quadratic Interpolation Models (Zhao et al., 2011) are two common approaches for solving the production optimization problem. In addition, a set of efficient gradient-based methods for performing optimization over a reservoir uses the adjoint technique to compute the required derivatives of the objective. However, the necessity of access to the reservoir simulator source code, as well as the computational cost of derivative approximations, are challenging concerns in practice (Horowitz et al., 2013; Wen et al., 2014; Jansen, 2011).

Reduced-order models (He et al., 2011), and equalizing the water breakthrough of producing wells based on the time-of-flight concept in streamline simulators (Datta-Gupta et al., 2010), are other alternatives for tackling waterflooding optimization problems. Furthermore, using the reservoir simulator as a black box and solving the optimization problem with data-driven techniques is another approach for the optimization purpose. Pattern recognition-based methods (Asadollahi et al., 2009) and surrogate/proxy modeling methods (CMOST users guide, 2012; Hourfar et al., 2016, 2017, 2018) are among these approaches.

It is obvious that derivative-free optimization methods do not need gradient information (Ciaurri et al., 2011; Kramer et al., 2011). Generalized Pattern Search (GPS) (Audet and Dennis Jr, 2002) and Mesh Adaptive Direct Search (MADS) (Audet and Dennis Jr, 2006) are examples of local-search derivative-free optimization techniques. On the other hand, evolutionary techniques such as Genetic Algorithms (GA) (Golberg, 1989) and Particle Swarm Optimization (PSO) (Souza et al., 2010) are popular examples of global-search methods, in which the optimization space is explored much more comprehensively in comparison with local methods.

Besides the possibility of finding the global solution of the optimization problem, another main advantage of derivative-free optimization methods is the capacity to be implemented in a parallel processing configuration, which reduces the elapsed computational time and increases the efficiency of the algorithms (Wen et al., 2014).

In this paper a derivative-free approach based on ''Reinforcement Learning'' (RL) for optimizing the waterflooding process in oil reservoirs has been introduced and developed.


To the best of the authors' knowledge, the presented contribution is the first attempt at formulating and solving production optimization as an RL problem. Several different practical production scenarios, for both short-term and long-term planning, have been considered, and it has been observed that the RL-based optimization methodology is successful in handling the studied cases by proposing optimal injection-rate profiles appropriate for each of the defined scenarios.

In addition, one of the main goals in most engineering problems – categorized as multi-objective optimization problems – is to explore for the operating conditions that can optimize a set of multiple, even conflicting, objectives with specific criteria (Isebor and Durlofsky, 2014; Rao and Rao, 2009). Utilization of the presented algorithm makes it possible to consider and solve the challenging multi-objective optimization problem in oil reservoirs based on the RL technique. Moreover, this ability makes the proposed technique a powerful strategy to be applied in different types of field development contracts (Shakhsi-Niaei et al., 2014; Ghandi and Lin, 2012; Zhao et al., 2012), to manage the profiles of hydrocarbon production and ensure a fair profit sharing between all parties.

2. Reservoir mechanistic model description

In commercial simulators, a set of Partial Differential Equations (PDEs), which originate in the mass and momentum conservation laws, should be solved to obtain the unknown state variables of a hydrocarbon reservoir (Jansen et al., 2008; Aziz and Settari, 1979). The mass balance equation of a typical 2-phase reservoir (including oil and water) can be written as:

\nabla \cdot (\rho_i u_i) + \frac{\partial}{\partial t}(\phi \rho_i S_i) = 0; \quad i = o, w,   (1)

in which t is the time notation, \nabla\cdot is the divergence operator, φ is the porosity, ρ_i is the density of phase i, u_i is the superficial velocity, and S_i is the saturation, which is the proportion of the pore space occupied by phase i, where o and w denote oil and water, respectively. In the reservoir literature, porosity is defined as the fraction of the reservoir rock which can be occupied by the existing fluids.

Conservation of momentum is obtained from the Navier–Stokes equations. By ignoring the effect of gravity, the simplified equation can be obtained using the semi-empirical Darcy equation for low-velocity flow through porous materials as follows:

u_i = -k \frac{k_{ri}}{\mu_i} \nabla p_i, \quad i = o, w,   (2)

where p_i is the pressure of phase i, k is the absolute permeability, k_{ri} is the relative permeability and μ_i is the viscosity of phase i. The permeability k is an inverse measure of the resistance a fluid encounters while flowing in a porous medium. In other words, the permeability is the ability of the rock to transmit fluids through its pore spaces. The relative permeability k_{ri} relates to the additional resistance that phase i experiences when other phases are present, due to differences in viscosity. Since the relationship between the relative permeabilities k_{ro} and k_{rw} and the water saturation S_w is fully nonlinear, the reservoir model behaves like a strongly nonlinear system.

Substituting (2) into (1) results in 2 equations with 4 unknowns, which are p_o, p_w, S_o and S_w. To complete the system description, two additional equations are required.

The first one is the following trivial summation:

S_o + S_w = 1.   (3)

The second equation is related to the capillary pressure:

p_{cow} = p_o - p_w = f_{cow}(S_w).   (4)

Substituting (3) and (4) into the flow equations, and considering the oil pressure p_o and water saturation S_w as the primary state variables, results in:

\nabla \cdot (\tilde{\lambda}_o \nabla p_o) = \frac{\partial}{\partial t}\left(\phi \rho_o [1 - S_w]\right),   (5)

\nabla \cdot \left(\tilde{\lambda}_w \nabla p_o - \tilde{\lambda}_w \frac{\partial p_{cow}}{\partial S_w} \nabla S_w\right) = \frac{\partial}{\partial t}(\phi \rho_w S_w),   (6)

where \tilde{\lambda}_o = k \frac{k_{ro}}{\mu_o} and \tilde{\lambda}_w = k \frac{k_{rw}}{\mu_w} are called the oil and water mobilities, respectively. Flow equations (5) and (6) are defined over the entire volume of the reservoir. It is supposed that there is no flow across the boundaries of the reservoir (Neumann boundary conditions). A common approach for numerically solving the above equations is discretization in the time and space domains. This results in a system built up of a finite number of grids, named grid blocks. The discretized equations are as follows:

V(x_k)\, x_{k+1} = T(x_k)\, x_k + q_k, \quad x_0 = \bar{x}_0,   (7)

where k is the time index, x is the state vector consisting of the oil pressure, p_o, and the water saturation, S_w, in all grids, and x_0 is the initial conditions vector. The influence of the wells on the reservoir is modeled in Eq. (7) by a source vector called q_k:

q_k^j = w^j \,(p_{bh,k}^j - p_k^j),   (8)

in which p_{bh,k}^j is the well's bottom hole pressure, j is the index of the grid block containing the well and p_k^j is the pressure of the grid block in which the well is located. In addition, w^j is a constant value which captures the effects of the well's geometry and the rock and fluid properties of the reservoir in the neighborhood of the well.

In solving the obtained equations, the geological properties of each grid block are supposed to be constant. However, the heterogeneity characteristics of the reservoir can be represented by assigning different property values to each block. Generally, a large number of grid blocks should be defined to describe the dynamic behavior of a real hydrocarbon reservoir. In most professional simulators, the described equations are solved for all grids at all time steps, which clearly results in a massive computational load.

3. Formulation of waterflooding optimization problem

As mentioned in Section 1, in the ''waterflooding process'', water is injected into an oil reservoir to increase the oil production or a defined objective function such as the NPV. Mostly, in reservoir applications, the NPV, as the objective of the dynamic optimization, is mathematically formulated in the following format (Siraj et al., 2016):

J = \sum_{k=1}^{N_t} \left[ \frac{c_o\, q_o^k - c_w\, q_w^k - c_{inj}\, q_{inj}^k}{(1+b)^{t_k/\tau_t}} \, \Delta t_k \right],   (9)

where the terms q_o^k, q_w^k and q_{inj}^k represent the total flow rates of produced oil, produced water and injected water at time step k, respectively. In addition, c_o, c_w and c_{inj} are the oil price, the water production cost and the water injection cost, respectively. N_t is the notation used for the production life-cycle and Δt_k is the time interval of time step k. Finally, the term b is the discount rate for a certain reference time, τ_t.

The ultimate goal of the optimization problem is to maximize the value of J by properly controlling the well flow rates, while considering the reservoir internal dynamics as well as the operational constraints.

A re-statement of Eq. (9), in which the effect of each individual injection and production well on the objective function can be explicitly observed, is (Forouzanfar et al., 2013):

J = \sum_{k=1}^{N_t} \left[ \sum_{j=1}^{N_{prd}} \left(c_o\, q_{o,j}^k - c_w\, q_{w,j}^k\right) - \sum_{i=1}^{N_{inj}} \left(c_{w,inj}\, q_{winj,i}^k\right) \right] \frac{\Delta t_k}{(1+b)^{t_k/365}},   (10)

where N_inj and N_prd are the numbers of injection and production wells, respectively. N_t represents the number of simulation time steps. Δt_k is the length of the kth time step, usually in days. In addition, q_{o,j}^k and q_{w,j}^k are the average oil and water production rates, in STB/Day¹ or m³/Day, of the jth producer over the kth simulation time step.

¹ Standard Barrel per Day.
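As an illustration of how the objective in Eq. (10) is evaluated for a candidate injection/production schedule, the following Python sketch computes the discounted NPV from per-well rate profiles. It is only a minimal illustration: the array shapes, the helper name npv, and the example rates are assumptions made here, not part of the authors' implementation (the paper's own economic parameters appear later in Table 2).

```python
import numpy as np

def npv(q_o, q_w, q_winj, dt, c_o, c_w, c_winj, b=0.0):
    """Discounted NPV of a waterflooding schedule, following Eq. (10).

    q_o, q_w : (N_t, N_prd) arrays of average oil / water production rates per step
    q_winj   : (N_t, N_inj) array of average water injection rates per step
    dt       : (N_t,) array of time-step lengths in days
    c_o, c_w, c_winj : oil price, water disposal cost, water injection cost ($ per unit volume)
    b        : annual discount rate (the paper later assumes b = 0)
    """
    t_k = np.cumsum(dt)                                        # elapsed time at the end of each step (days)
    revenue = c_o * q_o.sum(axis=1) - c_w * q_w.sum(axis=1)    # producers' net cash-flow rate
    injection_cost = c_winj * q_winj.sum(axis=1)               # injectors' cost rate
    discount = (1.0 + b) ** (t_k / 365.0)                      # annual discounting of each step
    return float(np.sum((revenue - injection_cost) * dt / discount))

# Illustrative usage with arbitrary numbers (not taken from the paper):
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N_t, N_prd, N_inj = 12, 4, 8
    dt = np.full(N_t, 30.0)                              # twelve 30-day control steps
    q_o = rng.uniform(50, 200, size=(N_t, N_prd))        # STB/day
    q_w = rng.uniform(0, 100, size=(N_t, N_prd))
    q_winj = rng.uniform(0, 400, size=(N_t, N_inj))
    print(npv(q_o, q_w, q_winj, dt, c_o=80.0, c_w=10.0, c_winj=5.0, b=0.0))
```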


Furthermore, q_{winj,i}^k is the notation for the average injection rate of the ith injection well over the kth simulation time step. Moreover, c_o, c_w and c_{w,inj} are the oil price, the produced water disposal cost and the water injection cost, respectively, all per unit volume, which means in $/STB or $/m³. Finally, b is the annual discount rate.

Consequently, the vector of well controls that should be optimized is:

u = \left[ q_{winj,1}^1, \ldots, q_{winj,1}^{N_{cs}}, q_{winj,2}^1, \ldots, q_{winj,N_{inj}}^{N_{cs}}, \ldots, q_{o,1}^1, \ldots, q_{o,1}^{N_{cs}}, q_{o,2}^1, \ldots, q_{o,N_{prd}}^{N_{cs}} \right]^T,   (11)

which implies that the oil production rates and the water injection rates at all control steps are considered to be optimally adjusted. In (11), N_cs is the notation for the number of optimization control steps, which can be different from the number of simulation time steps, N_t.

Eq. (11) is the general form of the control inputs. However, from a systematic and also operational point of view, the oil and water production rates of the producing wells (considered as the system outputs) can be determined based on the internal dynamics of the reservoir (considered as the system), and the water injection rates (considered as the system inputs). So, it is a reasonable assumption to explore only for the optimal water injection trajectories as the control inputs, u, to maximize J at specified bhp's of the producing wells. Now, the reservoir optimization problem can be generally formulated as:

\max_{u} \; J[u],   (12)

subjected to:

\begin{cases} \dot{x} = f(x,u), \\ y = g(x,u), \end{cases}   (13)

and,

\begin{cases} u^{min} \le u \le u^{max}, \\ \sum u \le U. \end{cases}   (14)

Eq. (13) is an expression for the reservoir dynamics by which the influences of the system inputs (e.g. water injection flowrates and production wells' bhp's) on the reservoir states and outputs are represented. This information can be provided by valid simulators, based on the explanations in Section 2. In addition, (14) is a typical representation of the operational constraints, such as the minimum and maximum water injection rates, as well as the upper limit of the accumulative injection during the life-cycle or even at each time step.

4. Reinforcement learning approach

Reinforcement learning is an unsupervised optimization method, inspired by behaviorist psychology, to find the best control strategies to achieve the desired objectives and also to maximize the defined benefits and rewards in a dynamic environment. In reinforcement learning, a learner or agent learns on its own, step by step, where to go and what to do by iteratively trying different possible actions in each situation. The actions affect the environment in which the agent lives and cause both an immediate response and future reactions of the environment. By sensing these responses, the agent gradually discovers which action maximizes the immediate and subsequent rewards in each state. The feature of trial-and-error interaction with the environment provides the facility to optimize the problem without exact knowledge of the model of the environment. Indeed, this feature separates the reinforcement learning method from other standard supervised learning approaches, in which correct input/output pairs are presented (Sutton and Barto, 1998). In addition, using this technique allows the defined problem to be optimized both in online and offline regimes. Moreover, this method is also capable of incorporating the expert's knowledge in different situations. Despite these major positive characteristics, reinforcement learning methods suffer from limitations such as delayed reward assignment, the trade-off between exploration and exploitation, and the curse of dimensionality (Heidrich-Meisner et al., 2007). Some actions' effects may appear several steps later, and assigning these delayed rewards to the corresponding actions can be somewhat problematic, especially in model-free optimization. In addition, an inappropriate trade-off between exploring the environment and exploiting the obtained knowledge may lead to either a sub-optimal solution or a slow convergence rate. Finally, the curse of dimensionality is the main drawback of reinforcement learning methods in large-scale problems. By increasing the dimensions of the problem, the number of state–action pairs grows exponentially, so visiting all state–action pairs is almost impossible for the agent. This phenomenon especially occurs in continuous problems. Nevertheless, we try to handle these issues in this paper.

Fig. 1. Reinforcement learning framework (Sutton and Barto, 1998).

The basic reinforcement learning model consists of the following elements:

1. a set of environment states, s_t ∈ \mathcal{S};
2. a set of actions, a_t ∈ \mathcal{A};
3. a policy π, which maps the states to the actions;
4. rules of transitioning between the states;
5. a reward function, r, which determines the scalar immediate and delayed reward of a transition;
6. a value function of states, V(s_t), and a value function of state–action pairs, Q(s_t, a_t); and
7. rules to update the values of states and state–action pairs.

The environment is modeled as a stochastic finite state machine, and the state of the model, s_t, determines the status of the agent/learner in the environment. States should be defined in such a way that they are completely distinct while no redundant or trivializing data is used. The set of states, \mathcal{S}, usually includes an initial state and a goal state as well. In each state, there is a set of possible and meaningful actions, \mathcal{A}, from which the action a_t is selected according to a pre-defined adaptive policy π. Each action changes the status of the agent/learner and makes it move to another state. The transition rule between the states arises from the dynamics of the environment, which is usually unknown and nondeterministic. It should be mentioned that it is not necessary for the agent to know the exact model of the environment. It is sufficient to sense the current state as well as the effects of the previous actions, which have already been selected in the previous states. The effects of actions are evaluated by the reward or reinforcement signal, r, with respect to the specified optimization objectives or the goal states of the learning process. These effects can be observed immediately, determined as the immediate reward, or some steps later or even at the end of the learning process, known as the delayed reward. In the long term, the set of obtained rewards specifies the value of each state–action pair Q(s,a). The highest value of the state–action pairs in each state determines the value of the state, V(s). The ultimate goal is to learn a policy that maximizes the values of the states, including the goal state. Fig. 1 shows the general framework of reinforcement learning. For further studies, interested readers may refer to Kaelbling et al. (1996), Harmon and Harmon (1997), Sutton and Barto (1998), Gosavi (2009), Heidrich-Meisner et al. (2007) and Khan et al. (2012).
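The seven elements listed above interact through the standard agent–environment loop of Fig. 1. The Python sketch below is only a schematic of one learning trial under the episodic termination idea described in this section; the DummyEnv class, its method names (reset, step, delayed_reward) and the tabular update are illustrative assumptions, not the authors' simulator-based implementation.

```python
class DummyEnv:
    """Toy stand-in for the reservoir simulator, used only to exercise the loop."""
    def __init__(self, horizon=5):
        self.horizon, self.t = horizon, 0
    def reset(self):
        self.t = 0
        return 0
    def step(self, action):
        self.t += 1
        reward = float(action)            # pretend larger actions pay more now
        done = self.t >= self.horizon     # fixed-length "life-cycle" in this toy case
        return self.t, reward, done
    def delayed_reward(self, trajectory):
        return 0.0                        # no long-term penalty in this toy case

def run_episode(env, select_action, update, max_steps=1000):
    """One trial of the generic RL loop: state -> action -> reward -> next state."""
    trajectory, state = [], env.reset()
    for _ in range(max_steps):
        action = select_action(state)     # policy pi maps states to actions
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state
        if done:                          # terminal / goal state, or step budget spent
            break
    r_delayed = env.delayed_reward(trajectory)   # only available at episode end
    for s, a, r in trajectory:
        update(s, a, r, r_delayed)               # push rewards back to visited pairs
    return trajectory

if __name__ == "__main__":
    Q = {}
    def select_action(s):                 # placeholder exploratory policy
        return 1
    def update(s, a, r, rd):
        Q[(s, a)] = Q.get((s, a), 0.0) + r + rd
    run_episode(DummyEnv(), select_action, update)
    print(Q)
```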


The basic features of reinforcement learning make it a suitable approach for a wide range of problems, from large-scale ones such as airline management (GOSAVII et al., 2002; Sherali et al., 2010; Collins and Thomas, 2013), electricity marketing (Nanduri, 2011; Hajimiri et al., 2014), bank marketing (Sánchez et al., 2015) or traffic control (Walraven et al., 2016), to simple dynamic cases like controlling a bouncing cart (Bucak and Zohdy, 1999), ball screw driving (Fernandez-Gauna et al., 2014), running gait optimization (Bidgoly et al., 2010) and helicopter flight (Abbeel et al., 2007), or even to static problems such as the Othello board game (van Eck and van Wezel, 2004).

In this paper, we apply the reinforcement learning methodology to optimize the waterflooding process in oil reservoirs. Since the dynamics of subsurface reservoirs are highly nonlinear and complex, it is not easy to apply classic optimization methods to this problem. Instead, reinforcement learning is an appropriate candidate to be utilized, due to its model-free characteristic. However, the classic reinforcement learning method is discrete both in states and actions, while the reservoir recovery problem is completely continuous. Fortunately, there are some extensions of reinforcement learning for continuous problems. Here, we start from the classic discrete approach in Section 5 and then generalize the method to the continuous problem, as the more practical solution, in Section 6. This is mainly done to initially shape the structure of the waterflooding optimization problem with the reinforcement learning concepts in the simplest way, and then to reorganize the structure to adapt to continuous conditions. This strategy helps to evade the inherent complexity of the problem, which originates from simultaneously formulating the reinforcement learning-based optimization issue and generalizing its basic form in the continuous space framework for a hydrocarbon reservoir.

5. Waterflooding optimization based on RL: Discrete approach

As mentioned in the previous parts, the main goal of the waterflooding process is maximizing the amount of extracted oil with minimum cost, which results in maximizing the total profit at the end of the depletion stage. So, the reinforcement learning problem and the reward signal should be defined in such a way that they lead to achieving the mentioned objective. To this aim, each element of the basic form of the reinforcement learning approach is defined in discrete space in this section, subject to the practical concerns.

5.1. The states set

In a real oil reservoir, the actual dynamics of the system are not available. So, the state of the system should be estimated based on the existing observations, such as the water injection flow rates, the oil/water production flow rates, and also a valid approximation of the residual oil in the reservoir. Among the mentioned variables, the total injection and production rates are redundant data, according to the ''voidage replacement assumption''. That assumption implies that the total injection rate and the total production rate are equal during the operation. In addition, the injection rates are normally adjusted by the operator or an automatic control system. Indeed, they should be considered as the inputs or actions of the system. On the other hand, the states of the system can be defined as the estimate of the remaining oil, in percentage, in the reservoir, together with the watercut values, which are the water percentages in the total produced fluid of each production well. Therefore, the states can be written as:

s_t = [\theta, \tilde{q}_{w,1}, \ldots, \tilde{q}_{w,N_{prd}}],   (15)

where θ is an indicator, in percentage, of the currently available oil in the reservoir, \tilde{q}_{w,i} = \frac{q_{w,i}}{q_{w,i}+q_{o,i}} \times 100 is the watercut of the ith production well, and N_prd is the number of active production wells. Clearly, the dimension of the state space is N_prd + 1. The initial value of the oil percentage, θ_I, should be estimated by a field expert. Then, for the state values during the operation, θ is determined according to the total amount of produced oil. The recovery procedure is continued as long as it is profitable, regarding the oil price, the water injection cost and the water disposal cost. Therefore, θ is always limited between the initial percentage of oil in the reservoir and the percentage of remaining oil at the point where oil extraction becomes non-economic:

\theta_F \le \theta \le \theta_I,   (16)

where the value of θ_I should be adjusted by an expert, based on the reservoir characteristics. This range is discretized into n_θ parts, so θ can take the following values:

\theta \in \{\theta_F, \theta_F + \Delta_\theta, \ldots, \theta_F + (n_\theta - 1)\Delta_\theta, \theta_I\},   (17)

where \Delta_\theta = \frac{\theta_I - \theta_F}{n_\theta}.

The watercut of each production well, \tilde{q}_{w,i}, is a local indicator of the remaining oil, in percentage, in different parts of the reservoir. A higher value of \tilde{q}_{w,i} implies that the amount of recoverable oil around the ith production well is decreasing. Hence, the water injection rates of the injection wells in the neighborhood of the ith production well should be reduced. The initial value of \tilde{q}_{w,i} is usually supposed to be zero, which means that there is no water in the production well at the beginning of the waterflooding process. This value can theoretically reach 100 percent at the end of the extraction period. Therefore, \tilde{q}_{w,i} can take the following values:

\tilde{q}_{w,i} \in \{0, \Delta_w, \ldots, (n_w - 1)\Delta_w, 100\},   (18)

in which \Delta_w = \frac{100}{n_w}, and n_w is the number of discretized values of \tilde{q}_w.

A reasonable assumption during oil production is that if the watercut value of any production well violates an economically defined threshold, that well will be closed. This threshold is determined based on the oil price and the water disposal cost:

c_o q_{o,i} - c_w q_{w,i} > \varepsilon > 0,   (19)

where ε is the minimum acceptable profit of each individual production well.

In a typical reinforcement learning problem, the agent explores the whole environment until it reaches a terminal state or a goal state, or until the number of permissible moves exceeds the determined limit. At that point, the ongoing trial of learning is stopped and a new trial should start. However, in the reservoir optimization problem, the final state is not deterministic. Here, each trial of the learning process is terminated whenever either all production wells are closed due to exceeding the watercut upper-limit threshold, or the simulation time of the learning process exceeds the scheduled lifecycle of the oil reservoir. Indeed, we have a set of probable conditions in which the waterflooding process may be stopped.

5.2. The actions set

In the oil recovery process, the water injection rates are the most common manipulated variables to control the economic return of the field. Therefore, in the same manner, the water injection rates of the injection wells can be defined as the action vector, i.e.:

a_t = [q_{inj,1}, \ldots, q_{inj,N_{inj}}],   (20)

where q_{inj,i} is the water injection rate of the ith injection well, and N_inj is the number of active injection wells. As mentioned in (14), the discretized water injection rate can take the following values:

q_{inj,i} \in \{0, \Delta_{inj}, \ldots, (n_{inj} - 1)\Delta_{inj}, q_{inj}^{max}\},   (21)

where q_{inj}^{max} is the maximum capacity of the injection rate, n_inj is the number of discretized values of q_inj, and \Delta_{inj} = \frac{q_{inj}^{max}}{n_{inj}}. In this problem, the set of actions, \mathcal{A}, is uniform for all states and has (n_{inj})^{N_{inj}} distinct members.
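A minimal Python sketch of the discretization in Eqs. (17), (18) and (21), together with the economic shut-in test of Eq. (19), is given below. The grid sizes and prices reuse values quoted later in Tables 2 and 3, while the measured rates in the example calls are made up for illustration; the helper names are assumptions, not the authors' code.

```python
import numpy as np

def state_action_grids(theta_F, theta_I, n_theta, n_w, q_inj_max, n_inj_levels):
    """Discretization of Eqs. (17), (18) and (21): grids for theta, watercut and injection rate."""
    theta_grid = np.linspace(theta_F, theta_I, n_theta + 1)       # theta_F, theta_F+d, ..., theta_I
    watercut_grid = np.linspace(0.0, 100.0, n_w + 1)              # 0, dw, ..., 100 (percent)
    q_inj_grid = np.linspace(0.0, q_inj_max, n_inj_levels + 1)    # 0, dinj, ..., q_inj_max
    return theta_grid, watercut_grid, q_inj_grid

def nearest_discrete_state(theta, watercuts, theta_grid, watercut_grid):
    """Map a measured (theta, watercut_1..N_prd) point to the nearest grid state (cf. Section 5.4)."""
    i_theta = int(np.argmin(np.abs(theta_grid - theta)))
    i_wc = [int(np.argmin(np.abs(watercut_grid - w))) for w in watercuts]
    return (i_theta, *i_wc)

def producer_stays_open(q_o_i, q_w_i, c_o, c_w, eps):
    """Economic shut-in rule of Eq. (19): keep the well open only while it remains profitable."""
    return c_o * q_o_i - c_w * q_w_i > eps

# Example with the parameter values quoted later in Tables 2-3 (the rates are invented):
theta_grid, wc_grid, q_grid = state_action_grids(40.0, 90.0, 5, 10, 2515.8, 20)
print(nearest_discrete_state(72.3, [18.0, 4.5, 61.0, 33.0], theta_grid, wc_grid))
print(producer_stays_open(q_o_i=120.0, q_w_i=300.0, c_o=80.0, c_w=10.0, eps=0.0))
```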


5.3. Policy

At the beginning of the interaction with the environment, the agent/learner has no clear sense of the system and does not know how to behave in order to maximize the rewards. Therefore, it should explore the environment in the initial trials, namely episodes, to obtain the required knowledge. On the other hand, in the last episodes, the agent has sufficient experience to exploit the obtained knowledge. Any action-selection policy should provide this characteristic of the exploration/exploitation trade-off. To achieve this, the ''Boltzmann Distribution'' can be used as the action selection rule:

p(a_t \mid s_t) = \frac{\exp(Q(s_t, a_t)/\tau)}{\sum_{a'_t \in \mathcal{A}'_{(s_t)}} \exp(Q(s_t, a'_t)/\tau)},   (22)

where τ > 0 is called the temperature parameter, p(a_t|s_t) is the probability of selecting action a_t in state s_t, and \mathcal{A}'_{(s_t)} is an available subset of the feasible actions \mathcal{A}_{(s_t)} in state s_t. \mathcal{A}'_{(s_t)} is determined with regard to the action a_{t-1}, which was selected in the previous state, s_{t-1}. In real applications, it is not practical to apply an arbitrary sequence of input actions to the system, and the variations of the input values should be limited. As a result, the selected action in state s_t should lie in a neighborhood of action a_{t-1}, denoted by the set \mathcal{A}'_{(s_t)}, rather than being any action from \mathcal{A}_{(s_t)}. Here, the exploration/exploitation feature is controlled with the temperature parameter. A high temperature leads to more exploration, where the action selection is more random. A low temperature results in more exploitation, where the action selection is more greedy; it means that the action with the highest value or, correspondingly, with the highest return reward, is selected. In addition, the temperature is updated in each episode of the learning process with the following updating rule:

\tau \leftarrow \frac{\tau}{1+\zeta},

where 0 < ζ ≪ 1 is the learning rate. It is obvious that at the end of the learning process the policy is completely greedy and, in each state, the action with the highest state–action value is selected. The parameter ζ determines the transition from exploration of the environment to exploitation of the obtained knowledge. ζ should be tuned softly, in such a way that the agent has enough exploration and exploitation time within the total number of learning episodes.

5.4. The transition rule

The transition rule is mainly the result of the dynamics of the system (see (13) for the general form) and usually is unknown or must be estimated. However, from a reservoir engineering point of view, it is known that the waterflooding process decreases the remaining oil in the reservoir, while it increases the watercut values of the production wells. It should be clarified that exact transitions between discrete states are not always practical in real systems with nonlinear dynamics. In other words, fixed pre-defined actions may not be able to cause the state of a nonlinear system to transit from one discrete value to another discrete one. Consequently, in the discrete approach, the real status of the system should be estimated by the nearest available discretized state in the state space.

5.5. The reward

The reward should be specified in such a way that its maximization is equivalent to the optimization of the main problem. Since maximizing the total profit of oil recovery is the ultimate goal of waterflooding, a scaled instant NPV can be used as an appropriate candidate for the instant reward:

r_t = K \left( c_o \sum_{i=1}^{N_{prd}} q_{o,i}^t - c_w \sum_{i=1}^{N_{prd}} q_{w,i}^t - c_{inj} \sum_{i=1}^{N_{inj}} q_{inj,i}^t \right) \Delta t,   (23)

where q_{o,i}^t and q_{w,i}^t are the amounts of oil and water produced from the ith production well in state s_t, respectively. In addition, c_o, c_w and c_inj are the oil price, the water disposal cost and the water injection cost in Dollars per barrel, respectively. Those values are specified by the market conditions and also by the operating costs. Furthermore, K is a positive constant, known as the scaling factor, and Δt is the optimization time step.

Moreover, the delayed reward can be defined according to different life-cycle production schedulings. Here, we introduce two different schedules:

Scheduling A. The instant NPV remains constant during the total period of oil production;

Scheduling B. The instant NPV remains positive over a pre-defined period of oil production.

It should be noted again that the delayed reward, which is the response of the system to action a_t, is revealed in later steps. In other words, the observed reward in state s_{t+k+1} is the result of the instant response of the system to action a_{t+k}, performed in state s_{t+k}, and also of the delayed response of the system to action a_t, performed in state s_t. Therefore, the delayed reward of an action naturally emerges in a sequence of rewards obtained some steps later. Then, the set of rewards will be propagated to earlier states according to an updating rule. Similarly, in the reservoir optimization problem the selected actions may affect the profile of oil production and change the total life-cycle of the reservoir according to the system dynamics. Consequently, the delayed rewards can only be calculated at the end of the waterflooding process. In this paper, the terminology of ''delayed reward'' is generally used to distinguish between long-term and short-term schedules.

In scheduling A, it is important that the oil production profile not only leads to the maximum possible overall NPV, but also fixes the profit at a constant value during the production, regardless of the total production time. On the other hand, in scheduling B, it is crucial that the profit remains economic for a specified operation period, while maximizing the overall NPV of the reservoir, regardless of how large the instant NPV is.

For scheduling A, the delayed reward is defined as:

r_t^d = \begin{cases} -\hat{K}_{(N_e)} (M - r_t)^4, & \text{if } r_t > M \\ 0, & \text{otherwise} \end{cases}   (24)

where M = mean(r_t) over the time of oil production, \hat{K}_{(N_e)} is a positive scaling function, and N_e is the current episode number. The idea of this definition is simple: any action that leads to an instant reward higher than the average value should be punished.

In scheduling B, we intend to achieve an economic production over a specified period. Therefore, any injection profile by which the watercut values of all the producing wells exceed the acceptable thresholds before the desired termination time should be punished. To this aim, the delayed reward is defined as:

r_t^d = \begin{cases} -\bar{K}_{(N_e)} \gamma^t (t - T^-)^2, & \text{if } t < T^- \\ 0, & \text{otherwise} \end{cases}   (25)

where T^- is the first time at which the instant NPV becomes negative, \bar{K}_{(N_e)} is a positive scaling function and γ is the discount factor.

Both \hat{K}_{(N_e)} and \bar{K}_{(N_e)} are functions of the number of learning episodes in each epoch. They are defined in such a way that in the initial episodes their values, and consequently r_t^d, are negligible, while in the last episodes their values become significant. For instance, a good candidate for adjusting the value of \hat{K}_{(N_e)}, based on the mentioned policy, is:

\hat{K}_{(N_e)} \propto \left( \frac{2 N_e}{N_{ET}} \right)^2,   (26)

where N_ET is the total number of episodes in each epoch. Obviously, this adjustment policy helps the algorithm to find the actions with the highest instant reward at the initial steps and to gradually modify the action values by considering the long-term scheduling.
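The action-selection rule of Eq. (22) and the temperature decay translate directly into a short Python sketch. The dictionary-based value store and the example numbers below are assumptions made for illustration only, not the authors' implementation.

```python
import math
import random

def boltzmann_select(q_values, feasible_actions, tau):
    """Soft-max (Boltzmann) action selection of Eq. (22) over a feasible action subset."""
    prefs = [math.exp(q_values.get(a, 0.0) / tau) for a in feasible_actions]
    total = sum(prefs)
    r, acc = random.random() * total, 0.0
    for action, p in zip(feasible_actions, prefs):
        acc += p
        if acc >= r:
            return action
    return feasible_actions[-1]

def cool_temperature(tau, zeta):
    """Per-episode temperature update tau <- tau / (1 + zeta), with 0 < zeta << 1."""
    return tau / (1.0 + zeta)

# Illustrative use: three neighbouring injection-rate levels around the previous action.
Q_s = {100.0: 1.2, 200.0: 2.0, 300.0: 0.4}   # state-action values (made-up numbers)
tau = 1.0
for episode in range(3):
    a = boltzmann_select(Q_s, feasible_actions=[100.0, 200.0, 300.0], tau=tau)
    print(f"episode {episode}: tau={tau:.3f}, selected action={a}")
    tau = cool_temperature(tau, zeta=0.02)
```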

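Similarly, the instant reward of Eq. (23) and the delayed rewards of Eqs. (24)–(26) can be sketched as short functions. The sketch below reuses the prices and scaling constants listed later in Table 2; the flow rates and times in the example calls are arbitrary, and the function names are choices made here rather than the paper's code.

```python
import numpy as np

def instant_reward(q_o, q_w, q_inj, c_o, c_w, c_inj, K, dt):
    """Instant reward of Eq. (23): a scaled instant NPV over one optimization step."""
    return K * (c_o * np.sum(q_o) - c_w * np.sum(q_w) - c_inj * np.sum(q_inj)) * dt

def delayed_reward_A(r_t, mean_r, K_hat):
    """Scheduling A penalty, Eq. (24): punish instant rewards above the production-period mean."""
    return -K_hat * (mean_r - r_t) ** 4 if r_t > mean_r else 0.0

def delayed_reward_B(t, T_neg, K_bar, gamma):
    """Scheduling B penalty, Eq. (25), reproduced with the condition as stated in the text."""
    return -K_bar * (gamma ** t) * (t - T_neg) ** 2 if t < T_neg else 0.0

def K_hat_schedule(N_e, N_ET):
    """Episode-dependent scaling of Eq. (26): negligible early on, significant in late episodes."""
    return (2.0 * N_e / N_ET) ** 2

# Tiny illustration with arbitrary rates (prices and constants as in Table 2):
r = instant_reward(q_o=[150, 90], q_w=[30, 60], q_inj=[200, 180, 160],
                   c_o=80.0, c_w=10.0, c_inj=5.0, K=0.002, dt=30.0)
print(r, delayed_reward_A(r, mean_r=0.9 * r, K_hat=K_hat_schedule(450, 500)))
print(delayed_reward_B(t=20, T_neg=24, K_bar=0.04, gamma=0.995))
```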

Moreover, it simplifies the definition of the delayed reward functions, in the sense that the ''punishment'' needs to be defined for only one condition in (24) or (25). In scheduling A, the delayed reward is calculated only for the case r_t > M. The other case, r_t ≤ M, is caused either by an inappropriate action or by the occurrence of water saturation in the production wells. However, by choosing the above adjustment policy, we ensure that the action with the highest instant reward has already been learned before the value of \hat{K} becomes significant. As a result, r_t ≤ M only occurs when the production wells are water-saturated. The saturation is the result of the previous water injection profile rather than of the current injection rate value. Therefore, it is sufficient to punish only the sequences of actions which result in r_t > M. A similar rewarding policy is also considered for scheduling B. Again, only the sequences of actions which cause all the production wells to produce water above the tolerable value before the desired final time are punished. In general, it can be clarified that only the actions that cause undesired conditions at the end of the waterflooding process are considered in the delayed reward computation.

5.6. Updating rule

Different reinforcement learning methods are mainly distinguished from each other by their updating rules, which affect characteristics such as learning speed, convergence rate, stability and the optimal obtained rewards. For example, while most of the classical learning techniques, such as Monte Carlo or Temporal Differences, are considered reward-averaging methods, the differences in the averaging procedures of the various methods cause dissimilar features as well as different levels of computational complexity (Sutton and Barto, 1998). In this paper, the basic form of the averaging method has been selected, to prevent the complex dynamics of the oil reservoir from overshadowing the efficiency of the proposed optimization procedure. In this method, the value of the current state–action pair is updated as follows:

\begin{cases} Q(s_t, a_t) \leftarrow \dfrac{v_{(s_t,a_t)} Q(s_t, a_t) + r_t + r_t^d}{v_{(s_t,a_t)} + 1}, \\ v_{(s_t,a_t)} \leftarrow v_{(s_t,a_t)} + 1, \end{cases}   (27)

where v_{(s_t,a_t)} is the total number of visits to state s_t in which action a_t has been selected. The instant reward, r_t, is computed immediately using (23), when action a_t is performed in state s_t and the system transits to state s_{t+1}. The delayed reward, r_t^d, is calculated at the end of the waterflooding process according to the desired scheduling A or B, by using (24) or (25), respectively. For any other long-term scheduling, r_t^d should be defined correspondingly.

6. Waterflooding optimization based on RL: Continuous approach

Despite the basic concepts of the reinforcement learning method, the nature of the oil reservoir production optimization problem is continuous in both states and actions. A trivial solution for such continuous problems is to use the basic method while decreasing the length of the discretization step or, equivalently, increasing the number of states and actions. Unfortunately, this method leads to slow convergence as well as the curse of dimensionality. Another acceptable alternative for coping with this kind of problem is to apply value function approximation. In this approach, some additional discrete state indexes are defined, instead of just considering the discrete states. Then, the value of any continuous state or state–action pair is estimated by fusing the values of the neighboring indexes. Based on the type of approximator, this approach can be classified into different categories, such as: coarse coding function approximators (Sutton, 1996; Doya, 2000), gradient-based function approximators (Precup and Sutton, 1997; Barreto and Anderson, 2008), and fuzzy inference learning (Jouffe, 1998; Derhami et al., 2008).

Among these approximators, Fuzzy Inference Systems offer powerful features such as knowledge representation and uncertainty compensation. In this paper, the Takagi and Sugeno fuzzy model (Takagi and Sugeno, 1985) has been applied as the value function approximator. This model is described by n-input, m-output fuzzy if-then rules with the following format:

R_i: \textbf{if } x_1 \text{ is } L_{i1} \text{ and } \ldots \text{ and } x_n \text{ is } L_{in} \textbf{ then } y \text{ is } \{a^1_{i1}, \ldots, a^1_{im}\} \text{ with value } Q^1_i \text{ or } \ldots \text{ or } y \text{ is } \{a^k_{i1}, \ldots, a^k_{im}\} \text{ with value } Q^k_i,   (28)

where x = {x_1, …, x_n} is the input and y is the output of the system. The if-then rules represent local input–output relations of a nonlinear system, by which the nonlinear dynamics of the process are properly expressed using a set of local linear models (Mehran, 2008). In the fuzzy learning procedure, for rule R_i, the vector x is equivalent to the continuous state of the system, s_t, the vector L_i = {L_{i1}, …, L_{in}} contains discrete state indexes, and the vector y is the action of the system, a_t. Similar to the discrete approach, there are some feasible actions with state–action value Q^j_i in each rule. The set of feasible actions is determined based on the operational constraints, as well as on expert knowledge.

In the continuous learning approach, the main characteristics of the reinforcement learning problem, such as the basics of action selection and the updating rule, are similar to the discrete version. However, the calculation techniques are somewhat different. In the following subsections, the required modifications for applying the continuous learning methodology to optimize the waterflooding process are described.

6.1. The states set

Although the definition of the state space is identical to the discrete version, normalized convex fuzzy sets are used rather than the set of discrete numbers. Similarly, the state space is composed as follows:

s_t = [\theta, \tilde{q}_{w,1}, \ldots, \tilde{q}_{w,N_{prd}}],   (29)

in which θ, \tilde{q}_{w,i} and N_prd are defined as in (15). However, in contrast to the discrete version, they can take any limited continuous value under the following conditions:

\begin{cases} \theta \in \{x \mid \theta_F \le x \le \theta_I\}, \\ \tilde{q}_{w,i} \in \{x \mid 0 \le x \le 100\}. \end{cases}   (30)

To cover the above ranges, multiple (N_prd+1)-order convex fuzzy sets are defined. These fuzzy sets are the input part of the if-then fuzzy rules in (28). The centers of the fuzzy sets, L^j_i, are the discrete state indexes. Here, the members of the discrete state space, already defined in Section 5.1, are used as the state indexes. So, similar to the number of states in the discrete approach, we will have n_θ × (n_w)^{N_prd} distinct convex fuzzy sets. The firing rate of each fuzzy set, or correspondingly each fuzzy rule, is calculated by the following membership function:

\mu_i(s_t) = \exp\left( -\frac{(\theta - L_i)^2}{2\Delta_\theta} - \sum_{j=1}^{N_{prd}} \frac{(\tilde{q}_{w,j} - L'_{ij})^2}{2\Delta_w} \right),   (31)

where

\begin{cases} L_i \in \{\theta_F, \theta_F + \Delta_\theta, \ldots, \theta_F + (n_\theta - 1)\Delta_\theta, \theta_I\}, \\ L'_{ij} \in \{0, \Delta_w, \ldots, (n_w - 1)\Delta_w, 100\}. \end{cases}

Other issues, such as the initial and final states, the discretization steps and the production-well closing conditions, are the same as in the discrete approach. It should be noted that by appropriately selecting the values of n_θ and n_w, one can compromise between the accuracy of the value function approximation and the available computational resources. Large values of n result in more state indexes and, consequently, a more accurate estimation of the continuous model. However, too many states lead to the curse of dimensionality as well as a slow convergence rate.
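For concreteness, the averaging update of Eq. (27), which keeps a visit counter for every state–action pair, can be sketched as follows. The AveragingTable class name and the tuple-typed state are illustrative choices made here, not taken from the paper.

```python
from collections import defaultdict

class AveragingTable:
    """Tabular state-action values updated with the averaging rule of Eq. (27)."""
    def __init__(self):
        self.Q = defaultdict(float)      # state-action value estimates
        self.visits = defaultdict(int)   # visit counters v_(s,a)

    def update(self, state, action, r_instant, r_delayed=0.0):
        key = (state, action)
        v = self.visits[key]
        # Running average of the total (instant + delayed) reward received for this pair.
        self.Q[key] = (v * self.Q[key] + r_instant + r_delayed) / (v + 1)
        self.visits[key] = v + 1
        return self.Q[key]

# Example: the same state-action pair visited three times with different rewards.
table = AveragingTable()
for r in (4.0, 6.0, 8.0):
    print(table.update(state=(2, 1, 0), action=200.0, r_instant=r))
```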

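The fuzzy-set centres of Section 6.1 and the firing rate of Eq. (31) can likewise be sketched in a few lines. The helper names and the single-producer example are assumptions made for brevity; the grid sizes follow the values later listed in Table 2.

```python
import itertools
import numpy as np

def rule_centres(theta_F, theta_I, n_theta, n_w, n_prd):
    """Cartesian product of the discrete state indexes used as fuzzy-set centres (Section 6.1)."""
    theta_idx = np.linspace(theta_F, theta_I, n_theta + 1)
    wc_idx = np.linspace(0.0, 100.0, n_w + 1)
    return list(itertools.product(theta_idx, *([wc_idx] * n_prd)))

def firing_rate(state, centre, d_theta, d_w):
    """Gaussian-like membership of Eq. (31) for a continuous state [theta, wc_1..wc_Nprd]."""
    theta, watercuts = state[0], np.asarray(state[1:])
    c_theta, c_wc = centre[0], np.asarray(centre[1:])
    expo = (theta - c_theta) ** 2 / (2.0 * d_theta) + np.sum((watercuts - c_wc) ** 2 / (2.0 * d_w))
    return float(np.exp(-expo))

# Illustration with the grid sizes of Table 2 and a single producer (for brevity):
centres = rule_centres(theta_F=40.0, theta_I=90.0, n_theta=5, n_w=10, n_prd=1)
s = [72.3, 35.0]                                       # continuous state: theta and one watercut
d_theta, d_w = (90.0 - 40.0) / 5, 100.0 / 10
mu = np.array([firing_rate(s, c, d_theta, d_w) for c in centres])
mu_hat = mu / mu.sum()                                 # normalized firing rates
print(len(centres), round(float(mu_hat.max()), 4))
```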

6.2. The actions set 6.4. The transition rule

In the continuous learning framework, the definition of actions and Since the transitions are results of system dynamics, the learning
action sets are as the discrete approach. It means the water injection problem’s type does not affect them. In other words, the transition rule
rates of injection wells are again assumed to be the actions: both in continuous or discrete approach is identical. However, as the
state space is continuous, aggregation of some fuzzy sets are utilized to
𝑎𝑡 = [𝑞𝑖𝑛𝑗,1 , … , 𝑞𝑖𝑛𝑗,𝑁𝑖𝑛𝑗 ], (32) represent the real status of the system, instead of doing estimation by
just using the nearest state.
with the same variables in (20).
Despite the continuous type of the problem, the action values can 6.5. The reward
still be considered as discrete numbers as follows:
Reward signal calculation is also independent of the type of ap-
𝑞𝑖𝑛𝑗,𝑖 ∈ {0, 𝛥𝑖𝑛𝑗 , … , (𝑛𝑖𝑛𝑗 − 1)𝛥𝑖𝑛𝑗 , 𝑞𝑖𝑛𝑗𝑚𝑎𝑥 }, (33)
proach. Therefore, for continuous version both instant and delayed
where again 𝑞𝑖𝑛𝑗𝑚𝑎𝑥 is the maximum value of injection rate, 𝑛𝑖𝑛𝑗 is the rewards can be computed using the formulations in Section 5.5.
𝑞𝑖𝑛𝑗 𝑚𝑎𝑥
number of discretized values of 𝑞𝑖𝑛𝑗 , and 𝛥𝑖𝑛𝑗 = . This relaxation
𝑛𝑖𝑛𝑗 6.6. Updating rule
in considering the actions as discrete numbers is possible due to the
model of the fuzzy learning. In fuzzy learning methodology, the current Similar to the discrete version, the basic form of averaging method
state may activate multiple fuzzy rules in which the best action may not is used to update the rule–action value. However, the updating rule is
be identical. The final action, which is the injection rate in our problem, modified to adapt to the fuzzy learning and the continuous state–action
is the aggregation of the best actions of each fired rule. As a result, spaces. In comparison to the discrete approach, three major differences
based of the selected aggregation technique, it can have any continuous exist in the continuous framework. First, values of several fuzzy rule–
value in the range of 0 ≤ 𝑞𝑖𝑛𝑗 ≤ 𝑞𝑖𝑛𝑗𝑚𝑎𝑥 ; even though the best proposed action pairs, not just one state–action pair, are updated proportional to
actions of the active rules are completely discrete values. This feature of their firing rates. Second, in each fired rule, the action which its value is
fuzzy learning approach provides the facility to use a set of actions with updated, may be totally different from the selected action. Third, more
discrete values, even for continuous problems. Here, we use weighted than one action may be updated in each fired rule, due to the continuous
summation as the aggregation tool. So, the final applied action, 𝑎𝑡 , is value of the final aggregated action. For more clarification, just assume
obtained as follows: that the action is a scalar, not a vector. Now, suppose that 𝑎∗𝑖 = 𝑎𝑗𝑖 is
∑ the selected action in the 𝑖th rule and 𝑎𝑡 is the aggregated action of
𝑎𝑡 = 𝜇̂ 𝑖 𝑎∗𝑖 , (34) all fired rules. For such a case, the value of 𝑎𝑗𝑖 is updated only if it
𝑖 is in the neighborhood of 𝑎𝑡 . This is the interpretation of the second
where 𝑎∗𝑖 is the best action from fired rule 𝑖th, selected due to a policy difference. Otherwise, the nearest actions to 𝑎𝑡 are updated, which is
which will be described in the next subsection. Furthermore, 𝜇̂ 𝑖 is the indeed the third difference. In other words, the last difference means
normalized firing rate: that if the value of the final aggregated action is in the neighborhood
𝜇 of two discrete actions; i.e. 𝑎𝑘 ≤ 𝑎𝑡 ≤ 𝑎𝑘+1 where 𝑎𝑘 , 𝑎𝑘+1 ∈ , then the
𝜇̂ 𝑖 = ∑ 𝑖 . (35) values of both actions are updated proportional to their proximity to 𝑎𝑡 .
𝑗 𝜇𝑗
In the general case, the proximity of discrete actions are measured by
It is obvious that the final action can take any continuous value in the following function:
the interval of min𝑖 {𝑎∗𝑖 } and max𝑖 {𝑎∗𝑖 }. Finally, it should be noted that ( ‖𝑎 − 𝑎𝑗 ‖2 )
𝑡 2
the actions are the output part of if-then fuzzy rules, defined in (28). 𝜂(𝑎𝑗 ) = exp − , (37)
For each rule, there are (𝑛𝑖𝑛𝑗 )𝑁𝑖𝑛𝑗 possible actions similar to the discrete 2𝛥𝑖𝑛𝑗

approach, which has been described in Section 5.2. where ‖ ‖2 is Euclidean Norm. To reduce the computational load,
the proximity values which are less than a threshold, 𝜂𝑗 < 0.005, are
considered to be zero. Now, for any fired rule 𝑅𝑖 with 𝜇𝑖 > 0, the values
6.3. Policy of rule–action pairs for actions with 𝜂𝑗 > 0 can be updated as follows:

⎧ 𝑗 𝑗 𝑗
(𝑣 )𝑄 + (𝜇̂ 𝜂̂ )(𝑟 +𝑟𝑑 )
In the continuous approach, rather than just focusing on the cur- ⎪ 𝑄𝑖 ⟵ 𝑖 𝑖 𝑗 𝑖 𝑗 𝑡 𝑡 , ∀𝑖, 𝜇𝑖 > 0 and ∀𝑗, 𝜂𝑗 > 0,
rent state, an action should be selected for any available fuzzy set ⎨ 𝑣𝑖 +𝜇̂ 𝑖 𝜂̂𝑗 (38)
in the neighborhood of the current state. In addition, the explo- ⎪ 𝑣𝑗𝑖 ⟵ 𝑣𝑗𝑖 + 𝜇̂ 𝑖 𝜂̂𝑗 ,

ration/exploitation trade-off is a necessary characteristic for all types of
learning based optimizations problems. To fulfill these aims, a modified where 𝑣𝑗𝑖 is the summation of fuzzy value of firing rule 𝑖th and the
calculated value of proximity to action 𝑎𝑗𝑖 using (37). Furthermore, 𝜂̂𝑗 is
version of Boltzmann Distribution (Derhami et al., 2008) has been
the normalized proximity value:
utilized in this paper as the selection rule:
𝜂𝑗
exp(𝜇̂ 𝑖 𝑄𝑗𝑖 ∕𝜏) 𝜂̂𝑗 = ∑ . (39)
𝑝(𝑎𝑗𝑖 |𝑅𝑖 (𝑠𝑡 )) = ∑ , (36) 𝑘 𝜂𝑘
𝑘∈′ exp(𝜇̂ 𝑖 𝑄𝑘𝑖 ∕𝜏)
In the continuous approach, the value-updating procedure also has to operate over the continuous state and action spaces. In comparison to the discrete approach, three major differences exist in the continuous framework. First, the values of several fuzzy rule–action pairs, not just one state–action pair, are updated in proportion to their firing rates. Second, in each fired rule, the action whose value is updated may be totally different from the selected action. Third, more than one action may be updated in each fired rule, due to the continuous value of the final aggregated action. For more clarification, assume that the action is a scalar, not a vector. Now, suppose that $a^*_i = a^j_i$ is the selected action in the $i$th rule and $a_t$ is the aggregated action of all fired rules. In this case, the value of $a^j_i$ is updated only if it lies in the neighborhood of $a_t$; this is the interpretation of the second difference. Otherwise, the nearest actions to $a_t$ are updated, which is indeed the third difference. In other words, the last difference means that if the value of the final aggregated action lies in the neighborhood of two discrete actions, i.e. $a_k \le a_t \le a_{k+1}$ where $a_k, a_{k+1} \in \mathcal{A}$, then the values of both actions are updated in proportion to their proximity to $a_t$.

In the general case, the proximity of a discrete action is measured by the following function:

$$\eta(a_j) = \exp\left(-\frac{\lVert a_t - a_j \rVert_2^2}{2\Delta_{inj}}\right), \qquad (37)$$

where $\lVert \cdot \rVert_2$ is the Euclidean norm. To reduce the computational load, proximity values smaller than a threshold, $\eta_j < 0.005$, are considered to be zero. Now, for any fired rule $R_i$ with $\mu_i > 0$, the values of the rule–action pairs of the actions with $\eta_j > 0$ are updated as follows:

$$Q^j_i \leftarrow \frac{v^j_i\, Q^j_i + \hat{\mu}_i \hat{\eta}_j\,(r_t + r^d_t)}{v^j_i + \hat{\mu}_i \hat{\eta}_j}, \qquad v^j_i \leftarrow v^j_i + \hat{\mu}_i \hat{\eta}_j, \qquad \forall i:\ \mu_i > 0 \ \ \text{and} \ \ \forall j:\ \eta_j > 0, \qquad (38)$$

where $v^j_i$ is the accumulated sum of the firing value of rule $i$ weighted by the proximity to action $a^j_i$ computed from (37), and $\hat{\eta}_j$ is the normalized proximity value:

$$\hat{\eta}_j = \frac{\eta_j}{\sum_k \eta_k}. \qquad (39)$$
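Under the same assumptions, the proximity-weighted update of Eqs. (37)–(39) for a scalar action could be sketched as follows; Q_i and v_i denote the per-rule arrays of rule–action values and their accumulated weights, and the function name is ours.

```python
import numpy as np

def update_rule_values(Q_i, v_i, actions, a_t, mu_hat_i, reward, delta_inj, eps=0.005):
    """Proximity-weighted update of a fired rule, Eqs. (37)-(39).

    Q_i, v_i : value and weight arrays of rule i (modified in place)
    actions  : discrete action levels of the rule
    a_t      : final aggregated (continuous) action actually applied
    mu_hat_i : normalized firing rate of rule i
    reward   : instant plus delayed reward, r_t + r^d_t
    """
    a = np.asarray(actions, dtype=float)
    eta = np.exp(-(a_t - a) ** 2 / (2.0 * delta_inj))   # proximity, Eq. (37)
    eta[eta < eps] = 0.0                                 # drop negligible proximities
    if eta.sum() == 0.0:
        return
    eta_hat = eta / eta.sum()                            # normalized proximity, Eq. (39)
    for j in np.nonzero(eta_hat)[0]:
        w = mu_hat_i * eta_hat[j]
        Q_i[j] = (v_i[j] * Q_i[j] + w * reward) / (v_i[j] + w)   # Eq. (38)
        v_i[j] = v_i[j] + w
```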


7. Algorithm implementation and simulation results

The developed RL-based optimization algorithm has been implemented using the MATLAB Reservoir Simulation Toolbox (MRST) (Lie, 2014). MRST provides the capability of controlling and optimizing the waterflooding process by properly manipulating the reservoir inputs. Without loss of generality of the developed method, it is assumed that the production wells operate at fixed bottom-hole pressures (BHPs), based on operational recommendations. In addition, the annual discount rate, $b$, is set to 0. Furthermore, to keep the reservoir pressure constant and prevent over-pressurization, the total injection rate of the injection wells is required to equal the total production rate of the producing wells:

$$\sum_{i=1}^{N_{inj}} q_{inj,i} = \sum_{i=1}^{N_{prd}} q_{prd,i}, \qquad (40)$$

where $q_{inj,i}$ is the flow rate of each injection well and $q_{prd,i}$ is the total flow rate of each producing well, while $N_{inj}$ and $N_{prd}$ are the numbers of injection and production wells, respectively.
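As a simple illustration, one way to impose the balance of Eq. (40) on a candidate injection profile is to rescale it to the current total production rate; this normalization is only a sketch and is not claimed to be how the constraint is enforced in the reported implementation.

```python
import numpy as np

def balance_injection(q_inj_candidate, q_prd):
    """Rescale injection rates so that total injection equals total production, Eq. (40)."""
    q_inj = np.asarray(q_inj_candidate, dtype=float)
    total_prd = float(np.sum(q_prd))
    if q_inj.sum() == 0.0:
        return np.full_like(q_inj, total_prd / q_inj.size)  # fall back to an even split
    return q_inj * (total_prd / q_inj.sum())

# Example: four injectors balanced against four producers
q_inj = balance_injection([100.0, 200.0, 150.0, 50.0], [120.0, 130.0, 110.0, 140.0])
assert abs(q_inj.sum() - 500.0) < 1e-9
```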
The developed algorithm is applied to the well-known Egg-Model (see Fig. 2) for waterflooding process optimization in different practical and operational scenarios. The geological characteristics of the standard Egg-Model, as well as other parameters such as injection and production well locations and initial settings, are available in Jansen et al. (2014) and Van Essen et al. (2011). Table 1 lists several key geological and fluid properties of the standard Egg-Model, which are generally used by researchers in a validated simulator environment. In the simulation phase, different scenarios regarding the number of active injection wells in the standard Egg-Model have been studied. Consequently, production optimization problems for both a Single-Input-Multi-Output (SIMO) system (one injection well and four production wells) and a Multi-Input-Multi-Output (MIMO) system (four injection wells and four production wells) have been defined and solved. The SIMO learning problem is denoted as scenario I and the MIMO one as scenario II.

Fig. 2. Standard Egg model reservoir with 8 injection wells (blue) and 4 production wells (red) (Jansen et al., 2014). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Table 1
Geological and fluid properties of the standard Egg-Model (Van Essen et al., 2011).
Property | Value (unit)
φ | 0.2
ρ_o | 900 (kg/m³ at 1 bar)
ρ_w | 1000 (kg/m³ at 1 bar)
μ_o | 5 × 10⁻³ (Pa s)
μ_w | 1 × 10⁻³ (Pa s)
p_cow | 0 (bar)

7.1. Scenario I: SIMO optimization

In this scenario, only one of the injection wells is active during the waterflooding process, while oil can be extracted from all of the existing production wells. Here, injection well no. 4 is selected as the operating well, owing to its critical location in the reservoir. We study the following four strategies in this scenario:

• short-term scheduling;
• combination of short-term scheduling and long-term scheduling A;
• combination of short-term scheduling and long-term scheduling B; and
• combination of short-term scheduling, long-term scheduling A and long-term scheduling B.

In the short-term strategy, only maximizing the value of the instant reward is of concern. In the three other strategies, the ultimate objective is to maximize the instant reward while simultaneously meeting the expectations encoded in the specified delayed rewards. For example, the optimal policy in the second strategy leads to instant NPVs that are the maximum feasible constant values during the operation.
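To make the learning setup concrete, a heavily simplified episode loop is sketched below. The reservoir simulator (MRST in this work) is treated as a black box behind a hypothetical env object, the agent stands for the fuzzy RL learner of Section 6, and the instant reward is written in the standard revenue-minus-cost form using the prices of Table 2, the 30-day control interval and zero discount rate; the exact NPV expression of Eq. (19) is not reproduced here, so this is an illustrative sketch rather than the authors' implementation.

```python
def instant_npv(q_oil, q_water, q_inj, dt_days=30.0,
                c_o=80.0, c_w=10.0, c_inj=5.0):
    """Instant NPV of one 30-day control interval with zero discount rate (b = 0).

    q_oil, q_water : total produced oil / water rates (STB/day)
    q_inj          : total water injection rate (STB/day)
    c_o, c_w, c_inj: oil price, water production cost, injection cost ($/STB), Table 2
    """
    return (c_o * q_oil - c_w * q_water - c_inj * q_inj) * dt_days


def run_episode(env, agent, n_steps):
    """One learning episode of the short-term strategy (hypothetical interfaces).

    env   : black-box reservoir simulator wrapper (e.g. MRST driven externally)
    agent : fuzzy RL agent selecting and updating actions as in Eqs. (34)-(38)
    """
    state = env.reset()
    accumulative_reward = 0.0
    for _ in range(n_steps):
        action = agent.select_action(state)             # injection-rate vector
        q_oil, q_water, state, done = env.step(action)  # advance the simulator 30 days
        reward = instant_npv(q_oil, q_water, sum(action))
        agent.update(reward)                            # proximity-weighted update
        accumulative_reward += reward
        if done:                                        # e.g. production no longer economic
            break
    return accumulative_reward
```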

In the following parts, the objective of each of the above strategies is described in more detail, and the obtained results are presented and analyzed. By default, the reported graphs are averages of results over twenty epochs, unless explicitly stated otherwise; each epoch contains 500 episodes. Table 2 introduces the learning parameters common to both the SIMO and MIMO scenarios, and Table 3 gives the specific learning parameters used in the SIMO scenario.

Table 2
Parameters of the learning process in both scenarios.
Parameter | Value (unit) | Parameter | Value (unit)
c_w | 10 ($/STB) | K | 0.002
c_o | 80 ($/STB) | K̂ | 0.004
c_inj | 5 ($/STB) | K̄ | 0.04
n_θ | 5 | θ_F | 40%
n_w | 10 | θ_I | 90%
γ | 0.995 | b | 0
Δt | 30 (days) | |

Table 3
Learning process parameters in scenario I.
Parameter | Value | Parameter | Value (unit)
N_inj | 1 | q_inj^max | 2515.8 (STB/day) [400 (m³/day)]
N_ET | 500 | ζ | 0.02
n_inj | 20 | |

Fig. 3. Average accumulative reward per trial for the short-term strategy in Scenario I.

Fig. 3 shows the average accumulative reward per trial for the short-term strategy in Scenario I. As can be seen, the optimal policy is gradually learned in fewer than 200 episodes, and the accumulative reward remains nearly constant over the last episodes. This implies that, although the rewarding scheme is simple, it can successfully converge to an optimal policy. The figure also reflects the performance quality of the proposed method in the reservoir recovery process.

Figs. 4 and 5 illustrate the instant reward and the accumulative reward for the different strategies in scenario I, respectively. Fig. 6 shows the learned optimal rate of water injection. The rates of produced oil and water for each production well, together with the total water injection rate and the total water and oil production of the reservoir, are shown in Figs. 7 and 8, respectively.

Fig. 4. Instant reward for different strategies in Scenario I.

Fig. 5. Accumulative reward for different strategies in Scenario I.

Fig. 6. Instant injection rate for different strategies in Scenario I.

As expected in the short-term strategy, in which maximizing the instant NPV is the main goal, oil extraction stops sooner than in the other strategies (see Fig. 4). In this strategy, the reward is simply the instant NPV. Therefore, maximizing the reward amounts to maximizing the instant NPV, or equivalently starting the waterflooding process at the maximum water injection rate. As a result, the percentage of water in the production wells increases rapidly; see Fig. 8a. Once water saturation occurs in the production wells, the full-rate injection strategy is no longer economic, because the operational costs originating from water injection and water disposal become greater than the profit obtained from oil production under the full-rate injection policy. Consequently, the agent learns to decrease the injection rate in these states in such a way that the instant NPV is kept at the maximum possible value as long as production is economical; afterwards, production is terminated. In the short-term strategy, two distinct jumps can be observed in the graph of the instant reward (see Fig. 4), around the 4th and 6th years. These events are due to the closing of specific production wells: at the mentioned years, the percentage of produced water exceeds 89% in the second and third production wells, respectively. According to (19), and considering the assumed oil price and water disposal cost, it is not profitable to produce oil from wells with watercut values above 89%. Consequently, production wells that meet this condition should be closed immediately. This behavior can also be observed in Fig. 7a, specifically for the mentioned wells.
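The 89% figure can be reproduced from the assumed prices in Table 2: with an oil price of 80 $/STB and a water production (disposal) cost of 10 $/STB, a barrel of produced fluid stops paying for its own water handling once the watercut exceeds c_o/(c_o + c_w) = 8/9 ≈ 0.89. The following check uses only this simplified per-barrel economics and ignores injection and other operating costs, so it illustrates where the threshold comes from rather than reproducing the full closing condition (19).

```python
c_o, c_w = 80.0, 10.0     # oil price and water production cost ($/STB), Table 2

def producer_is_economic(watercut):
    """Revenue of the oil fraction must exceed the disposal cost of the water fraction."""
    return c_o * (1.0 - watercut) > c_w * watercut

breakeven = c_o / (c_o + c_w)     # = 0.888..., i.e. roughly 89% watercut
print(producer_is_economic(0.85), producer_is_economic(0.90), round(breakeven, 3))
# True False 0.889
```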
In the three other strategies, the reward term is composed of the instant reward and the delayed reward. Therefore, the termination time of the waterflooding process is different and, in some cases, it can be managed; in other words, the production duration can be extended or shortened within a feasible range. In the combination of short-term scheduling and long-term scheduling A, the delayed reward is defined in such a way that the instant NPV remains constant over the life-cycle of the waterflooding process. As can be seen in Fig. 4, the instant reward of the second strategy follows the expected pattern, although it is slightly greater at the beginning of production than at the final time. Analyzing the optimal injection rate of this strategy shows that the agent learns to inject water at a lower rate in the initial months, when the percentage of produced water is drastically lower than that of produced oil in the production wells (see Fig. 8b); in this condition, the water disposal cost is very low. After a while, the share of produced water in the total produced fluid increases. Consequently, the optimal policy increases the water injection rate to compensate for the negative effect of the produced water on the gained profit. Similar to the short-term strategy, the agent prefers to decrease the injection rate once the production wells become water-saturated. Moreover, although this strategy tries to keep the instant NPV constant over the whole life-cycle, the reservoir may become unable to produce sufficient oil to satisfy this goal, mainly because the hydrocarbon productivity of the reservoir gradually diminishes during the operation.

Fig. 7. Production rate of each well in scenario I.

Fig. 8. Total flow rates in scenario I.

In the combination of short-term scheduling and long-term scheduling B, the main goal of the waterflooding process is to achieve economical production over a pre-defined operational period, regardless of the value of the instant profit. Consequently, the delayed reward is defined such that any injection profile that causes the production wells to become water-saturated sooner than the desired time is punished. Obviously, this strategy decreases the water injection rate in some periods (see Fig. 6). Compared to the first strategy, the water injection rate in the initial months reaches about 80% of the maximum injection capacity; the injection rate then gradually comes down to about 20% of the full-rate value. This behavior is the direct result of the adjustment policy of the delayed reward. Initially, the agent learns to maximize the instant reward, which leads to an injection profile similar to that of the short-term strategy, in which water saturation of the production wells occurs after seven years. Afterwards, the interaction between the instant reward and the delayed reward forces the injection rate to go down in order to lengthen the production period. Nevertheless, the decreasing steps are not identical during the whole process: in the first years, the instant NPV is significantly larger than its value at the end of waterflooding, and as a result the delayed reward has less effect on the rate reduction in that period. Furthermore, it is important to note that this strategy not only guarantees that the instant NPV is always positive, but is also capable of ensuring that the instant NPV stays above a desired achievable threshold during the production period. To this aim, it is sufficient to redefine $T^-$ in (25) as the moment at which the instant NPV first becomes less than the specified threshold.

Finally, to simultaneously achieve a constant NPV over a desired period of economical production, the combination of both long-term schedules (A and B) is introduced as a new long-term strategy, entitled the "hybrid strategy". In long-term scheduling A, the agent learns to achieve a nearly constant NPV during the waterflooding process, and the total duration of the operation as well as the average NPV are dictated by the learned water injection profiles. On the other hand, in long-term scheduling B, the agent learns to achieve an economical NPV, which should be non-negative or even greater than a specific positive threshold, over a desired time interval of the waterflooding process, regardless of the exact value of the instant NPV. Apparently, by combining both schedules, one can achieve a constant NPV over a predefined life-cycle. An advantage of this strategy is the ability to adjust the duration of the waterflooding process. This capability is really important in various types of multilateral operational contracts for fair profit-sharing between clients and contractors; in other words, the share of each party can be determined by properly adjusting the production period as well as the expected instant NPV at each time-step. As can be seen in Fig. 4, the instant NPV remains nearly constant for a longer period in comparison with the second strategy. Meanwhile, the value of the instant NPV is more acceptable than in the third strategy over the whole period of operation. In Fig. 6, the differences in the optimal water injection profiles corresponding to the applied strategies are traceable. The hybrid strategy has the lowest water injection rate in the initial years compared to the others; obviously, the reservoir can then be kept in operation for longer periods by utilizing this policy. In the middle years, the percentage of water in the production wells increases. As a result, the hybrid strategy learns to augment the water injection rate step by step to compensate for this phenomenon by producing more oil (see Fig. 8d). However, since the average NPV is less than the corresponding value in the second strategy, slightly smaller incremental steps in the water injection rate, and consequently a longer period of economical operation, are attained.

Finally, the status of the oil and water distribution in the reservoir during production for the studied strategies can be observed every four months in Fig. 9. These snapshots are for a single run.

Fig. 9. Status of the reservoir during the waterflooding process in scenario I for different strategies. The interval between snapshots is four months. The dark red color indicates the distribution of the available oil, while the blue color indicates the distribution of the water during the life-cycle. These snapshots are for a single run, not an average. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

7.2. Scenario II: MIMO optimization

Two fundamental strategies have been studied in this scenario:

• short-term strategy; and
• combination of short-term scheduling and long-term scheduling A.

Similar to the previous scenario, the graphs are again averages of results over twenty epochs, unless explicitly stated otherwise; however, this time each epoch contains 2000 episodes. Some learning parameters of scenario II are common with the previous scenario and have already been introduced in Table 2; Table 4 lists the specific learning parameters used only in scenario II.

Table 4
Learning process parameters in scenario II.
Parameter | Value | Parameter | Value (unit)
N_inj | 4 | q_inj^max | 628.95 (STB/day) [100 (m³/day)]
N_ET | 2000 | ζ | 0.006
n_inj | 10 | |
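With the Table 4 settings, each fuzzy rule has (n_inj)^{N_inj} = 10^4 candidate joint actions, compared with 20 in scenario I. The sketch below shows one possible enumeration of such a joint action set; whether the discrete rate levels are equally spaced and include zero is an assumption made here purely for illustration.

```python
import itertools
import numpy as np

n_inj, N_inj, q_max = 10, 4, 628.95          # Table 4: levels per well, wells, max rate (STB/day)
levels = np.linspace(0.0, q_max, n_inj)      # assumed discrete rate levels per injection well

# Joint action set: every combination of one level per injection well
joint_actions = list(itertools.product(levels, repeat=N_inj))
assert len(joint_actions) == n_inj ** N_inj  # 10**4 = 10,000 candidate rate vectors

rate_vector = joint_actions[1234]            # an arbitrary joint action (one rate per injector)
```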

As in scenario I, maximizing the value of the instant NPV is the main concern in the short-term strategy, while in the long-term strategy the desired goal is achieving the maximum constant NPV during the operation. The observed results show that, in this scenario, the combination of short-term scheduling and long-term scheduling A is nearly sufficient to guarantee a maximized pseudo-constant NPV over the expected period of oil production. It should be noted that increasing the number of injection wells may affect the efficiency of the hydrocarbon sweeping process, which results in variations in the reservoir life-cycle; as a result, there is no need to study and apply other, more advanced combinatorial strategies in this section. Furthermore, it can be observed that the duration of economical oil production in the short-term strategy lasts longer than the corresponding one in scenario I, while the total injection rate capacities in both scenarios are equal. Additionally, the gained accumulative reward has almost doubled in comparison with the corresponding strategy in the previous scenario. This rise is due to the enhanced efficiency of the sweeping process, equivalent to an increase in hydrocarbon recovery, obtained by using multiple injection wells. In the following parts, the results obtained from the studied strategies are described in more detail.
Figs. 10 and 11 illustrate the instant reward and the accumulative reward for the short-term and the long-term strategies in scenario II, respectively. Fig. 12 shows the learned optimal rate of total water injection. The water injection rate of each injection well, the rates of produced oil and water for each production well, and the total amounts of water and oil production in the reservoir are shown in Figs. 13–15, respectively. Finally, the status of the oil and water distributions in the reservoir during the production period for the studied strategies can be observed every four months in Fig. 16. It is clarified that these snapshots are for a single run.

Fig. 10. Instant reward for different strategies in Scenario II.

Fig. 11. Accumulative reward for different strategies in Scenario II.

Fig. 12. Instant injection rate for different strategies in Scenario II.

Fig. 13. Injection rate of each well in scenario II.

Fig. 14. Production rate of each well in scenario II.

Fig. 15. Total flow rates in scenario II.

Fig. 16. Status of the reservoir during the waterflooding process in scenario II for different strategies. The interval between snapshots is four months. The dark red color indicates the distribution of the available oil, while the blue color indicates the distribution of the water during the life-cycle. These snapshots are for a single run, not an average. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

In the short-term strategy, the agent correctly learns to apply the full injection capacity in the initial months to maximize the instant NPV (see Fig. 13a). As the watercut values in the production wells increase, the full-rate injection policy is no longer economical, since the operational costs related to water injection and water disposal exceed the profit obtained from oil production. Consequently, the agent successfully learns to decrease the injection rates in these states; however, the magnitudes of the reductions are not identical for the different injection wells. Similar to scenario I, two distinct jumps can again be observed in the graph of the instant reward (see Fig. 10), at months 55 and 80, which result from satisfying the closing condition (19) for production wells no. 4 and 2, respectively (see Fig. 14). It should be noted that in the MIMO scenario there are local dynamics between adjacent injection and production wells; thus, as the watercut value of a production well increases, the agent learns to gradually decrease the water injection rate of the neighboring injection wells, which may delay the occurrence of the closing condition. This pattern is observable in Fig. 13a.

In the second strategy, the combination of the instant reward and the delayed reward makes the instant NPV nearly constant over the life-cycle of oil production. Similar to the previous scenario, the agent learns to inject water at a lower rate in the initial months, since the percentage of produced water is drastically lower than that of produced oil in the production wells (see Fig. 15b) and the water disposal cost is very low. After a while, the share of produced water in the total produced fluid increases. Consequently, the optimal policy increases the water injection rates to compensate for the negative effect of the produced water on the gained profit. Similar to the short-term strategy, the agent prefers to decrease the injection rate whenever the production wells are becoming water-saturated. The results obtained from this strategy show a noticeable jump in the accumulative NPV over the production period compared to the other experiments considered in both scenarios. This valuable achievement is thoroughly aligned with the fact that, by properly increasing the number of injection wells and applying a more efficient sweeping policy, a higher profit can be ensured.

At this stage, a comparison between the performance of the developed methodology and two popular controlled waterflooding techniques is presented. Fortunately, the MRST toolbox provides an "Optimization Module" by which the performance of any newly developed optimization algorithm can be compared with the "Reactive Control" policy as well as with optimization based on quasi-Newton Sequential Quadratic Programming (SQP) with BFGS approximations of the Hessian. More theoretical details on these approaches are available in Völcker et al. (2011), Volcker et al. (2011), Krogstad (2015), Suwartadi (2012) and Hasan and Foss (2015). Moreover, since the Egg-Model is a synthetic reservoir in which all reservoir parameters are available, it offers a good opportunity to compare the performance of the developed methodology with the results obtained from the "Optimization Module". We emphasize again that in most real applications it is almost impossible to have access to the source codes of commercial reservoir simulators. However, since MRST is a free, open-source, mostly academic reservoir simulation environment, researchers can evaluate the performance of any proposed approach by using the appropriate modules of MRST.

In this part, we investigate, if the ultimate goal is to optimize hydrocarbon production for almost 14 years (the duration of the optimal production in the long-term strategy of Scenario II), what the accumulative profits would be when utilizing the reactive control strategy and the gradient-based strategy. From Fig. 17, it can easily be perceived that both classical strategies are unable to guarantee an acceptable profit for 14 years and that production becomes completely uneconomic after a while. However, aligned with our expectations from the developed RL-based technique, multiple objectives – such as adjusting the duration of production and also achieving an acceptable profit – can be pursued simultaneously by using the introduced methodology. In addition, Fig. 18 shows the maximum gained profits when applying the reactive and gradient-based strategies. Considering the results in Fig. 18 and the curves in Fig. 11, it can be inferred that the RL-based strategy leads to a larger gain in dollars than the reactive control policy, in which the injection rates are fixed at the most favorable constant values during the production period. Furthermore, although in the short term the gradient-based policy presents slightly better performance than the introduced RL-based methodology, its drawbacks, such as the necessity of having access to the simulator source codes and its failure in handling multiple objectives, may restrict its practical usage.

Finally, we restate that a serious limiting assumption in applying most gradient-based optimization techniques to a reservoir in practice is the availability of professional simulator source codes, which are necessary for computing the gradients of the objective function and constraints at each iteration. In other words, if all internal equations of the reservoir grid-blocks are accessible, gradient-based approaches can be applied. However, this assumption is somewhat unrealistic, since the developers of commercial/professional simulators do not usually reveal their source codes to users. Therefore, introducing methodologies that are capable of treating professional reservoir simulators as black boxes while searching for optimal solutions is really important. As a result, the developed algorithm may facilitate interaction with commercial simulators to extract the optimal policy during the waterflooding process.
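The practical point above can be made concrete with a thin wrapper: the RL-based optimizer only needs a "set controls, advance, report rates" interface from the simulator, whereas adjoint/gradient methods additionally require the simulator internals. The interface below is hypothetical and merely sketches this minimal contract.

```python
from typing import Callable, Sequence, Tuple

class BlackBoxReservoir:
    """Minimal simulator contract needed by the RL-based optimizer.

    `simulate_interval` wraps whatever tool is available (MRST scripts, or a
    commercial simulator driven through its input/output files) and returns the
    simulated responses for one control interval. No gradients, adjoints, or
    grid-block equations are exposed or required.
    """

    def __init__(self,
                 simulate_interval: Callable[[Sequence[float], float],
                                             Tuple[float, float]]):
        self._simulate_interval = simulate_interval

    def step(self, injection_rates: Sequence[float],
             dt_days: float = 30.0) -> Tuple[float, float]:
        """Apply the injection-rate controls for dt_days and return
        (total oil rate, total produced-water rate)."""
        return self._simulate_interval(injection_rates, dt_days)
```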

Fig. 17. Comparison of the accumulative profit obtained from the reactive control strategy, the gradient-based optimization policy, and the long-term RL-based optimization technique.

Fig. 18. Accumulative profits for reactive and gradient-based optimization policies.

8. Conclusion

In this paper, a reinforcement learning-based approach for managing and optimizing the waterflooding process in oil reservoirs has been introduced. The presented approach is able to successfully formulate the expected goals for optimization of the complex waterflooding process. All of the principal elements of the RL framework, including states, actions, rewards, and the transition and updating rules, have been defined by taking into account the general concepts of production engineering in hydrocarbon reservoirs. Although in practice the waterflooding process is inherently continuous, in the first stage the discrete version of the algorithm has been explained for clarity; afterwards, the necessary modifications have been made to switch to the real, continuous domain.

The results observed on the Egg-model benchmark case study for the SIMO and MIMO optimization scenarios have shown that, by using this methodology, the agent appropriately learns the sequence of actions, or indeed the appropriate water injection profiles, that leads to achieving the predefined objectives. Obviously, waterflooding process management can then be carried out straightforwardly, without having to tackle the cumbersome computations required by conventional gradient-based methods – which generally need an explicit model of the reservoir dynamics – to explore the optimal solution. Moreover, in the studied scenarios both short-term and long-term production policies have been simulated and analyzed to demonstrate the ability of the presented technique to handle different operational strategies. It has been demonstrated that, by appropriately adjusting the interactions between the instant and delayed rewards in the algorithm, various production policies or objectives can be pursued. This capability is thoroughly useful in practical applications for managing the profit-sharing process among shareholders according to contractual obligations: at the end of the learning process the agent has learned the most suitable sequence of actions (e.g., injection rates) that leads to the desired objectives, and consequently the gained profit is fairly allocated to each party.

Acknowledgment

The first and second authors would also like to thank Mr. Saeed Saedi for his comments on this article.

Nomenclature

Reservoir model parameters
ρ  Density
u_{o/w}  Oil/water superficial velocity
φ  Porosity
S_{o/w}  Oil/water saturation
p_{o/w/cow}  Oil/water/capillary pressure
k  Absolute permeability
k_r  Relative permeability
μ  Viscosity of phase
λ̃_{o/w}  Oil/water mobilities
J  Optimization cost
q_{o/w}  Flow rate of produced oil/water
q_inj  Flow rate of injected water
c_{inj/w}  Water injection/production cost
c_o  Oil price
N_{prd/inj}  Number of production/injection wells
N_t  Production life-cycle
δt  Time interval

Reinforcement learning parameters
s_t  State at time t
a_t  Action at time t
r_t  Reward obtained for time t
r^d_t  Delayed reward
π  Policy
V(s_t)  Value function of state s_t
Q(s_t, a_t)  Value function of a state–action pair
θ  Currently available oil in the reservoir (percentage)
q̃  Water cut (percentage)
n_{θ/w/o}  Number of discretization levels
p(a|s)  Probability of selecting action a_t in state s_t
τ  Temperature

References

Abbeel, P., Coates, A., Quigley, M., Ng, A.Y., 2007. An application of reinforcement learning to aerobatic helicopter flight. Adv. Neural Inf. Process. Syst. 19, 1.
Asadollahi, M., Dadashpour, M., Kleppe, J., Naevdal, G., 2009. Production optimization using derivative free methods applied to Brugge field. In: IRIS-Gubkin-Statoil-NTNU Workshop in Production Optimization, Trondheim, Norway, 28th September.
Asadollahi, M., Nævdal, G., Dadashpour, M., Kleppe, J., 2014. Production optimization using derivative free methods applied to Brugge field case. J. Petrol. Sci. Eng. 114, 22–37.

Audet, C., Dennis Jr, J.E., 2002. Analysis of generalized pattern searches. SIAM J. Optim. 13 (3), 889–903.
Audet, C., Dennis Jr, J.E., 2006. Mesh adaptive direct search algorithms for constrained optimization. SIAM J. Optim. 17 (1), 188–217.
Aziz, K., Settari, A., 1979. Petroleum Reservoir Simulation. Chapman & Hall.
Bai, Y., Li, J., Zhou, J., Li, Q., 2008. Sensitivity analysis of the dimensionless parameters in scaling a polymer flooding reservoir. Transp. Porous Media 73 (1), 21–37.
Barreto, A.d.M.S., Anderson, C.W., 2008. Restricted gradient-descent algorithm for value-function approximation in reinforcement learning. Artificial Intelligence 172 (4), 454–482.
Bidgoly, H.J., Vafaei, A., Sadeghi, A., Ahmadabadi, M.N., 2010. Learning approach to study effect of flexible spine on running behavior of a quadruped robot. In: Emerging Trends in Mobile Robotics. World Scientific, pp. 1195–1201.
Bucak, I.O., Zohdy, M.A., 1999. Application of reinforcement learning control to a nonlinear bouncing cart. In: American Control Conference, 1999. Proceedings of the 1999, Vol. 2. IEEE, pp. 1198–1202.
Capolei, A., Suwartadi, E., Foss, B., Jørgensen, J.B., 2013. Waterflooding optimization in uncertain geological scenarios. Comput. Geosci. 17 (6), 991–1013.
Chen, Y., Oliver, D.S., Zhang, D., et al., 2009. Efficient ensemble-based closed-loop production optimization. SPE J. 14 (04), 634–645.
Ciaurri, D.E., Mukerji, T., Durlofsky, L.J., 2011. Derivative-free optimization for oil field operations. In: Computational Optimization and Applications in Engineering and Industry. Springer, pp. 19–55.
CMOST users guide, 2012. Calgary, Alberta: Computer Modelling Group Ltd.
Collins, A., Thomas, L., 2013. Learning competitive dynamic airline pricing under different customer models. J. Revenue Pricing Manag. 12 (5), 416–430.
Colton, W.M., 2011. The outlook for energy: a view to 2030. Technical report, www.exxonmobil.com/energyoutlook.
Datta-Gupta, A., Alhuthali, A.H., Yuen, B., Fontanilla, J., et al., 2010. Field applications of waterflood optimization via optimal rate control with smart wells. SPE Reservoir Eval. Eng. 13 (03), 406–422.
Dehdari, V., Oliver, D.S., et al., 2011. Sequential quadratic programming (SQP) for solving constrained production optimization—case study from Brugge field. In: SPE Reservoir Simulation Symposium. Society of Petroleum Engineers.
Derhami, V., Majd, V.J., Ahmadabadi, M.N., 2008. Fuzzy Sarsa learning and the proof of existence of its stationary points. Asian J. Control 10 (5), 535–549.
Doya, K., 2000. Reinforcement learning in continuous time and space. Neural Comput. 12 (1), 219–245.
van Eck, N.J., van Wezel, M., 2004. Reinforcement Learning and its Application to Othello. Department of Computer Science, Faculty of Economics, Erasmus University, The Netherlands.
Fernandez-Gauna, B., Ansoategui, I., Etxeberria-Agiriano, I., Graña, M., 2014. Reinforcement learning of ball screw feed drive controllers. Eng. Appl. Artif. Intell. 30, 107–117.
Forouzanfar, F., Della Rossa, E., Russo, R., Reynolds, A., 2013. Life-cycle production optimization of an oil field with an adjoint-based gradient approach. J. Petrol. Sci. Eng. 112, 351–358.
Foss, B., 2012. Process control in conventional oil and gas field: Challenges and opportunities. Control Eng. Pract. 20 (10), 1058–1064.
Foss, B., Grimstad, B., Gunnerud, V., 2015. Production optimization–facilitated by divide and conquer strategies. IFAC-PapersOnLine 48 (6), 1–8.
Foss, B.A., Jensen, J.P., et al., 2011. Performance analysis for closed-loop reservoir management. SPE J. 16 (01), 183–190.
Ghandi, A., Lin, C.-Y.C., 2012. Do Iran's buy-back service contracts lead to optimal production? The case of Soroosh and Nowrooz. Energy Policy 42, 181–190.
Golberg, D.E., 1989. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, p. 102.
Gosavi, A., 2009. Reinforcement learning: A tutorial survey and recent advances. INFORMS J. Comput. 21 (2), 178–192.
Gosavi, A., Bandla, N., Das, T.K., 2002. A reinforcement learning approach to a single leg airline revenue management problem with multiple fare classes and overbooking. IIE Trans. 34 (9), 729–742.
Guo, Z., Dong, M., Chen, Z., Yao, J., 2012. Dominant scaling groups of polymer flooding for enhanced heavy oil recovery. Ind. Eng. Chem. Res. 52 (2), 911–921.
Hajimiri, M.H., Ahmadabadi, M.N., Rahimi-Kian, A., 2014. An intelligent negotiator agent design for bilateral contracts of electrical energy. Expert Syst. Appl. 41 (9), 4073–4082.
Harmon, M.E., Harmon, S.S., 1997. Reinforcement Learning: A Tutorial. Technical Report, Wright Lab, Wright-Patterson AFB, OH.
Hasan, A., Foss, B., 2015. Optimal switching time control of petroleum reservoirs. J. Petrol. Sci. Eng. 131, 131–137.
He, J., Sætrom, J., Durlofsky, L.J., 2011. Enhanced linearized reduced-order models for subsurface flow simulation. J. Comput. Phys. 230 (23), 8313–8341.
Heidrich-Meisner, V., Lauer, M., Igel, C., Riedmiller, M.A., 2007. Reinforcement learning in a nutshell. In: ESANN. pp. 277–288.
Van den Hof, P.M., Jansen, J.D., Heemink, A., 2012. Recent developments in model-based optimization and control of subsurface flow in oil reservoirs. IFAC Proc. Vol. 45 (8), 189–200.
Horowitz, B., Afonso, S.M.B., de Mendonça, C.V.P., 2013. Surrogate based optimal waterflooding management. J. Petrol. Sci. Eng. 112, 206–219.
Hourfar, F., Moshiri, B., Salahshoor, K., Elkamel, A., 2017. Real-time management of the waterflooding process using proxy reservoir modeling and data fusion theory. Comput. Chem. Eng. 106, 339–354.
Hourfar, F., Moshiri, B., Salahshoor, K., Zaare-Mehrjerdi, M., Pourafshary, P., 2016. Adaptive modeling of waterflooding process in oil reservoirs. J. Petrol. Sci. Eng. 146, 702–713.
Hourfar, F., Salahshoor, K., Zanbouri, H., Elkamel, A., Pourafshary, P., Moshiri, B., 2018. A systematic approach for modeling of waterflooding process in the presence of geological uncertainties in oil reservoirs. Comput. Chem. Eng. 111, 66–78.
Isebor, O.J., Durlofsky, L.J., 2014. Biobjective optimization for general oil field development. J. Petrol. Sci. Eng. 119, 123–138.
Jansen, J., 2011. Adjoint-based optimization of multi-phase flow through porous media–a review. Comput. & Fluids 46 (1), 40–51.
Jansen, J.D., Bosgra, O.H., Van den Hof, P.M., 2008. Model-based control of multiphase flow in subsurface oil reservoirs. J. Process Control 18 (9), 846–855.
Jansen, J., Fonseca, R., Kahrobaei, S., Siraj, M., Van Essen, G., Van den Hof, P., 2014. The egg model–a geological ensemble for reservoir simulation. Geosci. Data J. 1 (2), 192–195.
Jouffe, L., 1998. Fuzzy inference system learning by reinforcement methods. IEEE Trans. Syst. Man Cybern. C 28 (3), 338–355.
Kaelbling, L.P., Littman, M.L., Moore, A.W., 1996. Reinforcement learning: A survey. J. Artif. Intell. Res. 4, 237–285.
Khan, S.G., Herrmann, G., Lewis, F.L., Pipe, T., Melhuish, C., 2012. Reinforcement learning and optimal adaptive control: An overview and implementation examples. Annu. Rev. Control 36 (1), 42–59.
Kramer, O., Ciaurri, D.E., Koziel, S., 2011. Derivative-free optimization. In: Computational Optimization, Methods and Algorithms. Springer, pp. 61–83.
Krogstad, J.A., 2015. Control-Switching Strategies for Reservoir Water-Flooding Management (Master's thesis), NTNU.
Lie, K., 2014. An Introduction to Reservoir Simulation Using MATLAB: User Guide for the MATLAB Reservoir Simulation Toolbox (MRST). SINTEF ICT.
Luenberger, D.G., Ye, Y., et al., 2008. Linear and Nonlinear Programming, Vol. 2. Springer.
Mehran, K., 2008. Takagi-Sugeno fuzzy modeling for process control. Industrial Automation, Robotics and Artificial Intelligence (EEE8005). Newcastle University, Newcastle upon Tyne, UK.
Nævdal, G., Brouwer, D.R., Jansen, J.D., 2006. Waterflooding using closed-loop control. Comput. Geosci. 10 (1), 37–60.
Nanduri, V., 2011. Application of reinforcement learning-based algorithms in CO2 allowance and electricity markets. In: 2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL). pp. 164–169.
Precup, D., Sutton, R.S., 1997. Exponentiated gradient methods for reinforcement learning. In: ICML. pp. 272–277.
Rao, S.S., Rao, S., 2009. Engineering Optimization: Theory and Practice. John Wiley & Sons.
Sánchez, E., Clempner, J., Poznyak, A., 2015. A priori-knowledge/actor-critic reinforcement learning architecture for computing the mean-variance customer portfolio: The case of bank marketing campaigns. Eng. Appl. Artif. Intell. 46, 82–92.
Sarma, P., 2006. Efficient Closed-Loop Optimal Control of Petroleum Reservoirs under Uncertainty (Ph.D. thesis), Stanford University.
Shafiei, A., Dusseault, M.B., Zendehboudi, S., Chatzis, I., 2013. A new screening tool for evaluation of steamflooding performance in naturally fractured carbonate reservoirs. Fuel 108, 502–514.
Shakhsi-Niaei, M., Iranmanesh, S.H., Torabi, S.A., 2014. Optimal planning of oil and gas development projects considering long-term production and transmission. Comput. Chem. Eng. 65, 67–80.
Sherali, H.D., Bae, K.H., Haouari, M., 2010. Integrated airline schedule design and fleet assignment: Polyhedral analysis and Benders' decomposition approach. INFORMS J. Comput. 22 (4), 500–513.
Sincock, K., Black, C., et al., 1988. Validation of water/oil displacement scaling criteria using microvisualization techniques. In: SPE Annual Technical Conference and Exhibition. Society of Petroleum Engineers.
Siraj, M.M., den Hof, P.M.V., Jansen, J.D., 2016. Robust optimization of water-flooding in oil reservoirs using risk management tools. IFAC-PapersOnLine 49 (7), 133–138. 11th IFAC Symposium on Dynamics and Control of Process Systems Including Biosystems, DYCOPS-CAB 2016, Trondheim, Norway, 6–8 June 2016.
Souza, S., Afonso, S., Horowitz, B., 2010. Optimal management of oil production using the particle swarm algorithm and adaptive surrogate models. In: XXXI Iberian Latin-American Congress on Computational Methods in Engineering, Buenos Aires. pp. 15–18.
Srivastava, R., Huang, S., Dyer, S., et al., 1994. Scaling criteria for micellar flooding experiments. In: Annual Technical Meeting. Petroleum Society of Canada.
Sutton, R.S., 1996. Generalization in reinforcement learning: Successful examples using sparse coarse coding. Adv. Neural Inf. Process. Syst. 1038–1044.
Sutton, R.S., Barto, A.G., 1998. Introduction to Reinforcement Learning, Vol. 135. MIT Press, Cambridge.
Suwartadi, E., 2012. Gradient-based Methods for Production Optimization of Oil Reservoirs (Ph.D. thesis), Norwegian University of Science and Technology.
Takagi, T., Sugeno, M., 1985. Fuzzy identification of systems and its applications to modeling and control. IEEE Trans. Syst. Man Cybern. (1), 116–132.
Van Essen, G., Van den Hof, P., Jansen, J.D., et al., 2011. Hierarchical long-term and short-term production optimization. SPE J. 16 (01), 191–199.
Volcker, C., Jørgensen, J.B., Stenby, E.H., 2011. Oil reservoir production optimization using optimal control. In: 2011 50th IEEE Conference on Decision and Control and European Control Conference (CDC-ECC). pp. 7937–7943.
Völcker, C., Jørgensen, J.B., Thomsen, P.G., 2011. Production Optimization of Oil Reservoirs. Technical University of Denmark (DTU).
Walraven, E., Spaan, M., Bakker, B., 2016. Traffic flow optimization: A reinforcement learning approach. Eng. Appl. Artif. Intell. 52, 203–212.
Wang, C., Li, G., Reynolds, A.C., et al., 2009. Production optimization in closed-loop reservoir management. SPE J. 14 (03), 506–523.
Wen, T., Thiele, M.R., Ciaurri, D.E., Aziz, K., Ye, Y., 2014. Waterflood management using two-stage optimization with streamline simulation. Comput. Geosci. 18 (3–4), 483–504.
World energy demand and economic outlook, 2016. U.S. Energy Information Administration, International Energy Outlook.
Xu, S., Zeng, F., Chang, X., Liu, H., 2013. A systematic integrated approach for waterflooding optimization. J. Petrol. Sci. Eng. 112, 129–138.
Zendehboudi, S., Chatzis, I., Mohsenipour, A.A., Elkamel, A., 2011a. Dimensional analysis and scale-up of immiscible two-phase flow displacement in fractured porous media under controlled gravity drainage. Energy Fuels 25 (4), 1731–1750.
Zendehboudi, S., Chatzis, I., Shafiei, A., Dusseault, M.B., 2011b. Empirical modeling of gravity drainage in fractured porous media. Energy Fuels 25 (3), 1229–1241.
Zendehboudi, S., Mohammadzadeh, O., Chatzis, I., et al., 2011. Experimental study of controlled gravity drainage in fractured porous media. J. Can. Petrol. Technol. 50 (2), 56–71.
Zhao, H., Chen, C., Do, S.T., Li, G., Reynolds, A.C., et al., 2011. Maximization of a dynamic quadratic interpolation model for production optimization. In: SPE Reservoir Simulation Symposium. Society of Petroleum Engineers.
Zhao, X., Luo, D., Xia, L., 2012. Modelling optimal production rate with contract effects for international oil development projects. Energy 45 (1), 662–668.
