
SPE 152414

Challenges in High Performance Computing for Reservoir Simulation


M Ehtesham Hayder and Majdi Baddourah, Saudi Aramco

Copyright 2012, Society of Petroleum Engineers

This paper was prepared for presentation at the EAGE Annual Conference & Exhibition incorporating SPE Europec held in Copenhagen, Denmark, 4–7 June 2012.

This paper was selected for presentation by an SPE program committee following review of information contained in an abstract submitted by the author(s). Contents of the paper have not been
reviewed by the Society of Petroleum Engineers and are subject to correction by the author(s). The material does not necessarily reflect any position of the Society of Petroleum Engineers, its
officers, or members. Electronic reproduction, distribution, or storage of any part of this paper without the written consent of the Society of Petroleum Engineers is prohibited. Permission to
reproduce in print is restricted to an abstract of not more than 300 words; illustrations may not be copied. The abstract must contain conspicuous acknowledgment of SPE copyright.

Abstract

High resolution reservoir modeling is necessary to analyze complex flow phenomena in reservoirs. As more powerful computing platforms become available, reservoir simulation engineers are building larger and higher resolution reservoir models to study giant fields. A large number of simulations is necessary to validate a model and to lower uncertainty in prediction results. It is a challenge to accurately model complex processes in the reservoir and also to efficiently solve high resolution giant reservoir models on rapidly evolving hardware and software platforms. There are many challenges to getting high performance in reservoir simulations on constantly evolving computing platforms, and there are constraints which limit the performance of large scale reservoir simulation. In this study, we review some of these constraints and show their effects on practical reservoir simulation. We review emerging computing platforms and highlight opportunities and challenges for reservoir simulation on those platforms. It is anticipated that management of data locality by the simulator will become very important on emerging computing platforms, and locality will need to be managed to achieve good performance. Heterogeneity in the computing platform will make it difficult to get good performance without adoption of a hybrid parallelization style in the simulator. In this study, we analyze many benchmark results to illustrate challenges in high performance computations of reservoir simulation on current and emerging computing platforms.

1. Introduction

Reservoir simulation is an important tool to gain insight into flow processes in the reservoir. Coats [1] gave a brief history of early reservoir simulation and discussed numerical errors in such computations. Seismic interpretation, core data and well logs are used to create geological or static models. Simulation models are built by up-scaling a static geological model and then integrating that model with historical well data. Because the up-scaling process introduces approximation errors, efforts are being made to build simulation models that avoid up-scaling [2]. These high resolution models provide detailed solutions, but pose a challenge for practical simulation. Modeling fluid flow in a well has become complex with the introduction of Maximum Reservoir Contact (MRC) wells with equalizers and other down-hole equipment. Unstructured gridding is needed to accurately model features in the reservoir, and it adds considerable complexity to simulation. Fig. 1a shows details of a reservoir model which are needed to describe the reservoir and modern wells accurately. As one can easily realize, it is not an easy task to model and solve problems with such complexities. There have been considerable advances in techniques for reservoir simulation, and many large high resolution models have been built to study reservoirs; however, many challenges still remain to be overcome. Branets et al. [3] gave an overview of modeling techniques for complex geometries and heterogeneous properties of reservoirs. One of the key challenges in reservoir modeling is accurate representation of reservoir geometry, including the structural framework.
The structural frameworks define major sections of the reservoir and often provide the first order controls on fluid volumes in place and fluid movements during production. Another major challenge in reservoir modeling is to accurately and efficiently represent reservoir heterogeneity at multiple scales. Large simulation models have been built to study reservoirs. Sunaidi [4] presented simulation results of a 1.2 million cell model of a large field using the POWERS [5] reservoir simulator in 1998. It is obvious from his results (see Fig. 1b) that water fingering through permeability streaks tends to leave trapped oil. Such simulation and analysis enables simulation engineers to follow fingers as they develop and to examine the evolution of the flow patterns. They can then formulate alternative strategies to avoid or minimize loss of reserves because of bypassing. High resolution models have provided many insights into reservoir processes [6] in large reservoirs. Because of this obvious benefit, there is great motivation to build very high resolution models to accurately capture heterogeneity in reservoirs. High performance computations are essential to obtain solutions of these models in a reasonable time.

(a) Details of a model  (b) Oil Saturation (Ref. [4])
Fig 1: Simulation model

Baddourah et al. [7] gave an overview of the current status of large scale reservoir simulation. They have shown that there has been tremendous improvement in both hardware and software tools to support high performance reservoir simulation in recent years. They showed (Fig. 2) that a simulation of a 10 million cell reservoir model with over 3000 wells, simulating 60 years of history and prediction, which used to take 45 hours to run in 2001, took less than half an hour in 2010 on computing platforms at Saudi Aramco. This advance in high performance computing (HPC) resulted in a significant reduction of simulation time and made it possible to run many simulation studies quickly. Keyes [8] pointed out that the cost of simulation is rapidly decreasing, creating great opportunities to solve larger, more complex problems. He used the computational speed achieved by Gordon Bell prize winning projects at the yearly Supercomputing conference organized by the IEEE Computer Society and ACM to make his argument. In 1989 the cost of a MFlop (million floating point operations per second) was about $2500 using the CM-2; twenty years later, in 2009, the equivalent cost had gone down to $0.0085, or less than a penny, using GPU technology. New computing platforms come with great opportunities; the challenge is to extract a significant portion of their potential performance in actual simulation.
Simulation engineers require rapid turnaround of hundreds of simulation jobs which can forecast reservoir performance. HPC is indispensable to address such needs. There are many hurdles towards simulating very large models. The computational load to solve a reservoir model increases nonlinearly with the model size, and there is also a need for large, fast memory on a processor to accommodate data for a large simulation model. Complex algorithms are needed to manage data efficiently and distribute load evenly among processors. If all array variables in the simulation are not distributed across processors, the size of memory on a node will put a constraint on the maximum size of the model that can be simulated on the system. For example, a one billion element array (using double precision) requires about 7.5 gigabytes of memory for storage. One can distribute arrays among processors instead of maintaining a global array. This approach has communication overhead, i.e., it increases data movement. As we will discuss later, the cost of data movement is a critical issue on emerging architectures. Therefore, an algorithm and its implementation on the underlying architecture need to be designed carefully to extract good performance from increasingly complex computing platforms. Saudi Aramco uses an intensive collaborative environment [9] in reservoir simulation studies to address uncertainty in reservoir characterizations and shorten decision making cycle time. The intense nature of these studies creates very high demand for computational resources. Such demand can only be met by ensuring high availability of HPC resources. To provide a cost effective platform for HPC in reservoir simulation, Saudi Aramco currently uses PC-Clusters [10] for reservoir simulations.

In this study, our focus is to examine HPC of reservoir simulation. The Saudi Aramco in-house simulators POWERS and GigaPOWERS™ [2] have been extensively used to solve many large reservoir models. Experience gained from these simulators should be applicable to HPC of reservoir simulation with other simulators in general. In this paper, we review many of our earlier studies and also examine new issues to make our analysis comprehensive. In the next section, we discuss challenges in HPC, and then we review high performance reservoir simulation, where we share some of our experiences at Saudi Aramco. We then discuss emerging trends and challenges for HPC. Finally, we conclude with a discussion on the outlook for HPC in reservoir simulation.

Fig 2: 60 year simulation time for a 10 million cell model with over 3000 wells (Ref. [7])

2. Challenges in HPC

With the advent of modern computers, simulation has become a powerful tool along with theory and experiments. Numerical simulation may provide cost effective answers to many problems of interest to reservoir development and management engineers which are extremely difficult to address by theoretical analysis or extremely costly to address by running series of experiments. Many physical problems, including processes inside the reservoir, are very complex. Analysis of such problems often requires fast solution of large systems of equations. There are many challenges towards getting solutions of large models efficiently. We discuss some of the important issues in the next sections.

2.1 Performance on a Single Processor

The processing element in the CPU needs to access data quickly to maintain high computational speed, which is commonly measured in floating point operations per second (flops). The processing element has access to very high speed memory, known as cache. There is a hierarchy of cache memory on a CPU with varying access times, or latencies. Performance of the CPU degrades if needed data are not available in a cache level nearer to the processor. Cache misses can occur because of limited cache capacity, conflicts in data access, etc. Many models have been built to analyze the performance of computing architectures. They are based on techniques such as statistical analysis, bottleneck analysis, etc.
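The effect of cache behavior can be seen in a simple loop-ordering experiment. The sketch below is illustrative only and is not taken from any simulator code; the array size and timing approach are assumptions. Traversing a two dimensional array in the order it is laid out in memory (unit stride) reuses cache lines, while traversing it with a large stride forces far more cache misses for the same arithmetic.

```c
#include <stdio.h>
#include <time.h>

#define N 4096                      /* hypothetical array dimension */

int main(void)
{
    static double a[N][N];          /* row-major storage in C */
    double sum = 0.0;
    clock_t t0, t1;

    /* Unit-stride traversal: the inner loop walks along cache lines. */
    t0 = clock();
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    t1 = clock();
    printf("row-wise sum:    %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    /* Strided traversal: the inner loop jumps N*8 bytes per access, */
    /* so nearly every load misses in cache for large N.             */
    t0 = clock();
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];
    t1 = clock();
    printf("column-wise sum: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    return (int)sum;   /* keep the compiler from removing the loops */
}
```

Restructuring loops for unit stride access of this kind is one of the optimizations discussed in the roofline analysis that follows.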
(a) model-1  (b) model-2
Fig 3: Roofline Model

The Roofline model proposed by Williams et al. [11] can be used to evaluate a computing platform and assess possible room for performance enhancement of a computational kernel. The model gives an upper bound on the performance of a kernel or a code depending on its operational (or arithmetic) intensity. Williams et al. defined operational intensity as operations (in the processor or CPU) per data movement, measured in bytes of Dynamic Random-Access Memory (DRAM) traffic between caches and memory. Characteristics of a computational kernel can also be defined in terms of arithmetic intensity, which measures traffic between the processor and cache memory. The Roofline model can be used to identify which computational platform would be a good match for a given kernel (or code), or conversely to get a guideline on how to change the kernel (or code) and also the hardware architecture to run the desired kernel more efficiently. Operational (or arithmetic) intensity can be determined by analyzing the algorithm; for example, Williams et al. considered a seven point computational stencil for the heat equation which performs eight floating point operations for every 24 bytes of memory traffic, which makes the operational intensity 8/24, or 0.33 Flops/Byte. Alternately, one can measure operational intensity using performance monitoring tools.

The Roofline model combines peak floating point performance, operational intensity and memory performance in a two dimensional graph. The peak floating point performance can be found in the hardware specification or may also be measured by appropriate benchmarks. Similarly, the memory performance can also be determined by benchmarks. Fig. 3a shows a typical roofline model for a computer, where the x-axis is the operational intensity (Flops per DRAM byte accessed), varying from 0.125 to 256, and the y-axis is the attainable floating point performance (GFlop/s). There are two solid lines in the graph; the horizontal line shows the peak floating point performance. Actual floating-point performance cannot exceed the limit set by this line. The second solid line indicates the peak memory bandwidth. These two lines meet at the ridge point of the roofline graph. The x-coordinate of the ridge point is the minimum operational intensity required to achieve the maximum performance of the hardware. If the ridge point is far to the right, then only kernels with very high operational intensity can achieve the maximum performance on the computer. If the ridge point is far to the left, then almost any kernel can potentially achieve the maximum performance. We show two computational kernels in Fig. 3a. Performance of a kernel can vary along the dashed line for its particular operational intensity. Kernel A is bound by memory bandwidth, i.e., is memory-bound, and kernel B is bound by peak floating point performance, i.e., is compute-bound. In reality a particular kernel may perform below the limit set by the roofline model. A certain level of optimization is required in the implementation to break through performance ceilings below the roofline. In other words, performance will likely be limited by a ceiling (shown by dashed and dotted lines in Fig. 3b) unless the required optimizations are performed in the implementation of the kernel. Examples of these optimizations are loop unrolling, restructuring loops for unit stride access, and software prefetching, which also help reduce memory bottlenecks.
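The roofline bound itself is simple to evaluate: attainable performance is the lesser of the machine's peak floating point rate and the product of operational intensity and peak memory bandwidth. The sketch below applies this to the 0.33 Flops/Byte stencil discussed above and to a high-intensity kernel such as DGEMM (operational intensity 14.15, quoted from Ref. [12] and discussed in the next section). The peak rates used here are illustrative assumptions, not measurements of any particular platform.

```c
#include <stdio.h>

/* Roofline bound: attainable GFlop/s = min(peak_flops, OI * peak_bandwidth). */
static double roofline(double oi, double peak_gflops, double peak_gbs)
{
    double memory_bound = oi * peak_gbs;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}

int main(void)
{
    double peak_gflops = 200.0;     /* assumed peak floating point rate (GFlop/s) */
    double peak_gbs    = 40.0;      /* assumed peak DRAM bandwidth (GB/s)         */

    double oi_stencil = 8.0 / 24.0; /* seven point heat equation stencil [11]     */
    double oi_dgemm   = 14.15;      /* DGEMM operational intensity from Ref. [12] */

    printf("ridge point at OI = %.2f Flops/Byte\n", peak_gflops / peak_gbs);
    printf("stencil bound = %.1f GFlop/s\n", roofline(oi_stencil, peak_gflops, peak_gbs));
    printf("DGEMM bound   = %.1f GFlop/s\n", roofline(oi_dgemm, peak_gflops, peak_gbs));
    return 0;
}
```

With these assumed peaks, the stencil is limited by memory bandwidth while DGEMM reaches the floating point ceiling, which is exactly the memory-bound versus compute-bound distinction drawn in Fig. 3a.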
Fig 4: Example of a Roofline model (Ref. [12])

A roofline model of the AMD Magny Cours processor discussed by Mora [12] is shown in Fig. 4. The Stream Triad kernel, with operational intensity 0.082, is memory-bound; on the other hand, DGEMM, with operational intensity 14.15, is compute-bound. The performance of GROMACS is below the bound given by the roofline model of the machine.

2.2 Constraints on a Parallel Computer

In parallel computing, a model is decomposed into many parts and then solved using multiple computing resources in a coordinated way. A popular approach to parallelization is domain decomposition. This process is shown in Fig. 5, where a reservoir model is decomposed into 12 domains. Each domain is then assigned to one or more processors to perform calculations simultaneously. The various processes need to be synchronized at some points, depending on the algorithm. Independent processors can continue until they reach a synchronization or barrier point. Data exchanges among processors can take place in either a synchronized or an asynchronized fashion, depending on the implementation of the algorithm.

Figure 5: Domain Decomposition

Amdahl's law [13] states that the performance of a parallel machine is limited by the serial portion of the code. If the serial fraction of a code is f, then the speedup of the code on a parallel computer with n processors is given by Eq (1). Amdahl's law puts a serious limit on the speedup attainable on a parallel computer. As the capacity of the computer increases, one would like to solve a bigger problem rather than the same size problem. To capture this notion, Gustafson [14] proposed the alternate bound given in Eq (2), assuming the problem size increases with the number of processors. Fig. 6 shows scalability limits according to Amdahl's and Gustafson's laws for parallel computations with f = 0.01, i.e., 1% of the code is serial and 99% of the code is parallelizable. Recently Hill & Marty [15] extended Amdahl's law for multicores in a single CPU, where they show that a locally inefficient method can be globally efficient. Please see the cited reference for details of their analysis.

SA(f, n) = 1 / [f + (1 - f)/n]     Eq (1)

SG(f, n) = f + n(1 - f)            Eq (2)

Fig 6: Limits on Scalability of Simulation
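For the f = 0.01 case plotted in Fig. 6, the two bounds can be evaluated directly from Eqs (1) and (2). The short sketch below does just that; it is only a worked illustration of the formulas, and the processor counts chosen are arbitrary.

```c
#include <stdio.h>

/* Eq (1): Amdahl's fixed-size speedup, f = serial fraction of the code.  */
static double amdahl(double f, double n)    { return 1.0 / (f + (1.0 - f) / n); }

/* Eq (2): Gustafson's scaled speedup for a problem that grows with n.    */
static double gustafson(double f, double n) { return f + n * (1.0 - f); }

int main(void)
{
    const double f = 0.01;                      /* 1% serial, 99% parallel    */
    const int n[] = {16, 64, 256, 1024, 4096};  /* arbitrary processor counts */

    printf("%8s %12s %12s\n", "n", "Amdahl", "Gustafson");
    for (int i = 0; i < 5; i++)
        printf("%8d %12.1f %12.1f\n", n[i], amdahl(f, n[i]), gustafson(f, n[i]));

    /* Amdahl's bound saturates near 1/f = 100, while Gustafson's grows   */
    /* almost linearly with n, which is the contrast shown in Fig. 6.     */
    return 0;
}
```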

3. High Performance Reservoir Simulation

Parallel computation can provide solutions very quickly; however, as Killough [16] pointed out, there are many issues which need careful consideration. He gave an overview of early computers and highlighted multiple difficulties which one faced about twenty years ago. He focused mainly on difficulties associated with load balancing, data structures, linear equation solution and well management calculations. Domain decomposition is the common practice for dividing computational load among processes. Since it is expensive, dynamic load balancing is hardly practiced at this time. Dynamic load balancing may be considered when moving data is inexpensive; as we will see later in this paper, this is not a promising strategy for emerging technology.

To achieve an efficient static load balance (domain decomposition) for a model, we developed an empirical relation based on run statistics and some simulation parameters. This approach has been very useful in improving the overall performance of simulation and the utilization of our clusters. Load balance can be difficult to achieve if the simulation integrates multiple modules with disparate levels of performance (for example, the coupled facility and reservoir simulation study reported by Hayder et al. [17] couples a highly efficient reservoir simulator with a relatively much slower facility simulator. The reservoir simulator was idle for most of the time because the facility simulation was slow. See Ref. [17] for details of the implementation).

There are many well known techniques which can be used to make parallel computation efficient. Many optimization techniques, such as interlacing to create spatial locality for data items needed successively in time, structural blocking to cut the number of integer loads significantly, and vertex reordering for unstructured grids to increase temporal and spatial locality in cache, have been very helpful in HPC ([18], [19]). There are many techniques to reduce cache misses and improve performance; for example, one can reduce traffic from conflict misses by padding arrays to change cache line addressing. Discussions of many optimizations can also be found in the literature ([18], [11], [20], [21]).

Communication related to well calculations is an obstacle towards achieving high performance on a parallel computer. These communications are needed to map completion data from the reservoir to the wells and vice versa. They are irregular in nature and can be quite expensive. Optimized communication algorithms have been used in simulations ([22], [23]) to reduce the overhead of well communications.

3.1 Scaling on Parallel Platforms

The majority of computational tasks in a reservoir simulation can be divided into three main categories: Jacobian building, the solver, and the update of solutions. The solver portion of the computation is the most scalable and Jacobian building is the least scalable among the three components. Sindi et al. [21] studied a 4.26 million cell model with about 56,000 well perforations (a well perforation is the connection of a well to the reservoir). Scalability of that model and also the model studied by Sunaidi [4] are shown in Fig. 7. On the Linux cluster, there is almost 400 times speedup when 512 nodes are used for simulation (because of memory limitations, the minimum number of nodes used for simulation was eight). Cluster technology now provides a much more cost effective solution for reservoir simulation, which a few years earlier could only be done on expensive systems like the CM-5. Irregular communication related to accumulation of well terms in the Jacobian matrix adds considerable overhead to computations on the cluster and prevents the Jacobian computation section from achieving a high scale-up value. The improvement in hardware technology used to build computational platforms in Saudi Aramco over recent years [7] is shown in Fig. 8a. The scalability (shown in Fig. 8b) remained almost the same. One may observe super scaling in large scale reservoir simulation (see, for example, Refs. [22], [23]) because of a memory effect (as domain size decreases, data are stored in cache/memory closer to the processor, which has lower latency, resulting in more than linear/ideal speedup).

(a) CM-5 (Ref. [4])  (b) Linux Cluster (Ref. [21])
Fig 7: Scaling of the POWERS simulator

There are multiple cores on a single CPU. Cores on a particular CPU share resources among themselves, which adds a constraint to their scalability. We show scalability of the model used by Sindi et al. [21] on the Westmere cluster in Fig. 9. The Westmere cluster nodes have two CPUs and each CPU has 6 cores. Fig. 9a shows scalability with the number of cores (using 6 cores per CPU) and Fig. 9b shows normalized computation time on 32 CPUs (with varying numbers of cores). It was observed that the optimal number of cores which gives the highest performance may vary based on model parameters. Brazell et al. presented some scaling results for two commercial reservoir simulators in Ref. [24], but their simulations were limited to a small number of nodes. We studied scalability on larger platforms and achieved very good scalability in current and earlier studies ([7], [21]).
(a) Performance of different computing platforms  (b) Scaling of a simulation run
Fig 8: Computations on different computing platforms

(a) Speedup with cores  (b) Simulation on 32 CPUs
Fig 9: Scalability for number of cores

3.2 Programming Model

The programming model used in a simulation should match the strengths of the architecture. It may be beneficial to use a hybrid model to achieve high performance in simulation. For example, the current version of POWERS on Linux clusters uses MPI for all its communications, while the previous version on the IBM-Nighthawk [23] used hybrid parallelization, with OpenMP directives for shared memory parallelization and MPI calls for distributed memory parallelization. A similar hybrid programming model with MPI and OpenMP was used by Bova et al. [25], where they obtained superior performance using the hybrid programming model compared to a model using only MPI or OpenMP. The initial version of our cluster [23] had only two processors per node, and the scope of parallelization using OpenMP directives was limited. Tests on those clusters indicated that parallelization within a node through OpenMP threads gave little performance improvement. The bus bandwidth and the overhead associated with memory locks destroyed the performance benefits of multi-threading. However, utilizing MPI between the two processors within a node gave more than 25% improvement for some model sizes. That motivated us to use only MPI for all communications on Linux clusters. However, as new architectures evolve and large numbers of cores share resources on a processor, one MPI process per core may not be the best strategy to get good scalability, and it may be better to use some hybrid programming model.
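A minimal sketch of such a hybrid style is shown below: one MPI process per node (or per CPU) exchanges data between distributed memory domains, while OpenMP threads share the work inside each process. This is an illustrative fragment only, not code from POWERS; the array size and the reduction shown are assumptions.

```c
/* Hybrid MPI + OpenMP sketch: compile with e.g. "mpicc -fopenmp hybrid.c" */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define NLOCAL 1000000   /* hypothetical number of cells owned by this rank */

int main(int argc, char **argv)
{
    int provided, rank;
    /* FUNNELED: only the master thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    static double cell[NLOCAL];
    double local_sum = 0.0, global_sum = 0.0;

    /* Intra-node parallelism: OpenMP threads share the loop over local cells. */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < NLOCAL; i++) {
        cell[i] = 1.0;                 /* stand-in for real per-cell work */
        local_sum += cell[i];
    }

    /* Inter-node parallelism: MPI combines the per-process results. */
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %.0f using %d threads per rank\n",
               global_sum, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}
```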

3.3 Domain Decomposition Strategies


Sindi et al. [21] gave simulation results of a reservoir model with grid dimensions 249x104x124 using both 1D and 2D blocking, which are shown in Fig. 10. On 16 processors, 1D blocking is slower than 2D blocking, mainly because of the bigger surface area over which flux vectors are exchanged. On 32 and 64 processors the differences in simulation times are smaller, with 2D blocking still being faster. However, 1D blocking cannot be used on 128 CPUs because there are not enough cells in the I direction (the simulator implementation needed more than one grid cell in any particular direction of the domain assigned to an MPI process). On the other hand, 2D blocking can be used on 128 or a larger number of CPUs, since the computational domain is subdivided in both the I and J directions, which have 249 and 104 cells, respectively. In general, communication volume is higher in a simulation with 1D blocking than with 2D blocking because the surface areas over which fluxes are exchanged among MPI processes are larger in 1D blocking than in 2D blocking. Other factors, such as the implementation of the algorithm in the code and the tools used to parallelize computations, also affect timings for different blocking strategies [26].
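The communication volumes behind this comparison are easy to estimate from the domain geometry. The sketch below counts halo (ghost) cells per process for 1D and 2D blocking of the 249x104x124 grid mentioned above; it uses an interior (worst-case) subdomain and assumed 2D process layouts, so it is an estimate rather than a measurement of Ref. [21].

```c
#include <stdio.h>

/* Halo cells exchanged by an interior subdomain, one cell deep on each face. */
static long halo_1d(long ny, long nz)
{
    return 2L * ny * nz;                 /* two I-faces of size ny*nz */
}

static long halo_2d(long nx, long ny, long nz, long pi, long pj)
{
    long face_i = (ny / pj) * nz;        /* faces normal to I */
    long face_j = (nx / pi) * nz;        /* faces normal to J */
    return 2L * (face_i + face_j);
}

int main(void)
{
    const long nx = 249, ny = 104, nz = 124;   /* grid from Sindi et al. [21] */

    printf("1D blocking, any process count: %ld halo cells per process\n",
           halo_1d(ny, nz));
    /* Assumed 2D process layouts; the actual decomposition in Ref. [21] may differ. */
    printf("2D blocking 4x4  (16 procs):    %ld halo cells per process\n",
           halo_2d(nx, ny, nz, 4, 4));
    printf("2D blocking 16x8 (128 procs):   %ld halo cells per process\n",
           halo_2d(nx, ny, nz, 16, 8));
    return 0;
}
```

Under these assumed layouts, 2D blocking exchanges noticeably fewer halo cells per process at 16 processes and, unlike 1D blocking, still yields valid subdomains at 128 processes, consistent with the timings in Fig. 10.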
Figure 10: 1D and 2D blockings (Ref. [21])

3.4 Inter-processor Communications

3.4.1 High Speed Network

Inter-processor communication is a very important issue in parallel computations. The overhead of such communication limits the performance of parallel computation, and can even degrade performance as additional computing resources (processors) are added to solve the problem. The extent of this constraint depends on the model and the algorithm used in the computation. An embarrassingly parallel kernel, or a very similar algorithm, needs a very small amount of communication and can run efficiently on a parallel computer even with a relatively slow network. However, a typical reservoir simulation needs a significant amount of communication for flux and well calculations. This is true for most CFD problems solving the Navier-Stokes equations [27]. High speed networks are essential for reservoir simulation, especially when a large number of processors is used. Brazell et al. [28] recently used a 10-Gigabit Ethernet network for reservoir simulation using a commercial simulator on a relatively small number of cores. To illustrate the importance of a high speed network, we compare the performance of two reservoir models using Infiniband and Gigabit Ethernet on a much larger cluster. Model parameters for our study are given in Table 1 and simulation timings are given in Fig. 11. As expected, communication overhead increases on larger numbers of cores and simulation using Gigabit Ethernet becomes slower than that using Infiniband. Model II has more well completions than Model I and therefore has higher communication overhead coming from well related calculations. We observe that simulation using the Infiniband network is faster when the number of processors is larger than 16 (we used a single core per CPU to direct inter-processor communication over the network). From these results and also the study done by Brazell et al. [28], it is clear that one can simulate small reservoir models on clusters using Gigabit Ethernet with a small number of processors; however, for large models, the communication overhead associated with Ethernet is likely to be enormous and high performance reservoir simulation becomes infeasible. A high speed network is essential for many CFD HPC simulations ([20], [27]).

Table 1: Model Parameters

            Grid Size (million cells)   Total well perfs (thousands)
Model I     4.26                        56
Model II    2.70                        102

(a) Model I  (b) Model II
Fig 11: Comparison of High Speed Network

3.4.2 Communication Algorithm

3.4.2.1 Well Communications

There are two main types of communications in reservoir simulation. One comes from the calculation of flux vectors, and the other comes from the mapping of well data onto the reservoir and vice versa. There are some other communications during initialization and I/O operations, but the volumes of those communications are much smaller than the flux and well related communications. Flux related communications are easy to analyze.
The amount of communication depends on the surface areas of neighboring domains. Therefore, such communication can be reduced by reducing the surface area of the domains relative to the computational task (which is related to the number of grid cells in the domain); the decomposition strategy should keep the surface to volume ratio of the domains low. Unfortunately, communications related to wells are irregular in nature and can far exceed the communication needed to calculate fluxes. The amount of well communication becomes significant in a high resolution model with a large number of wells (the number of well perforations, i.e., connections to the reservoir, increases as grid blocks become smaller; if a grid block with a well perforation is subdivided into two blocks, the number of well completions in the new model for that grid block may double to provide connections to two grid blocks instead of one in the original case). Habiballah and Hayder ([22], [23]) studied communication algorithms in the POWERS simulator to reduce overall communication overhead. As their results in Fig. 12a show, the well communication overhead of a 200,000 cell model with 400 wells is much higher than the flux communication overhead. This was even after implementation of an optimization algorithm to reduce irregular communication (Fig. 12b). As Killough [16] pointed out, well related computation is a challenge to HPC of reservoir models.

(a) Communications in the model  (b) Optimization of communication algorithm
Fig 12: Communication Overhead
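To make the irregular pattern concrete, the fragment below sketches one common way such an exchange can be organized: each process counts how many completion values it must send to every other process, and a single MPI_Alltoallv call moves the variably sized messages. This is an illustrative sketch only, not the algorithm used in POWERS; the counting arrays and value buffer are hypothetical.

```c
#include <mpi.h>
#include <stdlib.h>

/* Exchange per-completion values between reservoir domains and the ranks  */
/* that own the corresponding wells. "dest" and "val" are hypothetical:    */
/* dest[i] is the rank owning the well of completion i, val[i] its value.  */
void exchange_completions(int ncomp, const int *dest, const double *val,
                          MPI_Comm comm, double **recv_out, int *nrecv_out)
{
    int nproc;
    MPI_Comm_size(comm, &nproc);

    int *scount = calloc(nproc, sizeof(int));
    int *rcount = malloc(nproc * sizeof(int));
    int *sdispl = malloc(nproc * sizeof(int));
    int *rdispl = malloc(nproc * sizeof(int));

    for (int i = 0; i < ncomp; i++) scount[dest[i]]++;      /* message sizes */
    MPI_Alltoall(scount, 1, MPI_INT, rcount, 1, MPI_INT, comm);

    sdispl[0] = rdispl[0] = 0;
    for (int p = 1; p < nproc; p++) {
        sdispl[p] = sdispl[p - 1] + scount[p - 1];
        rdispl[p] = rdispl[p - 1] + rcount[p - 1];
    }
    int nrecv = rdispl[nproc - 1] + rcount[nproc - 1];

    /* Pack outgoing values grouped by destination rank. */
    double *sbuf = malloc((ncomp > 0 ? ncomp : 1) * sizeof(double));
    double *rbuf = malloc((nrecv > 0 ? nrecv : 1) * sizeof(double));
    int *fill = calloc(nproc, sizeof(int));
    for (int i = 0; i < ncomp; i++)
        sbuf[sdispl[dest[i]] + fill[dest[i]]++] = val[i];

    MPI_Alltoallv(sbuf, scount, sdispl, MPI_DOUBLE,
                  rbuf, rcount, rdispl, MPI_DOUBLE, comm);

    *recv_out = rbuf;            /* caller frees the received buffer */
    *nrecv_out = nrecv;
    free(scount); free(rcount); free(sdispl); free(rdispl);
    free(sbuf); free(fill);
}
```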
3.4.2.2 Message Size

There are many MPI collective operations in POWERS. Sindi et al. [21] explored ways to reduce these communication overheads. They were able to reduce the overhead of collective operations by up to 36% by changing MPI calls based on message sizes. Table 2 shows their simulation timings for two reservoir models before and after optimization.

Table 2: MPI collective operations timings

Model   Message Size (MB)   Orig Time (min.)   New Time (min.)   Overhead Reduction
A       360                 74                 47                 36.5%
B       90                  1309               1115               15.0%

3.5 Software Library

The simulation time of a reservoir model depends on both the hardware and the software libraries used during the simulation. As expected, the hardware has a big impact on the simulation time. The same is also true for software libraries. In Fig. 13, we show normalized simulation times of a 270x410x27 model running on 192 cores with three different MPI libraries: Intel MPI (uses the MPI-2 standard), MVAPICH1 (uses the MPI-1 standard) and MVAPICH2 (uses the MPI-2 standard) [Ref. 29]. This model has about 102,000 well completions (i.e., connections with reservoir grid blocks). Fig. 13 shows that Intel MPI and MVAPICH2 (both using the MPI-2 standard) have similar performance, while higher communication overhead causes simulation using MVAPICH1 to be nearly 35% slower. The strength of a particular communication (e.g., MPI) library can be observed in models with a high level of communication. Numerical libraries may also be used in the simulator to achieve high performance during simulation.

Fig 13: Comparison of MPI libraries

3.6 Simulation Acceleration

Simulation can be accelerated by using special hardware like the Graphical Processing Unit (GPU) [30]. There have been several studies of acceleration of reservoir simulation using GPUs ([7], [21], [31], and [32]).
GPUs have been very successful in some areas, but there has been only limited success in reservoir simulation. Sindi et al. [21] studied GPUs for calculations in the POWERS reservoir simulator; although they achieved good speedup for calculations on the GPU over the CPU, they experienced a large data movement overhead between the CPU and GPU. Two important computational kernels, one from the preconditioner routine and the other from the Jacobian calculation, were studied. The first kernel had a reasonable amount of memory transfers and numerical calculations, while the second kernel was lighter on memory transfers. Results of their simulation on NVIDIA's S2050 are shown in Table 3.

Table 3: Timings on GPU

Kernel   CPU time (sec.)   GPU time (sec.)   Speed up   Data Moving Overhead (sec.)
1        336               144               233%       540
2        92                80                13%        19

Appleyard et al. [31] used CUDA to write and accelerate a CPU based serial solver for reservoir simulation. Klie et al. [32] considered a model reservoir simulation problem using an Implicit Pressure Explicit Saturation (IMPES) formulation. A review of present GPU activities indicates that although there is enormous potential in GPU technology, the lack of development tools and the difficulty of programming restrict their adoption in reservoir simulation currently.

3.7 Power Consumption

Power consumption becomes a significant issue as the number of processors in a cluster becomes very large. Sindi et al. [21] studied the effect of diskless HPC computing on reservoir simulation. They estimated that there would be a saving of over 16 kilowatts of power for a computing platform with about 3,000 nodes if the nodes are configured to be diskless rather than maintaining disks in the nodes. They also noticed that simulation jobs ran faster on the diskless setup, because the operating system was resident in memory instead of on disk. As shown in Table 4, there were up to 14% improvements in run time for some of the simulation models.

Table 4: Disk Full vs. Diskless Clusters

Grid Size (million cells)   Disk full Time (min.)   Diskless Time (min.)   Speedup
2.50                        72                      62                     14%
4.25                        54                      48                     11%

3.8 Simulation Grid

Multiple computing platforms can be combined to create a simulation grid to solve a very large problem. In addition, a simulation grid can also be used to increase utilization of the system by reducing fragmentation, i.e., unused processors across clusters can be pooled together to run a particular simulation job which might not have been able to run on the unused processors available on any single cluster. The simulation grid concept has been successful for many applications (for example, see Ref. [33]). One critical issue for such simulation is whether the communication channels provided between clusters are sufficient to support large scale simulation without noticeable degradation in performance. The underlying algorithm in the simulation should be latency tolerant. Barnard et al. [33] used a special implementation of MPI known as MPICH-G on the NASA Information Power Grid to solve a CFD problem. They improved solver latency tolerance by changing the communication algorithm. To test the feasibility of simulation grid computations for reservoir simulation, we connected two of the Saudi Aramco clusters (Westmere clusters using an IB QDR network) with a Qlogic Infiniband switch through free ports in the large Qlogic cluster switches. We were able to run a simulation on a pool of 23 nodes across clusters (11 nodes from one cluster and 12 nodes from another cluster). As shown in Fig. 14, computations on the simulation grid, i.e., across clusters, took about 4% more time than doing all computations within a single cluster. We expect higher overhead for models with higher communication overhead (most likely models with a high number of wells) and for simulations using a larger number of nodes. Connectivity, and hence bandwidth, between processors within a cluster is higher than that across clusters. This will likely cause degradation in performance in the computation of models with large communication volumes on a simulation grid.

Fig 14: Computation on Simulation Grid
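One standard way to make an algorithm more latency tolerant, in the spirit of the changes Barnard et al. [33] describe, is to overlap communication with computation using nonblocking messages: start the halo exchange, update the interior cells that do not depend on remote data, and only then wait for the messages before updating the boundary cells. The fragment below is a generic 1D illustration of that pattern and is not taken from any of the simulators discussed here.

```c
#include <mpi.h>

/* One time step of a 1D domain with one ghost cell on each side.        */
/* u[0] and u[n+1] are ghost cells; u[1..n] are owned by this rank.      */
void step_with_overlap(double *u, double *unew, int n, int left, int right,
                       MPI_Comm comm)
{
    MPI_Request req[4];

    /* 1. Start the halo exchange (nonblocking).                          */
    MPI_Irecv(&u[0],     1, MPI_DOUBLE, left,  0, comm, &req[0]);
    MPI_Irecv(&u[n + 1], 1, MPI_DOUBLE, right, 1, comm, &req[1]);
    MPI_Isend(&u[1],     1, MPI_DOUBLE, left,  1, comm, &req[2]);
    MPI_Isend(&u[n],     1, MPI_DOUBLE, right, 0, comm, &req[3]);

    /* 2. Update interior cells that need no remote data while messages   */
    /*    are in flight; this hides much of the network latency.          */
    for (int i = 2; i <= n - 1; i++)
        unew[i] = 0.25 * u[i - 1] + 0.5 * u[i] + 0.25 * u[i + 1];

    /* 3. Wait for the halo, then update the two boundary cells.          */
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    unew[1] = 0.25 * u[0] + 0.5 * u[1] + 0.25 * u[2];
    unew[n] = 0.25 * u[n - 1] + 0.5 * u[n] + 0.25 * u[n + 1];
}
```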
4. Emerging Trends and Challenges

Convergence of numerical methods often slows as the system size increases and the conditioning becomes worse. Development of new algorithms can overcome these difficulties. Keyes [8] pointed out that there has been tremendous improvement in algorithms over the years in many areas of science. For example, solution of Poisson's equation using the numerical technique developed by Von Neumann and Goldstine in 1947 took on the order of n^7 operations. With the full multigrid technique developed by Brandt in 1984, such a solution can now be obtained with on the order of n^3 operations (see Table 5). It is a challenge to invent a new algorithm; however, the payoff can be huge.

Table 5: Improvements of Algorithms (Ref. [8])

Year   Method        Reference                 Storage   Flops
1947   GE (banded)   Von Neumann & Goldstine   n^5       n^7
1984   Full MG       Brandt                    n^3       n^3

4.1 Computing Platforms

Rapid advances in hardware, software, algorithms and modeling technology have enabled us to study many physical processes which previously we could study only through experiments. Simulation provides a very cost effective alternative to expensive experiments for gaining insight into processes such as those in the reservoir. Nevertheless, taking full advantage of advances in hardware and software is a challenging task.

Dongarra [34] discussed current trends in the architecture of emerging computers. Fig. 15 shows some of the processor architectures discussed by him. To deliver high performance, new processors will have a very high level of concurrency and can be very heterogeneous. Clock speed is not expected to rise because of the physical barriers coming from consumption of too much power and generation of too much heat. Other major issues in these platforms include storage capacity, programmability, reliability of the system, etc. Data movement across the system will likely be the most important factor determining power consumption. It will be important to minimize data movement during simulation in order to keep power consumption at an acceptable level. The disparity between processor performance and memory performance will get worse, making it a challenge for the programmer to achieve peak performance of the system. Innovative ways of hiding latency will be required to get good performance.

Fig 15: Emerging Processor Architectures (Ref. [34])

Nevertheless, emerging systems will provide huge opportunities to solve complex, large problems which are currently not possible. Of course, there will be a need to evaluate and make necessary adjustments to maintain high efficiency of computations [35], which will be challenging.

4.2 Programming Challenges

Dongarra et al. [36] pointed out that although the number of cores in a system will grow rapidly, the computational power of an individual core will likely decrease. As a result, a problem which does not scale beyond a certain number of processors will be slower on the next generation of computers. Therefore, ensuring scalability of the algorithm is essential if it is to be efficient on the next generation of computers. Efficient implementation on multicore processor architectures needs careful consideration, because the caches are shared at one or more levels on the chip, but bandwidth on and off the chip is limited and unlikely to scale with the number of cores. Programming models will need to consider the fact that communication costs between cores in a node are different from communication costs between nodes. Data management will be important, and the programmer will need to manage data locality to achieve performance. Synchronization of a large number of processes will be an important issue to consider. MPI will likely be used for inter-node communication for some time to come, but there will also be a focus on intra-node parallelism approaches and how to integrate that parallelism with MPI communication. Hybrid programming models, for example MPI + OpenMP, MPI + CUDA, etc., are already being used in some simulations. It is likely that new hybrid programming models will be explored for efficient programming on future platforms.
It is expensive to put a very large amount of memory on a processor. Careful use of memory will be needed when a large problem is solved. One technique which one may use is mixed precision in computation, i.e., judiciously storing some variables in single precision instead of double precision without compromising the accuracy of the solution. This should improve the performance of memory bound algorithms, because the bandwidth requirement will decrease with the use of single precision instead of double precision. An example of such computations can be found in Ref. [34]. Some grid blocks in a particular model may not be active in the simulation because some of their limiting physical values, for example cut-off porosity, are below a threshold value. Most of the computations associated with these grid blocks may be eliminated to reduce both memory and computational requirements during simulation. Recently an international effort has been undertaken to build an integrated collection of software known as the extreme-scale/exascale software stack, or X-Stack [37]. Many HPC issues discussed in Ref. [37] should also be applicable to HPC of reservoir simulation on emerging computing platforms.
4.3 System Reliability

System reliability is an important issue in simulation, and several studies have examined failure rates in large HPC platforms. Schroeder and Gibson [38] collected and analyzed failure rate statistics of node outages and storage failures from a few large HPC systems. In terms of complete node outages, they identified that hardware was the single largest component responsible for these outages, with more than 50% of failures assigned to this category, while software was the second largest category with 20%. The remaining portion was related to causes such as human error, environment, network outages, etc. They observed that the number of failures per year per processor varied between 0.1 and 0.65 on those systems. With 0.65 failures per year per processor, a large system they studied had close to 1100 failures per year. A typical application running on that system would be interrupted and forced into recovery more than two times a day with that high level of failure rate. In terms of storage and hard drive failures, Schroeder and Gibson [38] found that the average annual failure and replacement rate for hard drives was up to 5%. This means that in a cluster of 512 nodes, the average failure rate for hard drives is around 1-2 drives every two weeks, which is consistent with our observations of the HPC systems in Saudi Aramco. Since the failure rate of a system grows with the number of processor chips in the system, failure rates in future systems will likely increase ([37], [38]); as a result, a significant portion of the system resources will not be available for applications to do computation [38]. Utilities should be used to identify any system problem and subsequently take preventive measures during resource allocation for a simulation job [39]. At present, we observe a processor failure rate on the Saudi Aramco HPC computing platform close to 0.1 per year per processor (or 0.2 per year per node), which translates into failure of about 1% of simulation jobs. As massive reservoir models are built, there will be a need to use more nodes for a simulation, and in addition, simulations will take a longer time to finish. If one uses four times more nodes and the simulation takes three times as long, which is not an unreasonable estimate for larger jobs, then more than 10% of those jobs will likely fail with the current level of hardware reliability. It is important to have resiliency in the software [37] for recognition of and adaptation to errors to mitigate the hardware reliability issue. It will be important to be able to restart computations from the output of a failed job.

5. Conclusions

It is now possible to build very large high resolution reservoir models to accurately simulate reservoir processes. We expect continued improvements in hardware and software technologies in coming years, which will make it possible to simulate even larger models. It is a challenge to adopt rapidly evolving hardware and software platforms, and to ensure high efficiency in reservoir simulations to solve high resolution giant reservoir models. This will remain a challenge as demand for more accurate simulation results grows. High speed networks are essential for simulation of big models on large numbers of processors. A simulation grid approach may be used to solve very large problems, at least for some models on a limited basis. Communication overhead is likely to limit its usage on a routine basis unless clever latency tolerant algorithms are used.

It is expected that high end systems with 100 million to a billion cores will be built in the near future. It will be difficult to efficiently use systems with such a large number of cores. The level of power consumption, mainly coming from data movement, will be a big concern. The cost of data movement will be a critical factor to consider in designing and implementing algorithms on emerging systems. It is expected that there will be a need to use a hybrid programming model, with MPI likely to be a part of it, at least for the near future. Improvement of algorithms for computations on emerging computing platforms will advance opportunities for HPC in reservoir simulation.

Acknowledgements
The authors would like to thank the Saudi Arabian Oil Company's Management for permission to publish this paper. We thank our colleague Raed Al-Shaikh for his encouragement to examine the simulation grid concept for reservoir studies on HPC platforms at Saudi Aramco. We also thank our colleagues Raed Al-Shaikh, Gordon Tobert and Tofig Dhubaib for reviewing the paper and making many helpful suggestions.

References
[1] Coats, K. H., "Reservoir Simulation: State of the Art", SPE 10020, JPT, Aug 1982, pp. 1633-1642.
[2] Dogru, A. H., "Giga-Cell Simulation", Saudi Aramco Journal of Technology, Spring 2011, pp. 2-8.

[3] Branets, L. V., Ghai, S. S., Lyons, S. L. and Wu, X., 2009. "Challenges and Technologies in Reservoir Modeling", Communications in Computational Physics, 6(1), pp. 1-23.
[4] Sunaidi, H. A., "Advanced Reservoir Simulation Technology for Effective Management of Saudi Arabian Oil Fields", 1998. Presented at the 17th Congress of the World Energy Council, Sept. 1998, Houston, TX.
[5] Dogru, A. H., Li, K. G., Sunaidi, H. A., et al. 2002. "A Parallel Reservoir Simulator for Large Scale Reservoir Simulation", SPE Reservoir Evaluation & Engineering Journal, 6(1), pp. 11-23.
[6] Pavlas, E. J., "Fine-Scale Simulation of Complex Water Encroachment in a Large Carbonate Reservoir in Saudi Arabia", 2002. SPE 79718, SPE Reservoir Evaluation & Engineering, Oct 2002, pp. 346-354.
[7] Baddourah, M., Hayder, M. E., Habiballah, W., et al., 2011. "Application of High Performance Computing in Modeling Giant Fields of Saudi Arabia". SPE 149132, Presented at the 2011 SPE Saudi Arabia Section Technical Symposium and Exhibition, Al-Khobar, Saudi Arabia.
[8] Keyes, D. E., "Algorithms for Extreme Simulation in Science and Engineering", Presented at the 2011 Saudi Arabian HPC Symposium, Al-Khobar, Saudi Arabia, December 6-7, 2011.
[9] Elrafie, E., White, J. P. and Al-Awami, F. H., "The Event Solution: A New Approach for Fully Integrated Studies Covering Uncertainty Analysis and Risk Assessment", Saudi Aramco Journal of Technology, Spring 2009, pp. 53-62.
[10] Huwaidi, M. H., Tyraskis, P. T., Khan, M. S., et al., "PC-Clustering at Saudi Aramco: from Concept to Reality", Saudi Aramco Journal of Technology, Spring 2003, pp. 32-42.
[11] Williams, S., Waterman, A. and Patterson, D., "Roofline: An Insightful Visual Performance Model for Multicore Architectures", 2009. Communications of the ACM, 52(4), pp. 65-76.
[12] Mora, J., "Current R&D hands on work for pre-Exascale HPC systems", Presented at the 2011 Saudi Arabian HPC Symposium, Al-Khobar, Saudi Arabia, December 6-7, 2011.
[13] Amdahl, G. M., "Validity of the single-processor approach to achieving large scale computing capabilities". Proc. Am. Federation of Information Processing Societies Conf., AFIPS Press, 1967, pp. 483-485.
[14] Gustafson, J. L., "Reevaluating Amdahl's Law", 1988. Communications of the ACM, 31(5), pp. 532-533.
[15] Hill, M. and Marty, M., 2008. "Amdahl's Law in the Multicore Era", Computer, 41(7), pp. 33-38.
[16] Killough, J. E., "Will Parallel Computing Ever be Practical", 1993. SPE 25556, Presented at the Middle East Oil Show of SPE, Manama, Bahrain.
[17] Hayder, M. E., Munoz, A. and Al-Shammari, A. 2011. "Facilities Planning Using Coupled Surface and Reservoir Simulation Models". Saudi Aramco Journal of Technology, Fall 2011, pp. 66-71.
[18] Gropp, W. D., Kaushik, D. K., Keyes, D. E. and Smith, B. F., 2001. "Latency, bandwidth, and concurrent issue limitations in high-performance CFD", Computational Fluid and Solid Mechanics, pp. 839-842. Elsevier Science Ltd.
[19] Gropp, W. D., Kaushik, D. K., Keyes, D. E. and Smith, B. F., 2001. "High-Performance parallel implicit CFD", Parallel Computing, 27, pp. 337-362.
[20] Jayasimha, D. N., Hayder, M. E. and Pillay, S. K. 1997. "An Evaluation of Architectural Platforms for Parallel Navier-Stokes Computations", Journal of Supercomputing, 11, pp. 41-60.
[21] Sindi, M., Baddourah, M. and Hayder, M. E., 2011. "Studies of Massive Fields Modeling Using High Performance Computing in the Oil Industry". Saudi Aramco Internal Report, Unpublished.
[22] Habiballah, W. A. and Hayder, M. E. 2003. "Large Scale Parallel Reservoir Simulations on a Linux PC-Cluster". Proceedings of the ClusterWorld Conference and Expo: The HPC Revolution (4th LCI International Conference on Linux Clusters), San Jose, CA.
[23] Habiballah, W. A. and Hayder, M. E., 2004. "Parallel Reservoir Simulation Utilizing PC-Clusters", Saudi Aramco Journal of Technology, Spring 2004, pp. 18-30.
[24] Brazell, O., Medssenger, S., Abusalbi, N. and Fjerstad, P. 2010. "Multi-core evaluation and performance analysis of the ECLIPSE and INTERSECT Reservoir simulation codes". Presented at the 2010 Oil and Gas High Performance Computing Workshop, Rice University, Houston, TX.
[25] Bova, S. W., Breshears, C. P., Cuicchi, C. E., et al., 2000. "Dual-level parallel analysis of harbor wave response using MPI and OpenMP", Int. J. High Performance Comput. Appl., 14, pp. 49-64.
[26] Hayder, M. E., Keyes, D. E. and Mehrotra, P. 1997. "A Comparison of PETSc-Library and HPF Implementations of an Archetypal PDE Computation", Advances in Engineering Software, Vol. 29 (3-6), pp. 415-423.
[27] Hayder, M. E. and Jayasimha, D. N., "Navier-Stokes Simulations of Jet Flows on a Network of Workstations", AIAA Journal, 34(4), pp. 744-749, 1996.
[28] Brazell et al., "Reservoir Simulation Made Simpler and More Efficient with 10 Gigabit Ethernet", ftp://download.intel.com/support/network/sb/reservoir_cs_2010.pdf
[29] http://mvapich.cse.ohio-state.edu/overview/
[30] Owens, J. D., Houston, M., Luebke, D., et al., 2008. "GPU Computing", Proceedings of the IEEE, 96(5), pp. 879-898.
[31] Appleyard, J. R., Appleyard, J. D., Wakefield, M. A. and Desitter, A. L. 2011. "Accelerating Reservoir Simulators using GPU Technology", SPE paper 141265. Proceedings of the 2011 SPE Reservoir Symposium, Woodlands, TX.
[32] Klie, H., Sudan, H., Li, R. and Saad, Y. 2011.


“Exploiting Capabilities of Many Core Platforms in
Reservoir Simulation”. SPE paper 141402.
Proceedings of the 2011 SPE Reservoir Symposium,
Woodlands, TX.
[33] Barnard, S., Biswas, R., Saini, S., et al., 1999.
“Large-scale distributed computational Fluid
dynamics on the Information Power Grid using
Globus”, Proc. of the Seventh Symposium on the
Frontiers of Massively Parallel Computation, pp. 60–
71.
[34] Dongarra, J., “The Past, Present, and Future of
HPC”, Presented at SC2010, New Orleans, LA, Nov.
2010.
[35] Baddourah, M. and Hayder, M. E., “Feasibility of
simulating over billion cell reservoir models on Saudi
Aramco Computing Platforms”, 2012. Saudi Aramco
Internal report, In Preparation.
[36] Dongarra, J., Gannon, D., Fox, G., Kennedy, K.
2007. "The Impact of Multicore on Computational
Science Software," CTWatch Quarterly, 3(1),
February 2007.
[37] Dongarra et al., “The International Exascale Software
Project Roadmap”, 2011. Int. Journal of High
Performance Computing Applications, 25(1) 3–60.
[38] Schroeder, B. and Gibson, G. A., “Understanding
Failures in Petascale Computers”, 2007., SciDAC
2007, Journal of Physics: Conference Series 78 pp.
1-11.
[39] Hayder, M. E., Baddourah, M. A., Nahdi, U. A., et
al., “Experiences in Setting up the Production
Environment on PC Clusters for Reservoir
Simulations at Saudi Aramco”, 2004. SPE SA 29,
Presented at the 2004 SPE Technical Symposium of
Saudi Arabia Section, Dhahran, Saudi Arabia.
