

IEEE TRANSACTIONS ON MAGNETICS, VOL. 48, NO. 2, FEBRUARY 2012, p. 507

Parallel Realization of the Element-by-Element FEM Technique by CUDA

Imre Kiss^1, Szabolcs Gyimóthy^1, Zsolt Badics^2, and József Pávó^1
^1 Budapest University of Technology and Economics, H-1521 Budapest, Hungary
^2 Tensor Research, LLC, Andover, MA 01810 USA

The utilization of Graphical Processing Units (GPUs) for the element-by-element (EbE) finite element method (FEM) is demonstrated. EbE FEM is a long-known technique, by which a conjugate gradient (CG) type iterative solution scheme can be entirely decomposed into computations on the element level, i.e., without assembling the global system matrix. In our implementation, NVIDIA's parallel computing solution, the Compute Unified Device Architecture (CUDA), is used to perform the required element-wise computations in parallel. Since element matrices need not be stored, the memory requirement can be kept extremely low. It is shown that this low-storage but computation-intensive technique is better suited for GPUs than those requiring the massive manipulation of large data sets.

Index Terms: CUDA, EbE FEM, GPU, parallel FEM.

I. INTRODUCTION

NOWADAYS the finite element method (FEM) is one of the most frequently used techniques for the engineering analysis of complex, real-life applications of both linear and non-linear types [1]. Let us consider the linear equation system

    A x = b    (1)

resulting from the FEM approximation of a Partial Differential Equation (PDE). A is the so-called system matrix, x the vector of unknowns, and b on the right hand side (RHS) represents the excitation. For large scale problems the solution of (1) is usually obtained by iterative solvers, like the variants of gradient type methods. To keep generality (not taking advantage of any special property of the system matrix A, except sparsity) a preconditioned bi-conjugate gradient (BiCG) solver will be investigated in this paper [2].

The element-by-element type finite element method (EbE FEM) was constructed originally for low-memory computers [3]. The foundation of the method is the recognition that the assembly of the element matrices into the global system matrix is a linear operation. Therefore certain calculations with the system matrix, like e.g. a matrix-vector product, can be traced back to the level of finite elements, and thus converted to calculations with the individual element matrices A^{(i)}, which appear in the elementary equations having the form

    A^{(i)} x^{(i)} = b^{(i)}    (2)

Most iterative solvers can be decomposed into a sequence of matrix-vector products and inner products of vectors, hence they are suitable for implementation in the EbE context. The natural idea is not to store the element matrices (as traditional methods do with the system matrix), but to recompute them in each iteration. This is possible because iterative solvers (in contrast to direct solvers) do not modify the system matrix during the solution.

The introduced technique can thus be thought of as one which transforms a highly memory-bound problem into a massively computation-bound one, which in turn can be efficiently parallelized [4]. Platforms having massively parallel computing capability (like today's GPUs) can take full advantage of this method.

A. Parallel FEM Implementations on GPUs

Although several methods partially accelerating the FEM computations have already been implemented on GPUs [5]-[7], these usually suffer from a strict design limitation. Namely, these devices can operate only on data that is stored in their on-board memory. In other words, data must be transferred from the system memory to the device's memory prior to any computation.

Large scale FEM problems need large storage capacity for the global system matrix. Consequently, efforts to accelerate only the calculation of the matrix-vector product may conflict with a substantial property of GPUs: the available memory capacity is limited (to a few GBs).

To overcome this problem one may consider different domain decomposition techniques. The decomposition can be conceptualized either in its traditional meaning [8], or in the sense that one simply decomposes the large system matrix into smaller parts that fit into the memory. These sub-matrices are then transferred to the GPU one after the other, and the partial multiplications are performed in parallel.

A major disadvantage of this technique is that although the partial products are performed significantly (by several orders of magnitude) faster this way, the necessary bus transfers cause the overall acceleration to be much more moderate. This disadvantage becomes even more remarkable when multiple GPUs are present. In this case the computing capacity is drastically increased (compared to the single GPU case), but the relative throughput of the data transfers is bounded by the simultaneous needs of the devices.

To achieve better utilization, a remedy needs to be found to avoid costly data transfers. Such a technique would be one which acts entirely on the GPUs.
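As a concrete illustration of the element-level decomposition sketched above, the following minimal example (serial NumPy, not the paper's CUDA code; the tiny 1-D mesh, element matrices, and connectivity are invented for the illustration) computes the product A x without ever assembling A:

```python
import numpy as np

# Invented example: four DoF connected by three 2-node elements (a 1-D mesh).
# conn[i] maps the local DoF of element i to global DoF numbers.
conn = np.array([[0, 1], [1, 2], [2, 3]])
# One small dense element matrix A^(i) per element; in EbE FEM these are
# recomputed on the fly instead of being stored.
A_elem = np.array([[[1.0, -1.0], [-1.0, 1.0]]] * 3)

def ebe_matvec(conn, A_elem, x):
    """Element-by-element A @ x: gather, local product, scatter-accumulate."""
    y = np.zeros_like(x)
    for c, Ae in zip(conn, A_elem):
        y[c] += Ae @ x[c]          # x[c] is the local vector of element DoF
    return y

x = np.array([0.0, 1.0, 2.0, 3.0])
y = ebe_matvec(conn, A_elem, x)

# Cross-check against the explicitly assembled global matrix.
A = np.zeros((4, 4))
for c, Ae in zip(conn, A_elem):
    A[np.ix_(c, c)] += Ae
assert np.allclose(y, A @ x)
```

The scatter-accumulate line `y[c] += ...` is exactly the step that needs atomic updates or coloring when the elements are processed concurrently.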
Manuscript received July 07, 2011; revised October 14, 2011; accepted November 01, 2011. Date of current version January 25, 2012. Corresponding author: I. Kiss (e-mail: ...). Color versions of one or more of the figures in this paper are available online at ...
Digital Object Identifier 10.1109/TMAG.2011.2175905
0018-9464/$31.00 © 2012 IEEE

B. Aim of the Work

As the gap between bus speed and computation density increases, codes which use the accelerator design (i.e. in which

only the computation intensive parts of the program are executed on the GPU) will fall behind codes that take full advantage of it [9], i.e. perform all the necessary computations on the GPU.

The aim of this paper is to show that modern high performance computing platforms (GPUs) offer considerable computing capacity that can be fully utilized only if the applied algorithm fits their specific architecture. This property is in contrast to traditional (multi-)CPU based program design patterns, where the efficiency of an algorithm is simply estimated by its computational complexity.

Relying on the fact that it is cheaper to recompute element matrices than to continuously cache them between the device and the system memory, the EbE FEM technique is revisited here, and its implementation on the CUDA architecture is presented. It is also demonstrated that EbE FEM highly extends the scale of problems that can be solved on devices having limited memory capacity but a massively parallel architecture.

II. CUDA, A MASSIVELY PARALLEL COMPUTING ENVIRONMENT

GPUs were designed for total computational throughput rather than for fast execution of serial calculations. Therefore they have the potential to dramatically speed up computation intensive applications over multi-core CPUs. To achieve high computational throughput, GPUs have hundreds of lightweight cores and execute tens of thousands of threads simultaneously. Programs executed on GPUs are called kernels.

The reason why GPUs can be so effective is the way thread execution is organized. On traditional CPUs the programs are executed for a certain amount of time and then interrupted (time division multiplexing). During the interruption, the CPU has to save the current state of the inner registers, and load a previously saved state for another thread. A detailed overview of this topic can be found in [10], [11].

A. Disassembling Matrix Manipulations to the Element Level

The finite element assembling procedure relies on functions by which the element matrix A^{(i)} and the RHS b^{(i)} of (2) are computed. These functions depend, among others, on the type of the PDE to be solved as well as on the applied shape functions. The computed element matrices and RHS vectors are then assembled to form the global system matrix A and RHS b. Let this assembly step, defined differently for matrices and vectors, be expressed as follows:

    A = \sum_{i \in E} N_i^T A^{(i)} N_i    (3)

    b = \sum_{i \in E} N_i^T b^{(i)}    (4)

where E is the set of elements, and the matrix N_i represents the transition between the local and global numbering of the unknown variables for the i-th element. Contrary to the sparse global system matrix A, the matrix A^{(i)} of size n x n (n being the number of local degrees-of-freedom) is usually dense.

Using the above concept, the matrix-vector product, which is the basis of iterative solvers, can be reformulated in terms of element-wise computations as

    A x = \sum_{i \in E} N_i^T A^{(i)} N_i x = \sum_{i \in E} N_i^T (A^{(i)} x^{(i)}), \quad x^{(i)} = N_i x    (5)

This means that the product of an assembled global matrix and a vector is equivalent to the assembled vector of the elementary matrix-vector products. According to (4) the elementary contributions can be accumulated in a vector, the size of which is equal to the number of global degrees-of-freedom (DoF), hence only vectors have to be stored during the computations. The elementary matrix-vector products in (5) can be computed for each element separately, which enables parallel realization [4].

The other building block of iterative solution methods is the inner product of two DoF-sized vectors. This operation is obviously independent of the mesh structure and connectivity, and its parallel execution is straightforward.

One more advantage of the EbE implementation worth mentioning is that no global numbering of unknowns and finite elements is required at all. This feature can be utilized in several ways, for instance with adaptive mesh generation or mesh reduction techniques [12].

B. Concurrency: Global Updates and Coloring

On shared memory architectures (like the GPUs) an important question is how the partial products are summed. The challenge during a global update is to ensure that different threads do not access the same memory space simultaneously. Such concurrent access is called a race condition and results in an indefinite outcome. Treatment of such cases is traditionally of two kinds. One solution is the atomic update, when the memory space is protected during I/O, causing other threads accessing the same memory place to wait until the operation is fully completed.

The other solution is a kind of coloring of the problem [13], [14]. In this case the mesh is considered as a graph, with the unknown variables (DoF) being the nodes of the graph and the elements representing the connections between them. This graph is then colored in a way that any two elements having the same color do not share a common unknown. Different colors are then processed serially (one after the other), while elements with the same color are processed simultaneously, in parallel.

C. Element-by-Element Formulation of the BiCG Solver

In an EbE implemented BiCG solver the computations can be grouped into so-called EbE steps and DoF steps, respectively. The former refers to the matrix-vector product of (5), while the latter means a vector-vector product or the initialization of variables. The BiCG algorithm requires several auxiliary vectors and complex variables during the iterations. The function of these variables is identical to that outlined in [2, Chapter 2.3.5]. The vectors are DoF-sized, therefore they can be handled (stored) the same way as the vector of unknowns, x.

The way the variables are stored gives the real modularity of the EbE method. Contrary to traditional FEM methods using global numbering, here a dynamic storage structure is used instead. The structure can be thought of as an index array (pointers in the actual implementation) keeping the information on how local unknowns correspond to global ones. This is functionally equivalent to the role of N_i in (3), (4).

Algorithm 1 shows the EbE implemented BiCG, which is functionally equivalent to that presented in [2, Chapter 2.3.5], and is implemented in terms of EbE and DoF iterations. The label "EbE iteration" indicates the computation of the element matrices. To avoid race conditions during global updates, the elements are colored. The iteration goes through all colors serially, and performs the computations on the elements having the actual color in parallel. The label "DoF iteration" indicates the computation of the vector-vector products. This iteration is performed simultaneously on all global unknowns. The label "global update" means that the value of a global variable is affected. To avoid race conditions, atomic updates are used to access global variables.

D. Some Drawbacks of the EbE Implementation

The lack of assembling, which makes the method convenient for GPU parallel execution, also raises several difficulties. The first one is related to preconditioning, which traditionally assumes the system matrix to be in an assembled form. To overcome this problem one can use specific element-by-element preconditioners [14], [15].

In this paper a simple Jacobi preconditioner is used [2], because it can be represented by a diagonal matrix, which can be stored the same way as the DoF-sized auxiliary vectors. The Jacobi preconditioner is implemented as a DoF step (see Algorithm 1, lines 10-11, 31 and 34-35).

The second problem is related to the required extra computations: since element matrices are not stored, they must be recomputed in each iteration, which is obviously redundant when dealing with linear problems. However, this extra computation becomes necessary for non-linear or coupled problems, where a kind of fixed-point iteration technique can be realized this way [12].
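The coloring strategy of Section B can be illustrated with a short sketch (plain Python, serial; the greedy strategy and the toy mesh are illustrative choices, not necessarily the coloring algorithm used by the authors): each element receives the lowest color not yet used by any element sharing one of its DoF, after which all elements of one color can safely be processed in parallel.

```python
from itertools import count

# Toy 2-D mesh: triangles given by their global DoF (node) numbers.
# Invented for the example; any element list works.
elements = [(0, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 5), (0, 2, 4)]

def greedy_coloring(elements):
    """Assign each element the smallest color not used by any element
    sharing a DoF with it, so that same-color elements never touch the
    same global unknown (no race during the scatter-accumulate)."""
    color_of = []
    dof_colors = {}                      # DoF -> set of colors touching it
    for elem in elements:
        used = set().union(*(dof_colors.get(d, set()) for d in elem))
        c = next(c for c in count() if c not in used)
        color_of.append(c)
        for d in elem:
            dof_colors.setdefault(d, set()).add(c)
    return color_of

colors = greedy_coloring(elements)

# Safety check: two elements of equal color never share a DoF.
for i, ei in enumerate(elements):
    for j, ej in enumerate(elements):
        if i != j and colors[i] == colors[j]:
            assert not set(ei) & set(ej)
```

Greedy coloring does not minimize the number of colors, but any valid coloring removes the write conflicts; fewer colors simply means fewer serial kernel launches.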

E. Details of CUDA Implementation

The EbE BiCG method is implemented to run exclusively on the GPUs. After the mesh is read and its appropriate coloring is determined, the point positions and triangulation information are moved to the GPU memory. Both data sets are stored in special, aligned data types to ensure coalesced access during the operations [11]. All computations are carried out using double precision floating point representation.
Each EbE and DoF iteration is performed as a separate kernel call, including global updates. When only one GPU is used, there is no need for synchronization between the kernel calls. In an EbE iteration, the implementing kernel has as many allocated threads as the number of elements having the current color. Element sets of different colors are handled one after the other by calling the same kernel, embedded in a host side iteration through all the colors. Since in a DoF iteration all computations can be carried out simultaneously, the implementing kernels have as many threads as global unknowns. Global updates in DoF iterations are carried out using kernels with internal block summation.

Following the initialization part, the BiCG loops (cf. the INIT and LOOP labels in Algorithm 1) are computed as numerous kernel calls embedded in a host loop. At the end of this loop, the global variable is transferred to the host's memory, and the termination of the loop is decided on the host side. Since no other host side computation is carried out during looping, the CPU load of the algorithm is negligible.
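The host-side structure just described, a loop of device kernels with a single scalar pulled back for the termination test, can be mimicked in a serial sketch (Python; a Jacobi-preconditioned CG is shown instead of the paper's full BiCG, and the mesh, element matrices, and tolerance are invented for the example; each helper stands in for one kernel call):

```python
import numpy as np

# Invented 1-D example mesh: 5 DoF, 4 two-node elements.
conn = np.array([[i, i + 1] for i in range(4)])

def element_matrix(i):
    # Stand-in for the per-element "EbE step" that recomputes A^(i) on the
    # fly each iteration (nothing is cached). Diagonally dominant, so the
    # assembled system is SPD and plain CG applies.
    return np.array([[2.0, -1.0], [-1.0, 2.0]])

def ebe_matvec(x):                      # one "EbE iteration"
    y = np.zeros_like(x)
    for i, c in enumerate(conn):
        y[c] += element_matrix(i) @ x[c]
    return y

def ebe_diagonal(n):                    # Jacobi preconditioner, built EbE
    d = np.zeros(n)
    for i, c in enumerate(conn):
        d[c] += np.diag(element_matrix(i))
    return d

def solve_cg(b, tol=1e-10, max_it=100):
    n = len(b)
    M_inv = 1.0 / ebe_diagonal(n)       # DoF step: diagonal inverse
    x = np.zeros(n)
    r = b - ebe_matvec(x)
    z = M_inv * r
    p = z.copy()
    rz = r @ z
    for _ in range(max_it):             # host loop of "kernel calls"
        Ap = ebe_matvec(p)              # EbE step
        alpha = rz / (p @ Ap)           # DoF steps (inner products)
        x += alpha * p
        r -= alpha * Ap
        if np.sqrt(r @ r) < tol:        # scalar sent back to the host
            break
        z = M_inv * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

b = np.array([1.0, 0.0, 0.0, 0.0, 1.0])
x = solve_cg(b)
assert np.allclose(ebe_matvec(x), b)    # residual check, again via EbE
```

Note that the solver never touches an assembled matrix: only the mesh connectivity and a handful of DoF-sized vectors are stored, which is exactly the memory profile the paper exploits.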

The chosen test problem is a static conduction problem with inhomogeneous conductivity. The equation to be solved is therefore the Laplace equation with spatially varying conductivity. The domain is discretized by tetrahedral elements, and linear nodal shape functions are used. The global unknowns (DoF) are the potential values at the nodes of the mesh. The element matrices are computed using analytical expressions.

To study the accuracy and speed of the proposed method, the Utah Torso model was investigated by solving an ECG forward problem [17] (cf. Fig. 1).
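For linear shape functions on a tetrahedron the shape-function gradients are constant, so the conduction element matrix has a simple closed form, K^{(e)}_{jk} = sigma * V * grad(N_j) . grad(N_k). The sketch below (NumPy; the node coordinates and conductivity value are invented, and this is one common closed form, not necessarily the exact analytical expressions used by the authors) evaluates it:

```python
import numpy as np

def tet_stiffness(p, sigma):
    """4x4 conduction element matrix of a linear tetrahedron.

    p     : (4, 3) array of node coordinates
    sigma : scalar conductivity on the element
    """
    J = (p[1:] - p[0]).T                 # 3x3 Jacobian of the affine map
    vol = abs(np.linalg.det(J)) / 6.0    # tetrahedron volume
    Jinv = np.linalg.inv(J)
    # Rows are the (constant) gradients of shape functions N0..N3;
    # grad N0 = -(grad N1 + grad N2 + grad N3).
    grads = np.vstack([-Jinv.sum(axis=0), Jinv])
    return sigma * vol * grads @ grads.T

# Invented example: unit reference tetrahedron, sigma = 1.
p = np.array([[0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
K = tet_stiffness(p, 1.0)

assert np.allclose(K, K.T)              # symmetric
assert np.allclose(K.sum(axis=1), 0.0)  # a constant potential gives zero current
```

Because the matrix depends only on four node positions and one material value, recomputing it inside every iteration costs little arithmetic per element, which is the trade-off the EbE scheme relies on.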

Fig. 1. Utah torso model. Solid faces correspond to organs.

TABLE I

Run time statistics obtained for several different mesh sizes are shown in Table I. The computations have been carried out on an HP XW8600 workstation, having 8 GB memory, an NVIDIA GTX 590 GPU and a quad-core Intel Xeon X3440 CPU.

As also outlined in [7], the performance of GPU accelerated matrix-vector multiplication (MxV) highly depends on the structure of the matrix, i.e., the distribution and number of non-zero elements. Unlike methods using GPUs only for accelerating the computation of the MxV, the proposed method does not rely on the system matrix, hence no such degradation may occur. This results in a uniform performance irrespective of the domain (matrix) structure.

Another advantage is the memory efficiency. Since sparse matrix storage inherently requires some extra storage overhead (for the row and column information), the efficiency of memory occupancy is limited. On the contrary, the proposed EbE FEM only needs to store the meshing information and the several DoF-sized auxiliary vectors required for the CG iterations.

V. CONCLUSION

In this paper the EbE FEM method is re-formulated to fit the GPU architecture. The method has extremely low memory consumption and can take full advantage of the massively parallel execution environment. Not only does the algorithm outperform traditional CUDA accelerated FEM methods [5]-[7], but it is also competitive with today's multi-CPU based algorithms (Table I shows some run time data of a FEM conduction solver using the in-core version of the Intel MKL PARDISO direct solver for the same meshes). Another advantage over the popular GPU FEM implementations is that the treatable problem size is limited by the amount of meshing information rather than by the number of non-zero elements in the system matrix.

Preliminary multi-GPU results demonstrate the scalability of the method: it can be extended to multiple GPUs, and for larger problems the expected linear scaling could be achieved. This topic, together with the treatment of non-linear problems, will be covered in a forthcoming paper.

ACKNOWLEDGMENT

The work reported in the paper has been developed in the framework of the project "Talent care and cultivation in the scientific workshops of BME". This project was supported by the grant TÁMOP-4.2.2.B-10/1/KMR-2010-0009.

REFERENCES

[1] P. P. Silvester and R. L. Ferrari, Finite Elements for Electrical Engineers. Cambridge, U.K.: Cambridge University Press, 1990.
[2] R. Barrett et al., Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 2nd ed. Philadelphia, PA: SIAM, 1994.
[3] G. F. Carey, E. Barragy, R. McLay, and M. Sharma, "Element-by-element vector and parallel computations," Commun. Appl. Numer. Methods, vol. 4, no. 3, pp. 299-307, 1988.
[4] G. F. Carey and B.-N. Jiang, "Element-by-element linear and nonlinear solution schemes," Commun. Appl. Numer. Methods, vol. 2, no. 2, pp. 145-153, 1986.
[5] J. Bolz, I. Farmer, E. Grinspun, and P. Schröder, "Sparse matrix solvers on the GPU: Conjugate gradients and multigrid," ACM Trans. Graph., vol. 22, pp. 917-924, Jul. 2003.
[6] C. Cecka, A. Lew, and E. Darve, "Introduction to assembly of finite element methods on graphics processors," in IOP Conf. Series: Materials Science and Engineering, 2010, vol. 10, no. 1, p. 012009.
[7] A. Cevahir, A. Nukada, and S. Matsuoka, "Fast conjugate gradients with multiple GPUs," in ICCS 2009, G. G. van Albada, J. Dongarra, and P. Sloot, Eds., 2009, vol. 5544, pp. 893-903.
[8] W. Hackbusch, B. N. Khoromskij, and R. Kriemann, "Direct Schur complement method by domain decomposition based on H-matrix approximation," Comput. Vis. Sci., vol. 8, pp. 179-188, Dec. 2005.
[9] I. Kiss, S. Gyimóthy, and J. Pávó, "Acceleration of moment method using CUDA," COMPEL: Int. J. Comput. Math. Elect. Eng., vol. 31, no. 6, to be published.
[10] T. R. Halfhill, "Parallel processing with CUDA," Microprocessor Report, 2008.
[11] D. Kirk and W.-M. Hwu, Programming Massively Parallel Processors: A Hands-On Approach. San Mateo, CA: Morgan Kaufmann, 2010.
[12] S. Gyimóthy and I. Sebestyén, "Symbolic description of field calculation problems," IEEE Trans. Magn., vol. 34, no. 5, pp. 3427-3430, 1998.
[13] C. Farhat and L. Crivelli, "A general approach to nonlinear FE computations on shared-memory multiprocessors," Comput. Methods Appl. Mech. Eng., vol. 72, no. 2, pp. 153-171, Feb. 1989.
[14] A. J. Wathen, "An analysis of some element-by-element techniques," Comput. Methods Appl. Mech. Eng., vol. 74, no. 3, pp. 271-287, Sep. 1989.
[15] G. Golub and Q. Ye, "Inexact preconditioned conjugate gradient method with inner-outer iterations," SIAM J. Sci. Comput., vol. 21, no. 4, pp. 1305-1320, 2000.
[16] A. Nentchev, "Numerical Analysis and Simulation in Microelectronics by Vector Finite Elements," Ph.D. dissertation, Tech. Univ. Wien, Vienna, 2008.
[17] R. MacLeod, C. Johnson, and P. Ershler, "Construction of an inhomogeneous model of the human torso for use in computational ECG," in Proc. IEEE Eng. Med. Biol. Soc. Annu. Conf., 1991, pp. 688-689.