
Imre Kiss¹, Szabolcs Gyimóthy¹, Zsolt Badics², and József Pávó¹

¹Budapest University of Technology and Economics, H-1521 Budapest, Hungary

²Tensor Research, LLC, Andover, MA 01810 USA

The utilization of Graphical Processing Units (GPUs) for the element-by-element (EbE) finite element method (FEM) is demonstrated. EbE FEM is a long-known technique, by which a conjugate gradient (CG) type iterative solution scheme can be entirely decomposed into computations on the element level, i.e., without assembling the global system matrix. In our implementation NVIDIA's parallel computing solution, the Compute Unified Device Architecture (CUDA), is used to perform the required element-wise computations in parallel. Since element matrices need not be stored, the memory requirement can be kept extremely low. It is shown that this low-storage but computation-intensive technique is better suited for GPUs than those requiring the massive manipulation of large data sets.

Index Terms: CUDA, EbE FEM, GPU, parallel FEM.

Manuscript received July 07, 2011; revised October 14, 2011; accepted November 01, 2011. Date of current version January 25, 2012. Corresponding author: I. Kiss (e-mail: kiss@evt.bme.hu). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TMAG.2011.2175905. 0018-9464/$31.00 (c) 2012 IEEE.

I. INTRODUCTION

NOWADAYS the finite element method is one of the most frequently used techniques for the engineering analysis of complex, real-life applications of both linear and non-linear types [1]. Let us consider the linear equation system of the form

    A x = b    (1)

resulting from the FEM approximation of a Partial Differential Equation (PDE). A is the so-called system matrix, x the vector of unknowns, and b on the right hand side (RHS) represents the excitation. For large scale problems the solution of (1) is usually obtained by iterative solvers, like the variants of gradient type methods. To keep generality (not taking advantage of any special property of the system matrix A, except sparsity) a preconditioned bi-conjugate gradient (BiCG) solver will be investigated in this paper [2].

The element-by-element type finite element method (EbE FEM) was constructed originally for low memory computers [3]. The foundation of the method is based on the recognition that the assembling of element matrices to form the global system matrix is a linear operation. Therefore certain calculations with the system matrix (like e.g. a matrix-vector product) can be traced back to the level of finite elements, thus converted to calculations with the individual element matrices A_e, which appear in the elementary equations having the form

    A_e x_e = b_e.    (2)

Most iterative solvers can be decomposed into a sequence of matrix-vector products and inner products of vectors, hence they are suitable to be implemented in the EbE context. The idea comes naturally not to store the element matrices (as traditional methods do with the system matrix), but to recompute them in each iteration. This is possible because iterative solvers (in contrast to direct solvers) do not affect the system matrix during the solution. This transforms a highly memory dependent problem into a massively computation dependent one, which in turn can be efficiently parallelized [4]. Platforms having massively parallel computing capability (like today's GPUs) can take full advantage of this method.

A. Parallel FEM Implementations on GPUs

Although several methods partially accelerating the FEM computations have already been implemented on GPUs [5]-[7], these usually suffer from a strict design limitation. Namely, these devices can operate only on data that is stored in their on-board memory. In other words, data must be transferred from the system memory to the device's memory prior to any computation.

Large scale FEM problems need large storage capacity for the global system matrix. Consequently, efforts to accelerate only the calculation of the matrix-vector product may conflict with a substantial property of GPUs: the available memory capacity is limited (a few GBs).

To overcome this problem one may consider different domain decomposition techniques. The decomposition can be conceptualized either in its traditional meaning [8], or in the sense that one just decomposes the large system matrix into smaller parts that fit into the memory. These sub-matrices are then transferred to the GPU one after the other, and the partial multiplications are performed in parallel.

A major disadvantage of this technique is that although the partial products are performed significantly (by several orders of magnitude) faster this way, the necessary bus transfers cause the overall acceleration to be much more moderate. This disadvantage becomes even more remarkable when multiple GPUs are present. In this case the computing capacity is drastically increased (compared to the single GPU case), but the relative throughput of the data transfers is bounded by the simultaneous needs of the devices.

To achieve better utilization, a remedy needs to be found to avoid costly data transfers. Such a technique would be one which acts entirely on the GPUs.

B. Aim of the Work

As the gap between bus speed and computation density increases, codes which use the accelerator design (i.e. in which

508 IEEE TRANSACTIONS ON MAGNETICS, VOL. 48, NO. 2, FEBRUARY 2012

only the computation intensive parts of the program are executed on the GPU) will fall behind codes that take full advantage of it [9], i.e. perform all the necessary computations on the GPU.

The aim of this paper is to show that modern high performance computing platforms (GPUs) offer considerable computing capacity that can be fully utilized only if the applied algorithm fits their specific architecture. This property is in contrast to traditional (multi-)CPU based program design patterns, where the efficiency of an algorithm is simply estimated using its computational complexity.

Relying on the fact that it is cheaper to recompute element matrices than to continuously cache them between the device and the system memory, the EbE FEM technique is revisited here, and its implementation on the CUDA architecture is presented. It is also demonstrated that EbE FEM highly extends the scale of problems that can be solved on devices having limited memory capacity but a massively parallel architecture.

II. CUDA, A MASSIVELY PARALLEL COMPUTING ENVIRONMENT

GPUs were designed for total computational throughput rather than for fast execution of serial calculations. Therefore they have the potential to dramatically speed up computation intensive applications over multi-core CPUs. To achieve high computational throughput, GPUs have hundreds of lightweight cores and execute tens of thousands of threads simultaneously. Programs executed on GPUs are called kernels.

The reason why GPUs can be so effective is the way thread execution is organized. On traditional CPUs the programs are executed for a certain amount of time and then interrupted (time division multiplexing). During the interruption, the CPU has to save the current state of the inner registers, and load a previously saved state for another thread. A detailed overview on this topic can be found in [10], [11].

III. IMPLEMENTATION OF EBE FEM ON CUDA

A. Disassembling Matrix Manipulations to the Element Level

The finite element assembling procedure relies on some functions by which the element matrix A_e and the RHS b_e of (2) are computed. These functions depend, among others, on the type of PDE to be solved as well as the applied shape functions. The computed element matrices and RHS vectors are then assembled to form the global system matrix A and RHS b. Let this assembly step be represented by an operator, which is defined differently for matrices and vectors, as follows:

    A = sum_{e in E} P_e^T A_e P_e    (3)

    b = sum_{e in E} P_e^T b_e    (4)

where E is the set of elements, and the matrix P_e represents the transition between the local and global numbering of the unknown variables for the e-th element. Contrary to the sparse global system matrix A, the matrix A_e of size n_e x n_e (n_e being the number of local degrees-of-freedom) is usually dense.

Using the above concept, the matrix-vector product, which is the basis of iterative solvers, can be reformulated in terms of element-wise computations as

    A x = sum_{e in E} P_e^T (A_e P_e x).    (5)

This means that the product of an assembled global matrix and a vector is equivalent to the assembled vector of the elementary matrix-vector products. According to (4) the elementary contributions can be accumulated in a vector, the size of which is equal to the number of global degrees-of-freedom (DoF), hence only vectors have to be stored during the computations. The elementary matrix-vector products in (5) can be computed for each element separately, which enables parallel realization [4].

The other building block of iterative solution methods is the inner product of two DoF-sized vectors. This operation is obviously independent of the mesh structure and connectivity, and its parallel execution is straightforward.

One more advantage of the EbE implementation worth mentioning is that no global numbering of unknowns and finite elements is required at all. This feature can be utilized in several ways, for instance with adaptive mesh generation or mesh reduction techniques [12].

B. Concurrency: Global Updates and Coloring

On shared memory architectures (like the GPUs) an important question is how the partial products are summarized. The challenge during a global update is to ensure that different threads do not access the same memory space simultaneously. Such concurrent access is called a race condition and results in an indefinite outcome. Treatment of such cases is traditionally of two kinds. One solution is the atomic update, where the memory space is protected during I/O, causing other threads accessing the same memory place to wait until the operation is fully completed.

The other solution is a kind of coloring of the problem [13], [14]. In this case the mesh is considered as a graph, with the unknown variables (DoF) being the nodes of the graph and the elements representing the connections between them. This graph is then colored in a way that any two elements having the same color do not share a common unknown. Different colors are then processed serially (one after the other), while elements with the same color are processed simultaneously in parallel.

C. Element-by-Element Formulation of the BiCG Solver

In an EbE implemented BiCG solver the computations can be grouped into so-called EbE steps and DoF steps, respectively. The former refers to the matrix-vector product of (5), while the latter means a vector-vector product or the initialization of variables. The BiCG algorithm requires several auxiliary vectors and complex scalar variables during the iterations. The function of these variables is identical to that outlined in [2, Chapter 2.3.5]. The vectors are DoF-sized, and can therefore be handled (stored) the same way as the vector of unknowns, x.

The way the variables are stored gives the real modularity of the EbE method. Contrary to traditional FEM methods using global numbering, here a dynamic storage structure is used instead. The structure can be thought of as an index array (pointers in the actual implementation) keeping the information how local

KISS et al.: PARALLEL REALISATION OF THE ELEMENT-BY-ELEMENT FEM TECHNIQUE BY CUDA 509

unknowns correspond to global ones. This is functionally equivalent to the role of P_e in (3), (4).

Algorithm 1 shows the EbE implemented BiCG, which is functionally equivalent to that presented in [2, Chapter 2.3.5], and is implemented in terms of EbE and DoF iterations. The label "EbE iteration" indicates the computation of the element matrices. To avoid race conditions during global updates, the elements are colored. The iteration goes through all colors serially, and performs the computations on the elements having the actual color in parallel. The label "DoF iteration" indicates the computation of the vector-vector products. This iteration is performed simultaneously on all global unknowns. The label "global update" means that the value of a global variable is affected. To avoid race conditions, atomic updates are used to access global variables.

D. Some Drawbacks of the EbE Implementation

The lack of assembling, which makes the method convenient for GPU parallel execution, also raises several difficulties. The first one is related to preconditioning, which traditionally assumes the system matrix to be in an assembled form. To overcome this problem one can use specific element-by-element preconditioners [14], [15]. In this paper a simple Jacobi preconditioner is used [2], because it can be represented by a diagonal matrix, which can be stored the same way as the DoF-sized auxiliary vectors. The Jacobi preconditioner is implemented as a DoF step (see Algorithm 1, lines 10-11, 31 and 34-35).

The second problem is related to the required extra computations: since element matrices are not stored, they must be recomputed in each iteration, which is obviously redundant when dealing with linear problems. However, this extra computation becomes necessary for non-linear or coupled problems, where a kind of fixed-point iteration technique can be realized this way [12].
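The two core ingredients above, the element-wise matrix-vector product (5) and the Jacobi diagonal accumulated the same way as (4), can be sketched in a few lines of NumPy. This is an illustrative model of the scheme, not the authors' CUDA code; the connectivity array `conn` and the random element matrices are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
n_dof, n_el, n_loc = 12, 20, 4            # global DoF, elements, local DoF (tetrahedra)

# Local-to-global index array: plays the role of P_e in (3)-(5).
conn = np.array([rng.choice(n_dof, n_loc, replace=False) for _ in range(n_el)])

def element_matrix(e):
    """Recomputed on the fly at every call; no element matrix is stored."""
    rng_e = np.random.default_rng(e)       # deterministic stand-in for the A_e of (2)
    M = rng_e.standard_normal((n_loc, n_loc))
    return M @ M.T + n_loc * np.eye(n_loc)

def ebe_matvec(x):
    """A x = sum_e P_e^T (A_e P_e x), eq. (5): gather, dense multiply, scatter."""
    y = np.zeros(n_dof)
    for e in range(n_el):                  # elements of one color would run in parallel
        np.add.at(y, conn[e], element_matrix(e) @ x[conn[e]])
    return y

def jacobi_diag():
    """Diagonal of A accumulated element-wise into a DoF-sized vector."""
    d = np.zeros(n_dof)
    for e in range(n_el):
        np.add.at(d, conn[e], np.diag(element_matrix(e)))
    return d

# Cross-check against the explicitly assembled global matrix of (3).
A = np.zeros((n_dof, n_dof))
for e in range(n_el):
    A[np.ix_(conn[e], conn[e])] += element_matrix(e)

x = rng.standard_normal(n_dof)
assert np.allclose(ebe_matvec(x), A @ x)
assert np.allclose(jacobi_diag(), np.diag(A))
```

Note that `ebe_matvec` never materializes A: only the DoF-sized accumulator `y` and one dense local matrix at a time exist, which is exactly the low-storage property the method relies on.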

The EbE BiCG method is implemented to run exclusively on the GPUs. After the mesh is read and its appropriate coloring is determined, the point positions and the triangulation information are moved to the GPU memory. Both data sets are stored in special, aligned data types to ensure coalesced access during the operations [11]. All computations are carried out using double precision floating point representation.
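The coloring mentioned above can be produced by a simple greedy pass over the elements: each element takes the smallest color not yet seen at any of its DoF. A minimal sketch (the connectivity `conn` is a made-up stand-in; a production mesher would use a better heuristic to keep the color count low):

```python
import numpy as np

rng = np.random.default_rng(1)
n_dof, n_el, n_loc = 30, 60, 4
conn = np.array([rng.choice(n_dof, n_loc, replace=False) for _ in range(n_el)])

def greedy_element_coloring(conn):
    """Assign each element the smallest color unused by any element sharing
    one of its DoF, so same-color elements touch disjoint memory locations."""
    color = np.full(len(conn), -1)
    dof_colors = [set() for _ in range(conn.max() + 1)]   # colors seen at each DoF
    for e, dofs in enumerate(conn):
        forbidden = set().union(*(dof_colors[d] for d in dofs))
        c = 0
        while c in forbidden:
            c += 1
        color[e] = c
        for d in dofs:
            dof_colors[d].add(c)
    return color

color = greedy_element_coloring(conn)

# The property exploited for race-free updates: within one color, no DoF repeats.
for c in range(color.max() + 1):
    batch = conn[color == c].ravel()
    assert len(batch) == len(set(batch))
```

Scattering the contributions of one color batch therefore needs no atomics; the colors are simply processed one after the other, as described in Section III-B.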

Each EbE and DoF iteration is performed as a separate kernel call, including global updates. When only one GPU is used, there is no need for synchronization during the kernel calls. In an EbE iteration, the implementing kernel has as many allocated threads as the number of elements having the current color. Element sets of different colors are handled one after the other by calling the same kernel, embedded in a host side iteration through all the colors. Since in a DoF iteration all computations can be carried out simultaneously, the implementing kernel has as many threads as global unknowns. Global updates in DoF iterations are carried out using kernels with internal block summations.
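The internal block summation used for such global updates can be modeled as a two-stage reduction: each thread block sums its own slice independently, then the per-block partials are combined. An illustrative NumPy sketch of an inner product reduced this way (the block size is an assumption for the example, not a value from the paper):

```python
import numpy as np

def block_dot(a, b, block=8):
    """Inner product of two DoF-sized vectors reduced in two stages, the way a
    kernel with internal block summation would do it: independent per-block
    partial sums (parallel on a GPU), then a reduction of the partials."""
    prod = a * b
    pad = (-len(prod)) % block                      # zero-pad to whole blocks
    partial = np.pad(prod, (0, pad)).reshape(-1, block).sum(axis=1)
    return partial.sum()                            # second-stage reduction

a = np.arange(100.0)
b = np.ones(100)
assert block_dot(a, b) == a @ b                     # 4950.0
```

On the device the second stage would itself be a (much smaller) parallel reduction or an atomic accumulation; the point is that no thread ever writes another block's partial.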

Following the initialization part, the BiCG loops (cf. the INIT and LOOP labels in Algorithm 1) are computed as numerous kernel calls embedded in a host loop. At the end of this loop, the global variable is transferred to the host's memory, and the termination of the loop is decided on the host side. Since no other host side computation is carried out during looping, the CPU load of the algorithm is negligible.
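Algorithm 1 is not reproduced here, but the host-side structure it implements, a Jacobi-preconditioned BiCG in the form of [2, Chapter 2.3.5], can be sketched as follows. The two `matvec` callbacks stand in for the EbE kernels of (5); everything else is a DoF step. The test matrix and tolerance are made up for the illustration:

```python
import numpy as np

def bicg(matvec, matvec_T, diag, b, tol=1e-10, max_iter=200):
    """Jacobi-preconditioned BiCG after [2, Chapter 2.3.5]; `matvec` and
    `matvec_T` are the only places an EbE kernel would be invoked."""
    n = len(b)
    x = np.zeros(n)                       # INIT
    r = b - matvec(x)
    rt = r.copy()
    p = pt = None
    rho_prev = 1.0
    for _ in range(max_iter):             # LOOP, driven from the host
        z = r / diag                      # Jacobi preconditioner: a DoF step
        zt = rt / diag
        rho = zt @ r                      # global update via reduction
        if p is None:
            p, pt = z.copy(), zt.copy()
        else:
            beta = rho / rho_prev
            p, pt = z + beta * p, zt + beta * pt
        q = matvec(p)                     # EbE step
        qt = matvec_T(pt)                 # EbE step with the transpose
        alpha = rho / (pt @ q)
        x += alpha * p
        r -= alpha * q
        rt -= alpha * qt
        rho_prev = rho
        if np.linalg.norm(r) < tol:       # termination decided on the host
            break
    return x

rng = np.random.default_rng(2)
A = rng.standard_normal((40, 40)) + 40 * np.eye(40)   # well-conditioned test matrix
b = rng.standard_normal(40)
x = bicg(lambda v: A @ v, lambda v: A.T @ v, np.diag(A), b)
assert np.linalg.norm(A @ x - b) < 1e-8
```

Only the scalar residual norm crosses back to the host per iteration, mirroring the paper's observation that the CPU load during looping is negligible.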

IV. RESULTS

The chosen test problem is a static conduction problem with inhomogeneous conductivity. The equation to be solved is therefore the Laplace equation with spatially varying conductivity. The domain is discretized by tetrahedral elements and linear nodal shape functions are used. The global unknowns (DoF) are the potential values at the nodes of the mesh. The element matrices are computed using analytical expressions [16].

To study the accuracy and speed of the proposed method, the Utah Torso model was investigated by solving an ECG forward problem [17] (cf. Fig. 1).
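For linear nodal shape functions on a tetrahedron such an analytical expression is short: with the barycentric-coordinate gradients grad(lam_i) obtained from the vertex coordinates, the element matrix is K_e[i,j] = sigma_e * V_e * grad(lam_i) . grad(lam_j). A sketch of this closed form (the geometry and conductivity below are made up; [16] derives expressions of this kind in general):

```python
import numpy as np

def tet_stiffness(verts, sigma=1.0):
    """P1 element matrix of the conduction problem on one tetrahedron:
    K[i, j] = sigma * V * grad(lam_i) . grad(lam_j)."""
    v = np.asarray(verts, dtype=float)     # 4 x 3 vertex coordinates
    J = (v[1:] - v[0]).T                   # Jacobian of the affine reference map
    V = abs(np.linalg.det(J)) / 6.0        # element volume
    G = np.empty((4, 3))                   # gradients of barycentric coordinates
    G[1:] = np.linalg.inv(J)               # grad(lam_k) = (k-1)-th row of J^{-1}
    G[0] = -G[1:].sum(axis=0)              # the four gradients sum to zero
    return sigma * V * (G @ G.T)

K = tet_stiffness([(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)], sigma=2.0)
assert np.allclose(K.sum(axis=1), 0)       # a constant potential produces no flux
assert np.allclose(K, K.T)                 # symmetric, as required
```

Because the inputs are only vertex coordinates and a conductivity value, this is exactly the kind of kernel that can be recomputed in every iteration instead of stored.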


Fig. 1. Utah torso model. Solid faces correspond to organs.

TABLE I
RESULTS FOR THE UTAH TORSO PROBLEM

Run time statistics obtained for several different mesh sizes are shown in Table I. The computations have been carried out on an HP-XW8600 workstation, having 8 GB memory, an NVIDIA GTX 590 GPU and a quad-core Intel Xeon X3440 CPU.

As also outlined in [7], the performance of the GPU accelerated matrix-vector multiplication (MxV) highly depends on the structure of the matrix, i.e., the distribution and number of non-zero elements. Unlike methods using GPUs only for accelerating the computation of the MxV, the proposed method does not rely on the system matrix, hence no such degradation may occur. This results in a uniform performance irrespective of the domain (matrix) structure.

Another advantage is the memory efficiency. Since sparse matrix storage inherently requires some extra storage overhead (for the row and column information), the efficiency of memory occupancy is limited. On the contrary, the proposed EbE FEM only needs to store the meshing information and several DoF-sized auxiliary vectors required for the CG iterations.

V. CONCLUSION

In this paper the EbE FEM method is re-formulated to fit the GPU architecture. The method has extremely low memory consumption and can take full advantage of the massively parallel execution environment. Not only does the algorithm outperform traditional CUDA accelerated FEM methods [5]-[7], but it is also competitive with CPU based direct solvers (for comparison, Table I also shows some run time data of a FEM conduction solver using the in-core version of the Intel MKL PARDISO direct solver for the same meshes). Another advantage over the popular GPU FEM implementations is that the treatable problem size is limited by the amount of meshing information rather than by the number of non-zero elements in the system matrix.

Preliminary multi-GPU results demonstrate the scalability of the method: it can be extended to multiple GPUs, and for larger problems the expected linear scaling could be achieved. This topic, together with the treatment of non-linear problems, will be covered in a forthcoming paper.

ACKNOWLEDGMENT

The work reported in the paper has been developed in the framework of the "Talent care and cultivation in the scientific workshops of BME" project. This project was supported by the grant TÁMOP-4.2.2.B-10/1/KMR-2010-0009.

REFERENCES

[1] P. P. Silvester and R. L. Ferrari, Finite Elements for Electrical Engineers. Cambridge, U.K.: Cambridge University Press, 1990.
[2] R. Barrett et al., Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 2nd ed. Philadelphia, PA: SIAM, 1994.
[3] G. F. Carey, E. Barragy, R. McLay, and M. Sharma, "Element-by-element vector and parallel computations," Commun. Appl. Numer. Methods, vol. 4, no. 3, pp. 299-307, 1988.
[4] G. F. Carey and B.-N. Jiang, "Element-by-element linear and nonlinear solution schemes," Appl. Num. Meth., vol. 2, no. 2, pp. 145-153, 1986.
[5] J. Bolz, I. Farmer, E. Grinspun, and P. Schröder, "Sparse matrix solvers on the GPU: Conjugate gradients and multigrid," ACM Trans. Graph., vol. 22, pp. 917-924, Jul. 2003.
[6] C. Cecka, A. Lew, and E. Darve, "Introduction to assembly of finite element methods on graphics processors," in IOP Conf. Series: Materials Science and Engineering, 2010, vol. 10, no. 1, p. 012009.
[7] A. Cevahir, A. Nukada, and S. Matsuoka, "Fast conjugate gradients with multiple GPUs," in ICCS 2009, G. G. van Albada, J. Dongarra, and P. Sloot, Eds., 2009, vol. 5544, pp. 893-903.
[8] W. Hackbusch, B. N. Khoromskij, and R. Kriemann, "Direct Schur complement method by domain decomposition based on H-matrix approximation," Comput. Vis. Sci., vol. 8, pp. 179-188, Dec. 2005.
[9] I. Kiss, S. Gyimóthy, and J. Pávó, "Acceleration of moment method using CUDA," COMPEL: Int. J. Comput. Math. Elect. Eng., vol. 31, no. 6, to be published.
[10] T. R. Halfhill, "Parallel processing with CUDA," Microprocessor Report, 2008.
[11] D. Kirk and W.-M. Hwu, Programming Massively Parallel Processors: A Hands-On Approach. San Mateo, CA: Morgan Kaufmann, 2010.
[12] S. Gyimóthy and I. Sebestyén, "Symbolic description of field calculation problems," IEEE Trans. Magn., vol. 34, no. 5, pp. 3427-3430, 1998.
[13] C. Farhat and L. Crivelli, "A general approach to nonlinear FE computations on shared-memory multiprocessors," Comput. Methods Appl. Mech. Eng., vol. 72, no. 2, pp. 153-171, Feb. 1989.
[14] A. J. Wathen, "An analysis of some element-by-element techniques," Comput. Methods Appl. Mech. Eng., vol. 74, no. 3, pp. 271-287, Sep. 1989.
[15] G. Golub and Q. Ye, "Inexact preconditioned conjugate gradient method with inner-outer iterations," SIAM J. Sci. Comput., vol. 21, no. 4, pp. 1305-1320, 2000.
[16] A. Nentchev, "Numerical Analysis and Simulation in Microelectronics by Vector Finite Elements," Ph.D. dissertation, Tech. Univ. Wien, Vienna, 2008.
[17] R. MacLeod, C. Johnson, and P. Ershler, "Construction of an inhomogeneous model of the human torso for use in computational ECG," in IEEE Med. and Biology Society Annu. Conf., 1991, pp. 688-689.
