00 upvotes00 downvotes

10 views45 pagesGPU

Jun 19, 2018

© © All Rights Reserved

PDF, TXT or read online from Scribd

GPU

© All Rights Reserved

10 views

00 upvotes00 downvotes

GPU

© All Rights Reserved

You are on page 1of 45

Benchmarking report

MIKE 2017

DHI headquarters

Agern Allé 5

DK-2970 Hørsholm

Denmark

+45 4516 9200 Telephone

+45 4516 9333 Support

+45 4516 9292 Telefax

mike@dhigroup.com

www.mikepoweredbydhi.com

PLEASE NOTE

protected by copyright. All rights are reserved. Copying or other

reproduction of this manual or the related programmes is

prohibited without prior written consent of DHI. For details please

refer to your ‘DHI Software Licence Agreement’.

LIMITED LIABILITY The liability of DHI is limited as specified in Section III of your

‘DHI Software Licence Agreement’:

‘IN NO EVENT SHALL DHI OR ITS REPRESENTATIVES

(AGENTS AND SUPPLIERS) BE LIABLE FOR ANY DAMAGES

WHATSOEVER INCLUDING, WITHOUT LIMITATION,

SPECIAL, INDIRECT, INCIDENTAL OR CONSEQUENTIAL

DAMAGES OR DAMAGES FOR LOSS OF BUSINESS PROFITS

OR SAVINGS, BUSINESS INTERRUPTION, LOSS OF

BUSINESS INFORMATION OR OTHER PECUNIARY LOSS

ARISING OUT OF THE USE OF OR THE INABILITY TO USE

THIS DHI SOFTWARE PRODUCT, EVEN IF DHI HAS BEEN

ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. THIS

LIMITATION SHALL APPLY TO CLAIMS OF PERSONAL

INJURY TO THE EXTENT PERMITTED BY LAW. SOME

COUNTRIES OR STATES DO NOT ALLOW THE EXCLUSION

OR LIMITATION OF LIABILITY FOR CONSEQUENTIAL,

SPECIAL, INDIRECT, INCIDENTAL DAMAGES AND,

ACCORDINGLY, SOME PORTIONS OF THESE LIMITATIONS

MAY NOT APPLY TO YOU. BY YOUR OPENING OF THIS

SEALED PACKAGE OR INSTALLING OR USING THE

SOFTWARE, YOU HAVE ACCEPTED THAT THE ABOVE

LIMITATIONS OR THE MAXIMUM LEGALLY APPLICABLE

SUBSET OF THESE LIMITATIONS APPLY TO YOUR

PURCHASE OF THIS SOFTWARE.’

MIKE 2017

CONTENTS

MIKE 21 Flow Model FM

Parallelisation using GPU

Benchmarking report

2 Methodology ..................................................................................................................... 2

2.1 GPU Parallelisation .............................................................................................................................. 2

2.2 Hardware .............................................................................................................................................. 2

2.3 Software ............................................................................................................................................... 3

2.4 Performance of the GPU Parallelisation .............................................................................................. 3

3.1 Mediterranean Sea ............................................................................................................................... 5

3.1.1 Description ........................................................................................................................................... 5

3.1.2 Setup .................................................................................................................................................... 5

3.2 Ribe Polder ........................................................................................................................................... 6

3.2.1 Description ........................................................................................................................................... 6

3.2.2 Setup .................................................................................................................................................... 7

3.3 EA2D Test 8A ....................................................................................................................................... 9

3.3.1 Description ........................................................................................................................................... 9

3.3.2 Setup .................................................................................................................................................. 11

3.4 EA2D Test 8B ..................................................................................................................................... 11

3.4.1 Description ......................................................................................................................................... 11

3.4.2 Setup .................................................................................................................................................. 13

4.1 Mediterranean Sea ............................................................................................................................. 14

4.2 Ribe Polder ......................................................................................................................................... 20

4.3 EA2D Test 8A ..................................................................................................................................... 21

4.4 EA2D Test 8B ..................................................................................................................................... 23

5.1 Mediterranean Sea ............................................................................................................................. 26

5.2 Ribe Polder ......................................................................................................................................... 26

5.3 EA2D Test8A ...................................................................................................................................... 27

5.4 EA2D Test8B ...................................................................................................................................... 27

6.1 Mediterranean Sea ............................................................................................................................. 28

6.2 Ribe Polder ......................................................................................................................................... 30

6.3 EA2D Test 8A ..................................................................................................................................... 31

6.4 EA2D Test8B ...................................................................................................................................... 32

7.1 Mediterranean Sea ............................................................................................................................. 33

i

7.2 Ribe Polder ......................................................................................................................................... 35

7.3 EA2D Test 8A ..................................................................................................................................... 35

8 Discussion ...................................................................................................................... 37

9 Conclusions.................................................................................................................... 39

10 References ...................................................................................................................... 40

ii

Vision and Scope

A set of well-defined test cases for the GPU version of the MIKE 21 Flow Model FM, has

been established. These test cases also cover MIKE FLOOD using the GPU version of

the MIKE 21 Flow Model FM for the 2D surface flow calculation. The test-suite is used to

test the performance across platforms with different graphics cards. It is essential that it is

possible to run the simulation with different spatial resolutions to be able to evaluate the

scalability of the parallelisation. The main focus is to benchmark the GPU parallelisation

of the flexible mesh modelling system. For comparison, simulations have also been

performed using the CPU version of MIKE 21 Flow Model FM.

Methodology

2 Methodology

The GPU computing approach uses the computer’s graphics card to perform the

computational intensive calculations. This approach is based on CUDA by NVIDIA and

can be executed on NVIDIA graphics cards with Compute Capability 2.0 or higher.

or multiple GPUs. The multiple GPU approach is based on the domain decomposition

concept, where the communication between the processors is done using MPI (Message

Passing Interface).

the GPU. The additional calculations are for each sub-domain in the domain

decomposition performed locally on the CPU and these calculations are further

parallelised based on the shared memory approach, OpenMP.

As default, the program uses one MPI process per GPU, but it is possible to assign more

processes to the same GPU. In this way simulations, where the hydrodynamic

calculations are less time consuming than the calculations performed in the other

modules, will benefit from the MPI parallelisation.

2.2 Hardware

The benchmarks have been performed using the following hardware platforms and

GPUs:

Operating

Computer Processor Memory GPUs

system

2 x Intel®Xeon® Windows 10 2x

DELL Precision

1 E5-2687W v2 32 GB Enterprise, 64- GeForce

T7610 (workstation)

(8 cores, 3.40 GHz) bit GTX Titan

DELL Precision

2 2687W (8 cores, 3.1 16 GB Enterprise SP1, GTX 1080

T7600 (workstation)

GHz) 64-bit Ti

3 DHI SaaS Intel®Xeon® E5- 488 GB Server 2008 K80

(Instance p2.8xlarge) 2686 v4, 2.3 GHz R2, 64-bit (dual card)

Microsoft Azure

4 E5-2690 v3 234 GB release K80

(Instance NC24) 7.3.1611, 64-bit

(12 cores, 2.60 GHz) (dual card)

Methodology

Single/Double

Number of GPU

Compute Memory Bandwidth precision

GPU CUDA Clock

Capability (GB) (GB/s) floating point

cores (MHz)

performance

3.5 2688 6 288 836

GTX Titan 1.27 Tflops

3.7 2 x 2496 2 x 12 2 x 240 562

(dual card) 2 x 0.94 Tflops

6.1 3584 11 484 1480

GTX 1080 Ti 0.33 Tflops

2.3 Software

All benchmarks have been performed using the MIKE 2017 Release. The CUDA 8.0

library is used in the MIKE 2017 Release. In the present benchmark the NVIDIA graphics

driver 378.78 has been used for hardware platform 1 and the graphics driver is running in

TCC mode. For platform 2 NVIDIA graphics driver 381.65 has been used. For platform 3

NVIDA graphics driver 377.35 has been used and the driver is running with ECC enabled.

For platform 4 NVIDA graphics driver 375.51 has been used and the driver is running with

ECC enabled.

The parallel performance of the GPU version of MIKE 21 Flow Model FM compared to the

CPU version is illustrated by measuring the speedup factor, tCPU(m)/tGPU(n). Here tCPU(m) is

the elapsed time using the existing CPU version (m subdomains and 1 core/thread) and

tGPU(n) is the elapsed time using the new GPU version (n subdomains and 1core/thread for

the CPU part of the calculation). The elapsed time is the total elapsed time (excluding

pre- and post-processing). The performance metric is highly dependent on not only the

GPU hardware but also the CPU hardware. The parallel performance of multi-GPUs is

illustrated by measuring the speedup factor, tGPU(1)/tGPU(n). In the simulations one MPI

process is used per GPU. For the GPU simulations the number of threads per block on

the GPU is 128.

Per default the calculations performed on the GPU are done in double precision.

However, since some GPUs have a significantly lower double precision floating point

performance than single precision floating point performance, it is possible to force the

calculations on the GPU to be performed in single precision. For this reason the

benchmarking has been done using both single and double precision calculations.

Be aware that using single precision calculations will affect the accuracy of the simulation

results, since single precision calculations are less accurate than double precision

calculations.

Methodology

The ratio between the specified theoretical single and double precision floating point

performance is not equal to the actual measured performance ratio between single and

double precision. This becomes evident when comparing the values in the last column of

Table 2.2, where the theoretical single precision performance is a factor 3-4 higher than

the theoretical double precision performance, to the actual measured difference between

single and double precision as presented in the benchmarking below.

Description of Test Cases

This test case has been established for benchmarking of the MIKE 21 Flow Model FM.

3.1.1 Description

In the Western parts of the Mediterranean Sea tides are dominated by the Atlantic tides

entering through the Strait of Gibraltar, while the tides in the Eastern parts are dominated

by astronomical tides, forced directly by the Earth-Moon-Sun interaction.

3.1.2 Setup

The bathymetry is shown in Figure 3.1. Simulations are performed using five meshes with

different resolution (see Table 3.1). The meshes are generated specifying the value for

the maximum area of 0.04, 0.005, 0.00125, 0.0003125 and 0.000078125 degree2,

respectively. The simulation period for the benchmarks covers 2 days starting 1 January

2004 for the simulations using mesh A, B and C. The simulation period is reduced to 6

hours for the simulations using mesh D and 3 hours for mesh E.

At the Atlantic boundary a time varying level boundary is applied. The tidal elevation data

is based on global tidal analysis (Andersen, 1995).

For the bed resistance the Manning formulation is used with a Manning number of 32. For

the eddy viscosity the Smagorinsky formulation is used with a Smagorinsky factor of 1.5.

Tidal potential is applied with 11 components (default values).

The shallow water equations are solved using both the first-order scheme and the higher-

order scheme in time and space.

Mesh Elements Nodes

shape Degree2

Description of Test Cases

The averaged time step for the simulations using Mesh A, B, C, D and E is 17.65s, 5.61s,

2.86s, 1.43s and 0.69s, respectively, for both the first-order scheme and the higher-order

scheme in time and space.

This test case has been established for benchmarking of the MIKE 21 Flow Model FM.

3.2.1 Description

The model area is located, on the southern part of Jutland, Denmark, around the city of

Ribe. The area is protected from storm floods in the Wadden Sea to the west by a dike.

The water course Ribe Å runs through the area and crosses the dike through a lock.

The flood condition where the dike is breached during a storm flood is illustrated by

numerical modelling. The concept applied to model the breach failure in the

hydrodynamic model is based on prescribing the breach by a dynamic bathymetry that

change in accordance with the relation applied for the temporal development of the

breach. Use of this method requires that the location of the breach is defined and known

at an early stage, so that it can be resolved properly and built into the bathymetry. The

shape and temporal development of the breach is defined with a time-varying distribution

along the dike crest. It is further defined how far normal to the crest line the breach can

be felt. Within this distance the bathymetry is following the level of the breach, if the local

level is lower than the breach level no changes are introduced. The area of influence of

the breach will therefore increase with time.

The breach and flood modelling has been carried out based on a historical high water

event (24 November, 1981), shown in Figure 3.2. Characteristic for this event is that high

tide occurs at the same time as the extreme water level. Højer sluice is located about 40

km south of the breach, while Esbjerg is located about 20 km to the north. Based on the

high water statistics for Ribe the extreme high water level has been estimated for an

event having a return period of 10,000 years. The observed water level at Højer is

hereafter adjusted gradually over two tidal cycles to the extreme high water level

estimated for the given return periods at Ribe, as indicated in Figure 3.2. The water level

time series established in this way are shown in Figure 3.2.

Description of Test Cases

Figure 3.2 Runoff from the catchment is included as specified discharges given for the two

streams Ribe Å and Kongeåen

The crossing between the dike and Ribe Å is shown in Figure 3.4. The crossing is in the

form of a navigational chamber lock. It is represented in the model bathymetry as a

culvert that can be closed by a gate. The points defining the dike next to the creek are

modified to have increased levels in order to ensure a well‐defined bathymetry where flow

only occurs through the cells defining the creek proper. The sluice is defined as a check

valve allowing only flow towards the sea.

A constant discharge of 9.384 m 3/s and 14.604 m3/s, respectively, are applied for the two

streams Ribe Å and Kongeåen. For the bed resistance the Manning formulation is used

with a Manning number of 18. For the eddy viscosity the Smagorinsky formulation is used

with a Smagorinsky factor of 0.28.

3.2.2 Setup

The bathymetry is shown in Figure 3.3. The computational mesh contains 173101

elements. A satisfactory resolution of the breach is obtained by a fine mesh of structured

triangles and rectangles as shown in Figure 3.4. The areas in‐ and offshore of the dike is

defined by a relatively fine mesh to avoid instabilities due to humps or holes caused by

large elements with centroids just outside the area of influence from the breach. The

simulation period is 42 hours.

Description of Test Cases

The shallow water equations are solved using both the first-order scheme and higher-

order scheme in space and time.

The averaged time step is 0.21s for both the first-order scheme and higher-order scheme

in space and time.

Description of Test Cases

This test is Test 8A in the benchmarks test developed during the Joint Defra/Environment

Agency research programme. This tests the package’s capability to simulate shallow

inundation originating from a point source and from rainfall applied directly to the model

grid, at a relatively high resolution. This test case has been established for benchmarking

of the MIKE 21 Flow model FM.

3.3.1 Description

The modelled area is approximately 0.4 km by 0.96 km and covers entirely the DEM

provided and shown in Figure 3.5. Ground elevations span a range of ~21m to ~37m.

• a uniformly distributed rainfall event illustrated by the hyetograph in Figure 3.6. This

is applied to the modelled area only (the rest of the catchment is ignored).

• a point source at the location (264896, 664747) (Map projection: British national

grid), and illustrated by the inflow time series in Figure 3.7. (This may for example be

assumed to arise from a surcharging culvert.)

Description of Test Cases

DEM is a 0.5m resolution Digital Terrain Model (no vegetation or buildings) created from

LiDAR data collected on 13th August 2009 and provided by the Environment Agency

(http://www.geomatics-group.co.uk). Model grid resolution should be 2m (or ~97000

nodes in the 0.388 km 2 area modelled).

All buildings at the real location (Cockenzie Street and surrounding streets in Glasgow,

UK) are ignored and the modelling is carried out using the “bare-earth” DEM provided.

pavements; 2) Any other land cover type. Manning’s n = 0.02 is applied for roads and

pavements n = 0.05 everywhere else.

All boundaries in the model area are closed (no flow) and the initial condition is dry bed.

The model is run until time T = 5 hours to allow the flood to settle in the lower parts of the

modelled domain.

Description of Test Cases

3.3.2 Setup

Simulations are performed using four meshes with different resolution (see Table 3.2).

The four meshes uses regular quadrilateral elements with grid spacing 2m, 1m, 0.5m and

0.25m, respectively. Mesh A corresponds to the original mesh used in the EA2D test, and

the additional meshes are obtained by refining this mesh.

Mesh Elements Nodes

shape metres

The shallow water equations are solved using the first-order scheme in time and space.

The averaged time step for the simulation using Mesh A, B, C and D is 0.22s, 0.10s, 0.5s

and 0.027s, respectively.

This test is Test 8B in the benchmarks test developed during the Joint Defra/Environment

Agency research programme. This tests the package’s capability to simulate shallow

inundation originating from a surcharging underground pipe, at relatively high resolution.

This test case has been established for benchmarking of the MIKE Flood using MIKE 21

Flow model FM for the 2d surface flow calculation and MIKE Urban for calculation of the

pipe flow.

3.4.1 Description

The modelled area is approximately 0.4 km by 0.96 km and covers entirely the DEM

provided and shown in Figure 3.8. Ground elevations span a range of ~21m to ~37m.

Description of Test Cases

with invert level uniformly 2m below ground is assumed to run through the modelled area.

An inflow boundary condition is applied at the upstream end of the pipe, illustrated in

Figure 3.9. A surcharge is expected to occur at a vertical manhole of 1m 2 cross-section

located 467m from the top end of the culvert, and at the location (264896, 664747). For

the downstream boundary condition free out fall (critical flow is assumed). The base flow

(uniform initial condition) is 1.6 m/s. The manhole is connected to the grid in one point

and the surface flow is assumed not to affect the manhole outflow.

Figure 3.9 Inflow hydrograph applied for the EA2D Test8B at upstream end of culvert

DEM is a 0.5m resolution Digital Terrain Model (no vegetation or buildings) created from

LiDAR data collected on 13th August 2009 and provided by the Environment Agency

(http://www.geomatics-group.co.uk). Model grid resolution should be 2m (or ~97000

nodes in the 0.388 km 2 area modelled).

The presence of a large number of buildings in the modelled area is taken into account.

Building outlines are provided with the dataset. Roof elevations are not provided.

Description of Test Cases

pavements; 2) Any other land cover type. Manning’s n = 0.02 is applied for roads and

pavements n = 0.05 everywhere else.

All boundaries in the model area are closed (no flow) and the initial condition is dry bed.

The model is run until time T = 5 hours to allow the flood to settle in the lower parts of the

modelled domain.

3.4.2 Setup

Simulations are performed using four meshes with different resolution (see Table 3.3).

The four meshes uses regular quadrilateral elements with grid spacing 2m, 1m, 0.5m and

0.25m, respectively. Mesh A corresponds to the original mesh used in the EA2D test, and

the additional meshes are obtained by refining this mesh.

Mesh Elements Nodes

shape metres

The shallow water equations are solved using the first-order scheme in time and space.

The averaged time step for the simulation using Mesh A, B, C and D is 0.27s, 0.15s,

0.76s and 0.025s, respectively.

Benchmarking using a GeForce GTX Titan Card

These tests have been performed using hardware platform 1 specified in Table 2.1. The

simulations have been performed using one and two GeForce GTX Titan cards (1 and 2

subdomains and 1 thread). For comparison, simulations have also been performed

without GPU acceleration (1 and 16 subdomains and 1 thread).

Table 4.1 Computational time, tGPU(n),using GPU acceleration (1 and 2 subdomains and 1

thread) and speedup factor ,tGPU(1)/tGPU(n). The simulations are carried out using single

precision (SP) and double precision (DP) and using first-order scheme in time and

space

SP DP

No. of

Mesh GPUs Speedup Speedup

n Time (s) Time (s)

Factor Factor

tGPU(n) tGPU(n)

tGPU(1)/tGPU(n) tGPU(1)/tGPU(n)

1 5.02 1 5.27 1

Mesh A

2 5.98 0.83 6.34 0.83

1 29.38 1 36.89 1

Mesh B

2 26.46 1.11 29.55 1.34

1 147.02 1 207.64 1

Mesh C

2 96.04 1.53 126.29 1.64

1 125.35 1 186.56 1

Mesh D

2 67.80 1.84 97.67 1.91

1 503.46 1 765.17 1

Mesh E

2 255.98 1.96 382.45 2.00

Benchmarking using a GeForce GTX Titan Card

Table 4.2 Computational time, tGPU(n),using GPU acceleration (1 and 2 subdomains and 1

thread) and speedup factor ,tGPU(1)/tGPU(n). The simulations are carried out using single

precision (SP) and double precision (DP) and using higher-order scheme in time and

space

SP DP

No. of

Mesh GPUs Speedup Speedup

n Time (s) Time (s)

Factor Factor

tGPU(n) tGPU(n)

tGPU(1)/tGPU(n) tGPU(1)/tGPU(n)

1 11.24 1 12.62 1

Mesh A

2 13.10 0.85 14.21 0.88

1 90.22 1 118.44 1

Mesh B

2 68.32 1.32 81.63 1.45

1 558.93 1 783.70 1

Mesh C

2 331.30 1.68 428.26 1.82

1 518.49 1 717.68 1

Mesh D

2 264.90 1.95 372.80 1.92

1 2164.98 1 3071.29 1

Mesh E

2 1095.47 1.97 1518.02 2.02

Benchmarking using a GeForce GTX Titan Card

Figure 4.1 Speedup factor, tGPU(1)/tGPU(n), for two GPUs relative to a single GPU using first-order

scheme in time and space. Blue line: single precision; Black line: double precision;

Red line: Ideal speedup factor

Figure 4.2 Speedup factor, tGPU(1)/tGPU(n), for two GPUs relative to a single GPU using higher-

order scheme in time and space. Blue line: single precision; Black line: double

precision; Red line: Ideal speedup factor

Benchmarking using a GeForce GTX Titan Card

Table 4.3 Computational time, tCPU(n), using no GPU acceleration (1 and 16 subdomains and 1

thread) and speedup factor, tCPU(1)/tCPU(n). The simulations are carried out using first-

order scheme and higher-order scheme in time and space

First-order Higher-order

No. of

Mesh domains Speedup Speedup

n Time (s) Time (s)

Factor Factor

tCPU(n) tCPU(n)

tCPU(1)/tCPU(n) tCPU(1)/tCPU(n)

1 57.87 1 174.61 1

Mesh A

16 5.73 10.09 15.27 11.43

1 1253.81 1 4848.75 1

Mesh B

16 94.07 13.32 314.16 15.34

1 10866.48 1 37241.11 1

Mesh C

16 844.58 12.86 2818.74 13.21

1 10917.15 1 37830.65 1

Mesh D

16 964.27 11.32 3406.35 11.10

Table 4.4 Speedup factors, tCPU(1)/tGPU(n ) and tCPU(16)/tGPU(n ). The simulations are carried out

using single precision (SP) and double precision (DP) and using first-order scheme in

time and space

No. of tCPU(1)/tGPU(n) tCPU(16)/tGPU(n)

Mesh GPUs

n

SP DP SP DP

Mesh A

2 9.67 9.12 0.95 0.90

Mesh B

2 47.38 42.43 3.55 3.18

Mesh C

2 113.41 86.04 8.79 6.68

Mesh D

2 161.09 111.77 14.22 9.87

Benchmarking using a GeForce GTX Titan Card

Table 4.5 Speedup factors, tCPU(1)/tGPU(n ) and tCPU(16)/tGPU(n). The simulations are carried out

using single precision (SP) and double precision (DP) and using higher-order

scheme in time and space

No. of tCPU(1)/tGPU(n) tCPU(16)/tGPU(n)

Mesh GPUs

n SP DP SP DP

Mesh A

2 13.32 12.28 1.36 1.07

Mesh B

2 70.97 59.39 4.49 3.84

Mesh C

2 112.40 86.95 8.50 6.58

Mesh D

2 142.81 101.47 12.85 9.13

Benchmarking using a GeForce GTX Titan Card

Figure 4.3 Speedup factor, tCPU(1)/tGPU(n ), for one and two GeForce GTX Titan card using first-

order scheme. Blue line: single precision; Black line: double precision and solid line:

1 GPU and dash line: 2 GPU

Figure 4.4 Speedup factor, tCPU(1)/tGPU(n ), for one and two GeForce GTX Titan card using higher-

order scheme. Blue line: single precision; Black line: double precision and solid line:

1 GPU and dash line: 2 GPU

Benchmarking using a GeForce GTX Titan Card

Table 4.6 Computational time, tGPU(n), using GPU acceleration (1 and 2 subdomains and 1

thread) and speedup factor ,tGPU(1)/tGPU(n). The simulations are carried out using single

precision (SP) and double precision (DP) and using first-order scheme in time and

space

SP DP

No. of

Mesh GPUs Speedup Speedup

n Time (s) Time (s)

Factor Factor

tGPU(n) tGPU(n)

tGPU(1)/tGPU(n) tGPU(1)/tGPU(n)

1 1920.08 1 2127.53 1

Mesh A

2 1355.13 1.41 1563.44 1.36

Table 4.7 Computational time, tGPU(n) using GPU acceleration (1 and 2 subdomains and 1

thread) and speedup factor ,tGPU(1)/tGPU(n). The simulations are carried out using single

precision (SP) and double precision (DP) and using higher-order scheme in time and

space

No. of

Mesh GPUs Speedup Speedup

n Time (s) Time (s)

Factor Factor

tGPU(n) tGPU(n)

tGPU(1)/tGPU(n) tGPU(1)/tGPU(n)

1 3983.48 1 4573.40 1

Mesh A

2 2920.90 1.36 3405.28 1.34

Table 4.8 Computational time, tCPU(n), using no GPU acceleration (1 and 16 domains and 1

thread) and speedup factor, tCPU(1)/tCPU(n). The simulations are carried out using first-

order scheme and higher-order scheme in time and space

First-order Higher-order

No. of

Mesh domains Speedup Speedup

n Time (s) Time (s)

Factor Factor

tCPU(n) tCPU(n)

tCPU(1)/tCPU(n) tCPU(1)/tCPU(n)

1 24071.79 1 92536.13 1

Mesh A

16 3387.64 7.10 15621.08 5.92

Benchmarking using a GeForce GTX Titan Card

Table 4.9 Speedup factors, tCPU(1)/tGPU(n ) and tCPU(16)/tGPU(n). Simulations are carried out using

single precision (SP) and double precision (DP) and using first-order scheme in time

and space

No. of tCPU(1)/tGPU(n) tCPU(16)/tGPU(n)

Mesh GPUs

n

SP DP SP DP

Mesh A

2 17.76 15.39 2.49 2.16

Table 4.10 Speedup factors, tCPU(1)/tGPU(n ) and tCPU(16)/tGPU(n). Simulations are carried out using

single precision (SP) and double precision (DP) and higher-order scheme in time and

space

No. of tCPU(1)/tGPU(n) tCPU(16)/tGPU(n)

Mesh GPUs

n

SP DP SP DP

Mesh A

2 31.68 27.17 5.34 4.58

Table 4.11 Computational time, tGPU(n), using GPU acceleration (1 and 2 subdomains and 1

thread) and speedup factor ,tGPU(1)/tGPU(n). The simulations are carried out using single

precision (SP) and double precision (DP) and using first-order scheme in time and

space

No. of

Mesh GPUs Speedup Speedup

n Time (s) Time (s)

Factor Factor

tGPU(n) tGPU(n)

tGPU(1)/tGPU(n) tGPU(1)/tGPU(n)

1 115.97 1 141.78 1

Mesh A

2 99.35 1.16 111.16 1.27

1 631.94 1 803.02 1

Mesh B

2 415.50 1.52 508.99 1.57

1 3590.63 1 4704.55 1

Mesh C

2 2084.08 1.72 2711.03 1.73

1 27741.35 1 34994.81 1

Mesh D

2 13910.62 1.99 18216.11 1.92

Benchmarking using a GeForce GTX Titan Card

Figure 4.5 Speedup factor, tGPU(1)/tGPU(n), for two GPUs relative to a single GPU. Blue line:

single precision; Black line: double precision; Red line: Ideal speedup factor

Table 4.12 Computational time, tCPU(n), using no GPU acceleration (1 and 16 domains and 1

thread) and speedup factor, tCPU(1)/tCPU(n). The simulations are carried out using first-

order scheme in time and space

No. of Speedup

Time (s)

Mesh domains Factor

tCPU(n)

n tCPU(1)/tCPU(n)

1 1494.00 1

Mesh A

16 157.48 9.48

1 14751.77 1

Mesh B

16 1791.74 8.23

1 106463.08 1

Mesh C

16 14514.62 8.10

Benchmarking using a GeForce GTX Titan Card

Table 4.13 Speedup factors, tCPU(1)/tGPU(n ) and tCPU(16)/tGPU(n). Simulations are carried out using

single precision (SP) and double precision (DP) and using first-order scheme in time

and space

No. of tCPU(1)/tGPU(n) tCPU(16)/tGPU(n)

Mesh GPUs

n

SP DP SP DP

Mesh A

2 15.03 13.43 1.58 1.41

Mesh B

2 35.50 28.98 4.30 3.52

Mesh C

2 51.08 39.27 6.96 5.35

Table 4.14 Computational time, tGPU(n), using GPU acceleration (1 and 2 subdomains and 1

thread) and speedup factor ,tGPU(1)/tGPU(n). The simulations are carried out using single

precision (SP) and double precision (DP) and using first-order scheme in time and

space

SP DP

No. of

Mesh GPUs Speedup Speedup

Time (s) Time (s)

n Factor Factor

tGPU(n) tGPU(n)

tGPU(1)/tGPU(n) tGPU(1)/tGPU(n)

1 99.42 1 122.11 1

Mesh A

2 84.04 1.18 94.80 1.28

1 449.79 1 484.17 1

Mesh B

2 266.65 1.68 296.16 1.63

1 2038.29 1 2445.55 1

Mesh C

2 1120.32 1.81 1275.39 1.91

1 11673.43 1 14927.29 1

Mesh D

2 5822.06 2.00 7179.54 2.07

Benchmarking using a GeForce GTX Titan Card

Figure 4.6 Speedup factor, tGPU(1)/tGPU(n), for two GeForce GTX Titan cards relative to a single

GeForce GTX Titan card. Blue line: single precision; Black line: double precision;

Red line: Ideal speedup factor

Table 4.15 Computational time, tCPU(n), using no GPU acceleration (1 and 16 domains and 1

thread) and speedup factor, tCPU(1)/tCPU(n). The simulations are carried out using first-

order scheme in time and space

No. of Speedup

Time (s)

Mesh domains Factor

tCPU(n)

n tCPU(1)/tCPU(n)

1 762.43 1

Mesh A

16 107.07 7.12

1 6322.43 1

Mesh B

16 953.09 6.63

1 47032.28 1

Mesh C

16 8434.83 5.57

Benchmarking using a GeForce GTX Titan Card

Table 4.16 Speedup factors, tCPU(1)/tGPU(n ) and tCPU(16)/tGPU(n). Simulations are carried out using

single precision (SP) and double precision (DP) and using first-order scheme in time

and space

No. of tCPU(1)/tGPU(n) tCPU(16)/tGPU(n)

Mesh GPUs

n

SP DP SP DP

Mesh A

2 9.07 8.04 1.27 1.12

Mesh B

2 23.71 21.34 3.57 3.21

Mesh C

2 41.98 36.87 7.52 6.61

Benchmarking using a GeForce GTX 1080 Ti

These tests have been performed using hardware platform 2 specified in Table 2.1. The

simulations have been performed using a single GeForce GTX 1080 Ti card (1

subdomain and 1 thread).

Table 5.1 Computational time, tGPU(n), using GPU acceleration (1 domain and 1 thread). The

simulations are carried out using single precision (SP) and double precision (DP) and

using first-order scheme and higher-order scheme in time and space

Time (s)

tGPU(1)

Mesh First-order Higher-order

SP DP SP DP

Table 5.2 Computational time, tGPU(n),using GPU acceleration (1 domain and 1 thread). The

simulations are carried out using single precision (SP) and double precision (DP) and

using first-order scheme and higher-order scheme in time and space

Time (s)

tGPU(1)

Mesh

First-order Higher-order

SP DP SP DP

Benchmarking using a GeForce GTX 1080 Ti

Table 5.3 Computational time, tGPU(n), using GPU acceleration (1 domain and 1 thread). The

simulations are carried out using single precision (SP) and double precision (DP) and

using first-order scheme in time and space

Time (s)

tGPU(1)

Mesh

SP DP

Mesh D - 22745.63

Table 5.4 Computational time, tGPU(n), using GPU acceleration (1 domain and 1 thread). The

simulations are carried out using single precision (SP) and double precision (DP) and

using first-order scheme in time and space

Time (s)

tGPU(1)

Mesh

SP DP

Mesh D - -

Benchmarking using a Tesla K80 on MIKE SaaS

These tests have been performed using a GPU instance of MIKE Powered by DHI

Software as a Service specified as hardware platform 3 in Table 2.1. The simulations

have been performed using Tesla K80 cards (multi subdomains and 1 thread).

Table 6.1 Computational time, tGPU(n), using GPU acceleration (1 domain and 1 thread). The

simulations are carried out using single precision (SP) and double precision (DP) and

using first-order scheme in time and space

Time (s)

tGPU(1)

Mesh

SP DP

Table 6.2 Computational time, tGPU(n), using GPU acceleration (multi domains and 1 thread) and

speedup factor ,tGPU(1)/tGPU(n). The simulations are carried out using single precision

(SP) and double precision (DP) and using higher-order scheme in time and space

No. of

Mesh GPUs Speedup Speedup

n Time (s) Time (s)

Factor Factor

tGPU(n) tGPU(n)

tGPU(1)/tGPU(n) tGPU(1)/tGPU(n)

1 11.85 1 13.73 1

Benchmarking using a Tesla K80 on MIKE SaaS

No. of

Mesh GPUs Speedup Speedup

n Time (s) Time (s)

Factor Factor

tGPU(n) tGPU(n)

tGPU(1)/tGPU(n) tGPU(1)/tGPU(n)

1 104.53 1 151.12 1

1 650.18 1 1009.37 1

1 585.02 1 924.44 1

1 2385.04 1 4937.59 1

Benchmarking using a Tesla K80 on MIKE SaaS

Figure 6.1 Speedup factor, tGPU(1)/tGPU(n), using double precision and higher-order scheme in

time and space. Pink line: Mesh A; Light blue line: Mesh B; Black line: Mesh C; Blue

line: Mesh D; Green line: Mesh E; Red line: Ideal speedup factor

Table 6.3 Computational time, tGPU(n),using GPU acceleration (1 domain and 1 thread). The

simulations are carried out using single precision (SP) and double precision (DP) and

using first-order scheme and higher-order scheme in time and space

Time (s)

tGPU(1)

Mesh

First-order Higher-order

SP DP SP DP

Benchmarking using a Tesla K80 on MIKE SaaS

Table 6.4 Computational time, tGPU(n), using GPU acceleration and 1 thread and speedup factor

,tGPU(1)/tGPU(n). The simulations are carried out using single precision (SP) and double

precision (DP) and using first-order scheme in time and space

No. of

Mesh GPUs Speedup Speedup

n Time (s) Time (s)

Factor Factor

tGPU(n) tGPU(n)

tGPU(1)/tGPU(n) tGPU(1)/tGPU(n)

1 137.87 1 161.3 1

1 691.93 1 873.54 1

1 3963.79 1 5213.29 1

1 29312.43 1 38412.53 1

Benchmarking using a Tesla K80 on MIKE SaaS

Figure 6.2 Speedup factor, tGPU(1)/tGPU(n), using double precision and first-order scheme in time

and space. Pink line: Mesh A; Light blue line: Mesh B; Black line: Mesh C; Blue line:

Mesh D; Green line: Mesh E; Red line: Ideal speedup factor

Table 6.5 Computational time, tGPU(n), using GPU acceleration (1 domain and 1 thread). The

simulations are carried out using single precision (SP) and double precision (DP) and

using first-order scheme in time and space

Time (s)

tGPU(1)

Mesh

SP DP

Benchmarking using a Tesla K80 on Microsoft Azure

These tests have been performed using a GPU instance of Microsoft Azure specified as

hardware platform 4 in Table 2.1. The simulations have been performed using Tesla K80

cards (multi subdomains and 1 thread).

Table 7.1 Computational time, tGPU(n), using GPU acceleration (1 domain and 1 thread). The

simulations are carried out using double precision (DP) and using first-order scheme

in time and space

Time (s)

Mesh

tGPU(1)

Mesh A 9.65

Mesh B 52.03

Mesh C 282.89

Mesh D 254.52

Mesh E 1226.95

Table 7.2 Computational time, tGPU(n), using GPU acceleration (multi domains and 1 thread) and

speedup factor ,tGPU(1)/tGPU(n). The simulations are carried out using double precision

(DP) and using higher-order scheme in time and space

No. of Speedup

Time (s)

Mesh GPUs Factor

tGPU(n)

n tGPU(1)/tGPU(n)

1 18.86 1

4 25.3 0.74

1 157.87 1

4 106.26 1.48

1 1056.56 1

4 362.11 2.91

Benchmarking using a Tesla K80 on Microsoft Azure

No. of Speedup

Time (s)

Mesh GPUs Factor

tGPU(n)

n tGPU(1)/tGPU(n)

1 948.59 1

4 294.82 3.21

1 5015.65 1

4 1073.62 4.67

Figure 7.1 Speedup factor, tGPU(1)/tGPU(n), using double precision and higher-order scheme in

time and space. Pink line: Mesh A; Light blue line: Mesh B; Black line: Mesh C; Blue

line: Mesh D; Green line: Mesh E; Red line: Ideal speedup factor

Benchmarking using a Tesla K80 on Microsoft Azure

Table 7.3 Computational time, tGPU(n),using GPU acceleration (1 domain and 1 thread). The

simulations are carried out using double precision (DP) and using first-order scheme

and higher-order scheme in time and space

Time (s)

tGPU(1)

Mesh

Higher-

First-order

order

Table 7.4 Computational time, tGPU(n), using GPU acceleration and 1 thread and speedup factor

,tGPU(1)/tGPU(n). The simulations are carried out using double precision (DP) and using

first-order scheme in time and space

No. of Speedup

Time (s)

Mesh GPUs Factor

tGPU(n)

n tGPU(1)/tGPU(n)

1 165.14 1

4 162.03 1.01

1 879.11 1

4 551.8 1.59

1 5239.08 1

4 1757.38 2.98

1 43723.99 1

4 10991.37 3.97

Benchmarking using a Tesla K80 on Microsoft Azure

Figure 7.2 Speedup factor, tGPU(1)/tGPU(n), using double precision and first-order scheme in time

and space. Pink line: Mesh A; Light blue line: Mesh B; Black line: Mesh C; Blue line:

Mesh D; Green line: Mesh E; Red line: Ideal speedup factor

Discussion

8 Discussion

The performance strongly depends on the graphics card. For simulations having a small

number of elements in the computational grid, the best performance is obtained with the

GeForce GTX Titan card while for simulations with a large number of elements the best

performance is obtained with the GeForce GTX 1080 Ti card. Since the Tesla K80 is a

dual card consisting of two GPUs, it can be seen as one entity having two GPUs or it can

be thought of as two separate GPUs. In this report the latter have been chosen. This

approach eliminates the time spent on domain decomposition and communication

between subdomains from the comparison. However, if considering both GPUs on the

Tesla K80 card as one entity, then the Tesla K80 card and the GeForce GTX 1080 Ti

card show similar performance for simulations having a large number of elements. This is

illustrated in Figure 8.1, which shows the speedup for the various graphics cards tested in

this report compared to the performance of the GeForce GTX Titan card.

Considering double precision calculations, the GeForce GTX 1080 Ti card is for very

large meshes up to 36% faster than the GeForce GTX Titan and up to 60% faster than

the Tesla K80 card (using only 1 of 2 GPUs on the card). This might seem odd, since the

Tesla line of graphics cards are designed for numerical calculations whereas the

GeForce cards are primarily used for gaming. The explanation is found in the hardware

specifications for the various cards, where the GeForce GTX 1080 Ti is superior in a

number of specifications compared to the other cards, see Table 2.2. When utilising the

entire Tesla K80 card (both GPUs) the Tesla K80 is up to 37% faster than a single

GeForce GTX Titan card and around the same performance as the GeForce GTX 1080

Ti.

Figure 8.1 Comparison of the performance using different graphic cards. The speedup factor is

the ratio of the computational time using the GeForce GTX Titan Card to the

computational time using other cards. The Mediterranean Sea case with higher-order

scheme in time and space. Black line: GeForce GTX 1080 Ti card; Blue line: Tesla

K80 card (1 GPU); Green line: Tesla K80 (2 GPU)

Discussion

efficient than parallelisation using GPU acceleration when the number of elements is

small. When the number of processors is increased for fixed problem size the efficiency

will decrease. The suboptimal speedups can be explained by the workload imbalance

and a high communication overhead. For large problems the efficiency will decrease for

increasing problem size due to the increase in memory access time. This means that for

large problems the use of parallelisation utilising GPU acceleration will significantly

reduce the computational time compared to using pure MPI parallelisation. Especially

when using multiple GPUs. This conclusion is of course very dependent on both the CPU

and GPU hardware considered.

When the number of wet elements in the considered problem is sufficiently high, it is

possible to obtain nearly ideal speed-up using multiple GPUs relative to using 1 GPU.

Depending on the considered problem it is even possible to get superlinear speed-up. Of

course the communication overhead increases when using multiple GPUs, but as long as

the problem is large enough for each GPU to have full work-load the scalability over

multiple GPUs is very good.

The Tesla K80 card has been tested on hardware platform 3 (Windows) and hardware

platform 4 (Linux), and the card shows similar performance for the two platforms.

Conclusions

9 Conclusions

The overall conclusions of the benchmarks are

• The numerical scheme and the implementation of the GPU version of the MIKE 21

Flow Model FM are identical to the CPU version of MIKE 21 Flow Model FM.

Simulations without flooding and drying produces identical results using the two

versions. Simulations with extensive flooding and drying produce results that may

contain small differences.

• The performance of the GPU version of MIKE 21 Flow Model FM depends highly on

the graphics card and the model setup. When evaluating the performance by

comparing with a single core (no parallelisation) CPU simulation the performance

also depend highly on the specifications for the CPU.

• The speedup factor of simulations with no flooding and drying increases with

increasing number of elements in the computational mesh. When the number of

elements becomes larger than approximately 400.000 then there is only a very

limited increase in the speedup factor for increasing number of elements.

• The use of multi-GPU shows excellent performance. To get the optimal speedup

factor a large number of elements is required for each sub-domain.

improve the performance compared to the use of pure MPI.

References

10 References

/1/ Andersen, O.B., 1995, Global ocean tides from ERS-1 and TOPEX/POSEIDON

altimetry, J. Geophys Res. 100 (C12), 25,249-25,259.

/2/ Bo Brahtz Christensen, Nils Drønen, Peter Klagenberg, John Jensen, Rolf

Deigaard and Per Sørensen, 2013, Multiscale modelling of coastal flooding,

Coastal Dynamics 2013, paper no. 053.

/3/ Néelz S., Pender G., 2010, Benchmarking of 2D Hydraulic Modelling Packages,

Report published by Environment Agency, www.environment-agency.gov.uk,

Copies of this report are available from the publications catalogue:

http://publications.environment-agency.gov.uk

## Much more than documents.

Discover everything Scribd has to offer, including books and audiobooks from major publishers.

Cancel anytime.