
To hear today’s event:

Listen via the audio stream through your computer speakers
OR
Listen via phone by clicking the teleconference request button in the
Participants window

You will not hear “hold music” while waiting for the event to begin.

1 © 2011 ANSYS, Inc. February 26, 2014


User Defined Functions in ANSYS Fluent: Part 4, Making UDFs Work in Parallel



Introduction To This Webinar Series

UDFs are a powerful and important feature of ANSYS Fluent

Five-part webinar series to discuss UDFs:
• Introduction to UDFs
  (Nov 6, 4 PM EST; Nov 13, 12:30 AM EST)
• UDFs for Lagrangian particle tracking (DPM)
  (Nov 20, 4 PM EST; Dec 4, 12:30 AM EST; Dec 11, 9 AM EST)
• UDFs for multiphase flow modeling
  (Jan 8, 4 PM EST; Jan 15, 9 AM EST; Feb 5, 12:30 AM EST)
• Making UDFs Work in Parallel
  (Feb 12, 9 AM EST; Feb 19, 12:30 AM EST; Feb 26, 4 PM EST)
• Best practices for writing UDFs



Agenda

Introduction
Fluent Parallel Architecture
4 Basic Components of Parallel UDFs
Troubleshooting Tips for Parallel UDFs
Questions?



Introduction
• Under some conditions, a UDF that works in serial must be
modified to ensure that it will also work correctly in parallel
• The use of parallel computing for Fluent simulations has
become commonplace due to continual advances in HPC
technology and decreasing computer hardware costs
– Simulations for which UDFs are required must be able to run in
parallel

~84% efficiency for 96M cell case at 10240 cores



Objectives
• “Parallelizing” a UDF means modifying a UDF that works in
serial so that it works properly both in serial and parallel
– Some UDFs need to be parallelized, others do not
• The motivation for this session is to introduce a few basic
concepts that illustrate how to parallelize a UDF
– It is not intended to be a training session or to discuss every
possible consideration that applies to UDFs in parallel
– More advanced topics such as low-level message passing, file
writing, GPU programming, … will not be discussed in this session
• The objectives are to explain
– How to know whether a UDF needs parallelization
– A few basic concepts that need to be understood to parallelize a
UDF
– How to troubleshoot a UDF that is not working correctly in
parallel



Fluent Parallel Architecture
[Figure: process layout showing Cortex, Host, and Compute-Node-0 through Compute-Node-3]

• Imagine a Fluent parallel session using 4 CPUs
– The session has 6 processes (cortex, host, and 4 compute nodes), connected as
shown in the figure
– The grid and solution data are distributed to and stored on the node
processes
– The cortex (GUI) and host processes do not have any data
– The host process communicates commands from the cortex to node-0, which
passes the commands to the other nodes
– When solution information is required, it is collected by node-0 from the
other nodes and transferred to the cortex via the host

A Simple Example Problem
[Figure: example geometry with boundaries w_top, w_left_a, w_left_b, w_right, and w_bottom]

• The case shown here will be used as the basis for numerous
examples in this session
• If it is read into a Fluent parallel session using 4 CPUs, the mesh and
solution data will be distributed into grid partitions as shown on the
next slide

Mesh with 4 Partitions

• In serial, there is just one instance of the UDF, but in parallel there are
multiple, identical instances of the UDF executing independently of one
another
– Here there are 4 node processes and also the host process, so an instance
of the UDF executes independently on each of 5 different processes

Executing the UDF
Serial (console screenshot; see note 1)

Parallel, 4 CPUs (console screenshot; see note 1)

What happened here?

Note 1: The text user interface (TUI) command is used throughout this
presentation to indicate the point at which the DEFINE_ON_DEMAND function
was executed, but the output from the function would have been identical if
it had been executed from the GUI

Four Basic Components of Parallel UDFs

• The output from the parallel session can be understood by introducing
four basic components of parallel UDFs
– Compiler Directives
– Looping (internal and external cells and faces)
– Global Reductions (synchronization)
– Node-to-Host and Host-to-Node Data Transfer



Basic Compiler Directives
It is sometimes necessary to restrict certain commands in a UDF to execute only
on a node process, or only on a host process by using compiler directives

#if RP_HOST
/* Coding here only performed on HOST process */
#endif

#if RP_NODE
/* Coding here only performed on NODE processes */
#endif

#if PARALLEL
/* Coding here only performed on HOST & NODE processes */
#endif



Negated Compiler Directives
Since many of the operations will also be required in the serial version, the
negated versions are more commonly used:

#if !PARALLEL
/* Coding here only performed on SERIAL process */
#endif

#if !RP_HOST
/* Coding here only performed on NODE & SERIAL processes */
#endif

#if !RP_NODE
/* Coding here only performed on HOST & SERIAL processes */
#endif
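The effect of these directives can be tried outside Fluent with a small sketch. It assumes a compute-node build: RP_HOST and RP_NODE are defined by hand here, whereas inside Fluent they come from the UDF build rather than from user code. The helpers `process_role` and `runs_cell_loop` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <string.h>

/* Assumption: defined by hand to mimic a compute-node build.
   Inside Fluent these macros are supplied by the UDF build. */
#ifndef RP_HOST
#define RP_HOST 0
#endif
#ifndef RP_NODE
#define RP_NODE 1
#endif

/* Hypothetical helper: reports which branch the preprocessor kept. */
static const char *process_role(void)
{
#if RP_HOST
    return "host";
#elif RP_NODE
    return "node";
#else
    return "serial";
#endif
}

/* Hypothetical helper: 1 if code guarded by "#if !RP_HOST" is compiled in. */
static int runs_cell_loop(void)
{
#if !RP_HOST
    return 1;   /* node and serial builds keep this branch */
#else
    return 0;   /* host build drops it */
#endif
}
```

Changing the two definitions to RP_HOST 1 and RP_NODE 0 flips both helpers, which mirrors how one source file yields different code on host and node processes.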



Partition Boundaries
• Even though the mesh is distributed into different partitions, Fluent’s solver
algorithms expect a cell to be on both sides of an interior face, so copies of the
neighboring partition’s cells are also kept on each mesh partition
• Compute Node 0 has copies of the cells on the other side of all partition faces
and Compute Node 1 has corresponding cell copies from Node 0

[Figure: domain decomposition and its distribution across Compute Node 0 and Compute Node 1]



Interior and Exterior Cells and Faces
• The main cells of each partition are designated as “Interior” cells and the
additional copied cells from other Compute Nodes are designated as “Exterior”
cells. The Partition Boundary Faces are a special type of Interior face

[Figure: Compute Node 0 partition with labels: Surface Boundary Zone Face, Partition Boundary Face, Interior Faces, Interior Cells, Exterior Cell]



Looping Macros in Parallel UDFs

• The standard form of the looping macros loops through all interior and
exterior cells and faces

• When addition or counting operations are performed, looping must be
restricted to interior cells for the correct total to be returned



Looping in the Example UDF

• Let’s modify the example UDF so it uses a compiler directive to restrict
the looping macro to execute only on the node processes and it loops
through only the interior cells

Because the solution data only exists on the compute nodes, there is no
need to perform the loop on the host



Executing the Modified UDF
[Console screenshots: Original and Modified]

• The mesh contains 224 cells
– Before modification, the UDF reported there were 65+68+67+68 = 268 cells
– Now the total number is correct
• The host process still reports there are no cells, and ideally it would be
nice not to have to manually add the totals from the different compute nodes
– We will see how to deal with this shortly
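The arithmetic behind the two totals can be reproduced with a small stand-alone sketch. The per-node totals (65, 68, 67, 68) and the mesh size (224) are from the example; the even split of 56 interior cells per node is an assumption made only so that the numbers add up, and all names below are hypothetical.

```c
#include <assert.h>

#define N_NODES 4

/* Hypothetical per-partition data. Interior + exterior matches the per-node
   totals in the example; the 56-cell interior split is an assumption. */
static const int interior_cells[N_NODES] = {56, 56, 56, 56};
static const int exterior_cells[N_NODES] = { 9, 12, 11, 12};

/* Mimics the standard cell loop: visits interior and exterior cells. */
static int count_all(int node)
{
    assert(node >= 0 && node < N_NODES);
    return interior_cells[node] + exterior_cells[node];
}

/* Mimics the interior-only cell loop. */
static int count_interior(int node)
{
    assert(node >= 0 && node < N_NODES);
    return interior_cells[node];
}

/* Adds the per-node counts by hand, as done on the slide. */
static int total(int (*count)(int))
{
    int node, sum = 0;
    for (node = 0; node < N_NODES; ++node)
        sum += count(node);
    return sum;
}
```

With these numbers, `total(count_all)` gives 268 and `total(count_interior)` gives 224: every exterior cell is a copy of a cell that is already counted as interior on another node.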

Face Looping

• The correct way to avoid looping over faces that belong to exterior cells is by
using PRINCIPAL_FACE_P(f,t)
– Do not use begin_f_loop_int(f,t)
– This is usually only required when a quantity is summed over all the faces of a thread
– Generally not necessary for assigning boundary conditions in DEFINE_PROFILE
macros

Global Reductions
• The example has shown that as the UDF executes in parallel, variables
can have different values on different compute nodes
• In many cases, the workflow of the UDF requires the execution of the
commands to be synchronized, for instance so that at a certain point
in the program execution, a variable has the same value on every
compute node
• This synchronization is achieved through the use of global reductions,
chosen according to the objective
– For the total value over all nodes, use a summation reduction
– For maximum or minimum values, use a high or low reduction
– For a logical test over all nodes, use an AND or OR reduction
• The form of the macro depends on the variable type
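What each kind of reduction computes can be sketched in plain C. This is a model of the idea and not Fluent's implementation: each function takes one value per compute node and returns the single result that every node would hold afterwards. The function names are hypothetical and are not Fluent macros.

```c
#include <assert.h>

#define N_NODES 4

/* Summation reduction: total over all nodes. */
static int global_sum(const int v[N_NODES])
{
    int i, sum = 0;
    assert(v != 0);
    for (i = 0; i < N_NODES; ++i)
        sum += v[i];
    return sum;
}

/* High reduction: maximum over all nodes. */
static int global_high(const int v[N_NODES])
{
    int i, hi = v[0];
    for (i = 1; i < N_NODES; ++i)
        if (v[i] > hi)
            hi = v[i];
    return hi;
}

/* Low reduction: minimum over all nodes. */
static int global_low(const int v[N_NODES])
{
    int i, lo = v[0];
    for (i = 1; i < N_NODES; ++i)
        if (v[i] < lo)
            lo = v[i];
    return lo;
}

/* OR reduction: true if the test is true on any node. */
static int global_or(const int v[N_NODES])
{
    int i;
    for (i = 0; i < N_NODES; ++i)
        if (v[i])
            return 1;
    return 0;
}

/* AND reduction: true only if the test is true on every node. */
static int global_and(const int v[N_NODES])
{
    int i;
    for (i = 0; i < N_NODES; ++i)
        if (!v[i])
            return 0;
    return 1;
}
```

For per-node values {3, 9, 4, 7} the sum is 23, the high is 9 and the low is 3. For flags {1, 0, 1, 1} the OR is true and the AND is false.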



Adding a Global Reduction in the Example

• We want to sum the cell count over all the compute nodes, so PRF_GISUM1 is
used
• Global reductions should always be inside an RP_NODE compiler directive



Global Reduction in Action

• Through the PRF_GISUM1 reduction, the value of the ncount variable is the same
on each compute node
• But its value on the host process is still zero
• Global reductions operate only on node processes
• The final piece of the puzzle is how to communicate the value of ncount to the
host process



Inter-Process Data Transfer
• The example has shown that sometimes it is necessary to
communicate values from the nodes to the host or vice-versa
• This is done using node-to-host and host-to-node operations
• The macro host_to_node_int_1(ncount) will send the value of
ncount from the host to the nodes
• Here, the value needs to be sent from the nodes to the host, so
node_to_host_int_1(ncount) must be used
– Important: host_to_node sends a value from the host to all nodes, but
node_to_host sends only the value from node-0 to the host
– Different forms (e.g. host_to_node_real_4(v_x,v_y,v_z,v_mag) ) exist
depending on the variable type and number of variables
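The direction of each transfer, and why the reduction has to come first, can be modeled with a short sketch. It is a plain C stand-in and not the Fluent macros themselves: `IntVar` and the three functions are hypothetical, and the value 56 per node is an assumed partial count chosen to total 224.

```c
#include <assert.h>

#define N_NODES 4

/* Hypothetical model of one int variable as seen by every process. */
typedef struct {
    int host;            /* copy held by the host process */
    int node[N_NODES];   /* copy held by each compute node */
} IntVar;

/* Stand-in for a summation reduction: every node ends with the total. */
static void reduce_sum(IntVar *v)
{
    int i, sum = 0;
    assert(v != 0);
    for (i = 0; i < N_NODES; ++i)
        sum += v->node[i];
    for (i = 0; i < N_NODES; ++i)
        v->node[i] = sum;
}

/* Stand-in for node_to_host: only node-0's value reaches the host. */
static void node_to_host(IntVar *v)
{
    assert(v != 0);
    v->host = v->node[0];
}

/* Stand-in for host_to_node: the host value reaches every node. */
static void host_to_node(IntVar *v)
{
    int i;
    assert(v != 0);
    for (i = 0; i < N_NODES; ++i)
        v->node[i] = v->host;
}
```

Calling `node_to_host` on unreduced counts hands the host only node-0's partial value (56 here). Calling `reduce_sum` first gives the host the full 224.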



Node-to-Host Data Transfer in Action
• By making the following modifications to the example UDF, the correct number
of cells will be counted and will be displayed only once by executing only on the
host process



Alternate Example UDF Execution
• Message0 executes only on node-0 in parallel and in the exact same way as
Message in serial, which means the same end result could have been
accomplished as below
• Depending on circumstances, sometimes this way would be better, other times
executing the Message statement on the host might be better

Remember from the global reduction slides that ncount has the same (correct)
value on all nodes



When to Parallelize a UDF
• In the example, the UDF needed to be parallelized because it was performing
an operation that required information located on different compute nodes
• Operations involving summation or addition (integration) usually need to be
parallelized
• These kinds of operations are most commonly performed in general purpose
define macros such as DEFINE_ADJUST, DEFINE_ON_DEMAND,
DEFINE_EXECUTE_AT_END, … (complete list in UDF manual)
– Sometimes required in other macros but much less common
• Other common examples where parallelization is required include using the
values of user-defined scheme variables in a UDF, or using non-local values to
control boundary conditions
– Example: want to set the temperature of wall w_right as a function of the
temperature at w_left_a, but these are located on different grid partitions

[Figure: example geometry with boundaries w_top, w_left_a, w_left_b, w_right, and w_bottom, annotated T(w_right)=f(w_left_a)]

When Parallelization is Unnecessary
• Most DEFINE_DPM_ macros require no parallelization
– Particles carry information with them as they cross mesh partitions
– Some exceptions for file writing, see UDF Manual, Section 7.4
• Many UDFs operate on one cell, or one face, at a time, such as
DEFINE_PROFILE, DEFINE_SOURCE, DEFINE_PROPERTY
– These generally do not need to be modified

In most cases, no parallelization is needed for simple DEFINE_PROFILE UDFs
such as this. Even PRINCIPAL_FACE_P is not needed at boundary faces.



Troubleshooting
• Parallel UDFs can be subject to a number of different kinds of run
time errors
– Correct value not calculated
– Data access violation
– Program hangs or crashes
• The first step in correcting run time errors is generally to find the
line in the code where the error occurs
– Also, whether it occurs on a host or a node process
– Usually with the aid of Message statements
• In the remainder of the session, a few troubleshooting tips and
tricks are presented



Troubleshooting with Message Statements
The example UDF is working correctly, but if it were not, the numbered
Message statements could indicate the following likely errors

• If message 1 is not displayed, there might be a problem accessing the thread
• If PRF_ is not enclosed in a compiler directive, message 2 would not be
displayed
• If node_to_host_ is accidentally included inside a compiler directive,
message 3 would not be displayed



Process Identification
• The Fluent solver variable “myid” can be used to identify on which
process a command is being executed
– The host process has the number 999999; each node process has the number of
its compute node
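One way such an identifier might be used is to prefix every diagnostic message with the process it came from. The sketch below is plain C and only illustrates the numbering rule quoted above. `HOST_ID` and `format_process_tag` are hypothetical names, and the id is passed in as an argument rather than read from Fluent.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define HOST_ID 999999   /* value quoted above for the host process */

/* Hypothetical helper: writes "[host]" or "[node-N]" into buf, the kind of
   prefix a Message call could print in front of its text. */
static void format_process_tag(int id, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    if (id == HOST_ID)
        snprintf(buf, len, "[host]");
    else
        snprintf(buf, len, "[node-%d]", id);
}
```

A message tagged this way shows at a glance whether a missing or unexpected line came from the host or from a particular compute node.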



Real Time Output
• When a UDF crashes or hangs in parallel, sometimes output from Message
statements is not reported in the console window before the crash
– This can occur because the output from a Message statement is stored in a print
buffer and the error might happen before the buffer is displayed in the console
• Use hflush() to ensure the buffer is flushed before any other UDF commands
are executed



Performance Tips: Global Reductions
• Use global reductions (PRF_) and inter-process communication
(host_to_node_, node_to_host_) only when absolutely necessary
– While processes are waiting to synchronize with one another, they are not executing,
which is not efficient
• Do not use global reductions in loops and/or in macros such as
DEFINE_SOURCE or DEFINE_PROPERTY that are called on a cell-by-cell basis
– If these macros use variables which need to be reduced, do the reduction in a
DEFINE_ADJUST or DEFINE_ON_DEMAND macro
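The cost difference between reducing inside a loop and reducing once can be illustrated by counting synchronization points in a plain C model. This is a sketch under simplifying assumptions: every node is given the same number of cells (N_CELLS), which is generally not true of a real partitioned mesh, and all names are hypothetical.

```c
#include <assert.h>

#define N_NODES 4
#define N_CELLS 3   /* assumption: same cell count on every node */

static int sync_points = 0;   /* counts how often the nodes must synchronize */

/* Stand-in for a summation reduction over one value per node. */
static double global_sum(const double per_node[N_NODES])
{
    double s = 0.0;
    int n;
    assert(per_node != 0);
    ++sync_points;
    for (n = 0; n < N_NODES; ++n)
        s += per_node[n];
    return s;
}

/* Costly pattern: a reduction inside the cell loop, one sync per cell. */
static double sum_reduce_in_loop(const double cells[N_NODES][N_CELLS])
{
    double total = 0.0, slice[N_NODES];
    int n, c;
    for (c = 0; c < N_CELLS; ++c) {
        for (n = 0; n < N_NODES; ++n)
            slice[n] = cells[n][c];
        total += global_sum(slice);
    }
    return total;
}

/* Preferred pattern: local sums first, then a single reduction. */
static double sum_reduce_once(const double cells[N_NODES][N_CELLS])
{
    double local[N_NODES];
    int n, c;
    for (n = 0; n < N_NODES; ++n) {
        local[n] = 0.0;
        for (c = 0; c < N_CELLS; ++c)
            local[n] += cells[n][c];
    }
    return global_sum(local);
}
```

Both functions return the same total, but the first synchronizes once per cell and the second only once, so the gap grows with the number of cells.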

This example is from a DEFINE_ADJUST macro, shown as Incorrect Usage and
Correct Usage

Callout: This will probably also result in a runtime error - why?

Performance Tips: Looping and Directives
• For parallel use, prefer {begin,end}_c_loop over {begin,end}_c_loop_int
whenever possible
– Communication between compute processes will be reduced
– An increase in communication can lead to a decrease in parallel efficiency
• The same consideration applies to PRINCIPAL_FACE_P in {begin,end}_f_loop
– Sometimes these need to be used; just restrict their usage to those times
• Also, use compiler directives judiciously so that commands execute only
where needed

In this example, initial values might be applied to exterior cells without
{begin,end}_c_loop_int, but this does not have any effect on the solution



Where to Find More Information
• Chapter 7 of the UDF Manual, “Parallel Considerations”, provides
detailed explanations of all aspects of parallel UDFs
– In-depth explanation
– Numerous examples
– Advanced topics
• Advanced UDF training provides a detailed, structured description
of UDF parallelization

In the customer portal, select training materials under Knowledge
Resources and use filters to narrow the search



Summary
• Understanding the basic concepts of parallelization simplifies the task
of making your UDF work in parallel
– Compiler directives
– Looping (interior and exterior cells and faces)
– Global reductions (synchronization)
– Node-to-Host and Host-to-Node data transfer
• Not all UDFs require parallelization
– If a UDF computes the sum of a quantity over the cells or faces within a
thread, or requires non-local information, or uses a user-defined scheme
variable, it will need to be parallelized
– Simple UDFs that act on a cell-by-cell or face-by-face basis often do not
require any modification to work in parallel
• Use myid and hflush() in conjunction with Message statements to
troubleshoot run time errors in parallel UDFs



To Ask a Question:

Click on the Q&A tab in the WebEx Toolbar

Webinar Recording:
Available in one week’s time in the
ANSYS Resource Library at
www.ansys.com/Resource+Library

