Ask The Expert: UDFs Part 4, Making UDFs Work in Parallel

1 © 2011 ANSYS, Inc. February 26, 2014

User Defined Functions in ANSYS Fluent –
Part 4, Making UDFs Work in Parallel

Introduction To This Webinar Series
UDFs – Powerful and important feature of ANSYS Fluent

Five-part webinar series to discuss UDFs.
• Introduction to UDFs
– (Nov 6, 4 PM EST; Nov 13, 12:30 AM EST)
• UDFs for Lagrangian particle tracking (DPM)
– (Nov 20, 4 PM EST; Dec 4, 12:30 AM EST; Dec 11, 9 AM EST)
• UDFs for multiphase flow modeling
– (Jan 8, 4 PM EST; Jan 15, 9 AM EST; Feb 5, 12:30 AM EST)
• Making UDFs Work in Parallel
– (Feb 12, 9 AM EST; Feb 19, 12:30 AM EST; Feb 26, 4 PM EST)
• Best practices for writing UDFs

Agenda
Introduction
Fluent Parallel Architecture
4 Basic Components of Parallel UDFs
Troubleshooting Tips for Parallel UDFs
Questions?

Introduction
• Under some conditions, a UDF that works in serial must be
modified to ensure that it will also work correctly in parallel
• The use of parallel computing for Fluent simulations has
become commonplace due to continual advances in HPC
technology and decreasing computer hardware costs
– Simulations for which UDFs are required must be able to run in
parallel
~84% efficiency for a 96M cell case at 10240 cores

Objectives
• “Parallelizing” a UDF means modifying a UDF that works in
serial so that it works properly both in serial and parallel
– Some UDFs need to be parallelized, others do not
• The motivation for this session is to introduce a few basic
concepts that illustrate how to parallelize a UDF
– It is not intended to be a training session or to discuss every
possible consideration that applies to UDFs in parallel
– More advanced topics such as low level message passing, file
writing, GPU programming, … will not be discussed in this session
• The objectives are to explain
– How to know whether a UDF needs parallelization
– A few basic concepts that need to be understood to parallelize a
UDF
– How to troubleshoot a UDF that is not working correctly in
parallel

Fluent Parallel Architecture
[Figure: process diagram showing Cortex, Host, and Compute-Node-0 through Compute-Node-3]
• Imagine a Fluent parallel session using 4 CPUs
– The session has 6 processes (cortex, host, and 4 compute nodes), connected
as shown in the figure
– The grid and solution data are distributed to and stored on the node
processes
– The cortex (GUI) and host processes do not have any data
– The host process communicates commands from the cortex to node-0, which
passes the commands to the other nodes
– When solution information is required, it is collected by node-0 from the
other nodes and transferred to cortex via the host

A Simple Example Problem
[Figure: example geometry with boundary zones labeled w_top, w_left_a, w_left_b, w_right, and w_bottom]
• The case shown here will be used as the basis for numerous
examples in this session
• If it is read into a Fluent parallel session using 4 CPUs, the mesh and
solution data will be distributed into grid partitions as shown on the
next slide

Mesh with 4 Partitions
• In serial, there is just one instance of the UDF, but in parallel, there are
multiple, identical instances of the UDF executing independently of one
another
– Here there are 4 node processes and also the host process, so one instance
of the UDF executes independently on each of the 5 processes

Executing the UDF
[Figure: console output from executing the example UDF in serial and in parallel on 4 CPUs (see note 1)]
What happened here?
1) The text user interface (TUI) command is used throughout this
presentation in order to indicate the point at which the
DEFINE_ON_DEMAND function was executed, but the output from the
function would have been identical if it had been executed in the GUI

Four Basic Components of Parallel UDFs
• The output from the parallel session can be understood by
introducing four basic components of parallel UDFs
– Compiler Directives
– Looping (internal and external cells and faces)
– Global Reductions (synchronization)
– Node-to-Host and Host-to-Node Data Transfer

Basic Compiler Directives
It is sometimes necessary to restrict certain commands in a UDF to execute only
on a node process, or only on a host process by using compiler directives
#if RP_HOST
/* Coding here only performed on HOST process */
#endif
#if RP_NODE
/* Coding here only performed on NODE processes */
#endif
#if PARALLEL
/* Coding here only performed on HOST & NODE processes */
#endif

Negated Compiler Directives
Since many of the operations will also be required in the serial version, the
negated versions are more commonly used:
#if !PARALLEL
/* Coding here only performed on SERIAL process */
#endif
#if !RP_HOST
/* Coding here only performed on NODE & SERIAL processes */
#endif
#if !RP_NODE
/* Coding here only performed on HOST & SERIAL processes */
#endif
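
As a stand-alone illustration of how these directives select code at compile time, the sketch below is plain C rather than a Fluent UDF. The fallback values it assigns to RP_HOST and RP_NODE are assumptions that mimic a compute-node build; in a real UDF these macros are supplied by the Fluent build, and the function names process_role and runs_where_data_lives are hypothetical.

```c
#include <assert.h>

/* Assumed fallback values that mimic a compute-node build.
   In a real UDF these macros come from the Fluent build, not from the UDF. */
#ifndef RP_HOST
#define RP_HOST 0
#endif
#ifndef RP_NODE
#define RP_NODE 1
#endif

/* Hypothetical helper: reports which branch the preprocessor kept. */
const char *process_role(void)
{
    assert(!(RP_HOST && RP_NODE)); /* a process is never both host and node */
#if RP_HOST
    return "host";
#elif RP_NODE
    return "node";
#else
    return "serial";
#endif
}

/* Returns 1 when code guarded by "#if !RP_HOST" would be compiled in. */
int runs_where_data_lives(void)
{
#if !RP_HOST
    return 1; /* node or serial: mesh and solution data are present */
#else
    return 0; /* host: no mesh or solution data */
#endif
}
```

Compiling the same file with -DRP_HOST=1 -DRP_NODE=0 flips both functions to the host branch, which is the sense in which the one UDF source yields different code on different processes.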

Partition Boundaries
• Even though the mesh is distributed into different partitions, Fluent’s solver
algorithms expect a cell to be on both sides of an interior face, so copies of the
neighboring partition’s cells are also kept on each mesh partition
• Compute Node 0 has copies of the cells on the other side of all partition faces
and Compute Node 1 has corresponding cell copies from Node 0
[Figure: domain decomposition and distribution across Compute Node 0 and Compute Node 1]

Interior and Exterior Cells and Faces
• The main cells of each partition are designated as “Interior” cells and the
additional copied cells from other Compute Nodes are designated as “Exterior”
cells. The Partition Boundary Faces are a special type of Interior face
[Figure: Compute Node 0, with labels for the surface boundary zone face, partition boundary face, interior faces, exterior cell, and interior cells]


Looping Macros in Parallel UDFs
• The standard form of looping macros loops through all
interior and exterior cells and faces
• When addition or counting operations are performed,
looping must be restricted to interior cells for the correct
total to be returned

Looping in the Example UDF
• Let’s modify the example UDF so it uses a compiler
directive to restrict the looping macro to execute only on
the node processes and it loops through only the interior
cells
Note: because the solution data only exists on the compute nodes,
there is no need to perform the loop on the host

Executing the Modified UDF
[Figure: console output from the original and modified UDF]
• The mesh contains 224 cells
– Before modification, the UDF reported there were 65+68+67+68 = 268 cells
– Now the total number is correct
• The host process still reports there are no cells, and it would be better
not to have to add the totals from the different compute nodes manually
– We will see how to deal with this shortly

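The over-count can be reproduced with a small stand-alone model. The sketch below is plain C, not a Fluent UDF, and the type and function names are hypothetical. Only the per-node totals (65, 68, 67, 68) and the mesh size (224) come from the example; the split into 56 interior cells per node plus exterior copies is an assumption made so that the numbers add up.

```c
#include <assert.h>
#include <stddef.h>

/* One mesh partition as seen by a compute node. */
typedef struct {
    int n_interior; /* cells owned by this partition */
    int n_exterior; /* copies of neighboring partitions' cells */
} Partition;

/* Models a standard cell loop: visits interior and exterior cells. */
int count_all_cells(const Partition *p, size_t n)
{
    int total = 0;
    assert(p != NULL);
    for (size_t i = 0; i < n; ++i)
        total += p[i].n_interior + p[i].n_exterior;
    return total;
}

/* Models a loop restricted to interior cells. */
int count_interior_cells(const Partition *p, size_t n)
{
    int total = 0;
    assert(p != NULL);
    for (size_t i = 0; i < n; ++i)
        total += p[i].n_interior;
    return total;
}

/* Assumed split: 56 interior cells per node, the rest exterior copies. */
static const Partition example_partitions[4] = {
    {56, 9}, {56, 12}, {56, 11}, {56, 12}
};
```

With this data, count_all_cells returns 268 and count_interior_cells returns 224, matching the totals reported before and after the modification.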
Face Looping
• The correct way to avoid looping over faces that belong to exterior cells is by
using PRINCIPAL_FACE_P(f,t)
– Do not use begin_f_loop_int(f,t)
– This is usually only required when a quantity is summed over all the faces of a thread
– Generally not necessary for assigning boundary conditions in DEFINE_PROFILE
macros

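To show why a principal-face test matters when summing, here is a stand-alone model in plain C (not Fluent code). It assumes that a face on a partition boundary is stored on both neighboring nodes and that exactly one of them is flagged as its principal owner; the Face type, the is_principal flag, and the function name are hypothetical stand-ins for what PRINCIPAL_FACE_P(f,t) reports.

```c
#include <assert.h>
#include <stddef.h>

/* A face as stored on one compute node. */
typedef struct {
    double area;
    int is_principal; /* 1 if this node is the face's principal owner */
} Face;

/* Sums face areas on one node; when principal_only is nonzero,
   faces owned by another node are skipped. */
double sum_face_area(const Face *faces, size_t n, int principal_only)
{
    double total = 0.0;
    assert(faces != NULL);
    for (size_t i = 0; i < n; ++i) {
        if (principal_only && !faces[i].is_principal)
            continue; /* copy of a face that another node will count */
        total += faces[i].area;
    }
    return total;
}
```

If two nodes share one face of area 0.5, adding the unfiltered per-node sums counts that face twice, while the filtered sums count it once.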
Global Reductions
• The example has shown that as the UDF executes in parallel, variables
can have different values on different compute nodes
• In many cases, the workflow of the UDF requires the execution of the
commands to be synchronized, for instance so that at a certain point
in the program execution, a variable has the same value on every
compute node
• This synchronization is achieved through the use of global reductions
– The reduction to use depends on the objective
• For total value over all nodes, use a summation reduction
• For maximum or minimum values, use a high or low reduction
• For a logical test over all nodes, use an AND or OR reduction
– The form of the macro depends on the variable type
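
The effect of each kind of reduction can be modeled in a few lines of plain C. This is a conceptual sketch, not Fluent's implementation: the array stands in for one integer variable held on each compute node, and the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { RED_SUM, RED_HIGH, RED_LOW, RED_AND, RED_OR } Reduction;

/* Combines one value per node and writes the result back to every
   node, so that all nodes hold the same value afterwards. */
int global_reduce(int *per_node, size_t n_nodes, Reduction op)
{
    assert(per_node != NULL && n_nodes > 0);
    int result = per_node[0];
    for (size_t i = 1; i < n_nodes; ++i) {
        int v = per_node[i];
        switch (op) {
        case RED_SUM:  result += v; break;
        case RED_HIGH: if (v > result) result = v; break;
        case RED_LOW:  if (v < result) result = v; break;
        case RED_AND:  result = result && v; break;
        case RED_OR:   result = result || v; break;
        }
    }
    if (op == RED_AND || op == RED_OR)
        result = (result != 0); /* logical results are 0 or 1 */
    for (size_t i = 0; i < n_nodes; ++i)
        per_node[i] = result;   /* synchronized value on every node */
    return result;
}
```

The write-back loop is the point of the model: before the call each node holds its own value, and after it they all agree.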

Adding a Global Reduction in the Example
• We want to sum the cell count over all the compute nodes, so PRF_GISUM1 is
used
• Global reductions should always be inside an RP_NODE compiler directive

Global Reduction in Action
• Through the PRF_GISUM1 reduction, the value of the ncount variable is the same
on each compute node
• But its value on the host process is still zero
• Global reductions operate only on node processes
• The final piece of the puzzle is how to communicate the value of ncount to the
host process

Inter-Process Data Transfer
• The example has shown that sometimes it is necessary to
communicate values from the nodes to the host or vice-versa
• This is done using node-to-host and host-to-node operations
• The macro host_to_node_int_1(ncount) will send the value of
ncount from the host to the nodes
• Here, the value needs to be sent from the nodes to the host, so
node_to_host_int_1(ncount) must be used
– Important: host_to_node sends a value from the host to all nodes, but
node_to_host sends only the value from node-0 to the host
– Different forms (e.g. host_to_node_real_4(v_x,v_y,v_z,v_mag)) exist
depending on the variable type and number of variables
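
The asymmetry between the two directions is easy to miss, so here is a stand-alone model of it in plain C. It is a sketch of the behavior described above, not Fluent code: the Session struct and the three function names are hypothetical, and the starting value of 56 cells per node is an assumed figure.

```c
#include <assert.h>
#include <stddef.h>

#define N_NODES 4

/* One integer variable as held by the host and by each compute node. */
typedef struct {
    int host_value;
    int node_value[N_NODES];
} Session;

/* Models node-to-host transfer: only node-0's value reaches the host. */
void send_node_to_host(Session *s)
{
    assert(s != NULL);
    s->host_value = s->node_value[0];
}

/* Models host-to-node transfer: the host's value goes to all nodes. */
void send_host_to_node(Session *s)
{
    assert(s != NULL);
    for (size_t i = 0; i < N_NODES; ++i)
        s->node_value[i] = s->host_value;
}

/* Models a summation reduction over the compute nodes only. */
void sum_over_nodes(Session *s)
{
    int total = 0;
    assert(s != NULL);
    for (size_t i = 0; i < N_NODES; ++i)
        total += s->node_value[i];
    for (size_t i = 0; i < N_NODES; ++i)
        s->node_value[i] = total;
}
```

Calling send_node_to_host before sum_over_nodes hands the host only node-0's partial count (56 in this model); reducing first and then transferring gives the host the full total of 224.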

Node-to-Host Data Transfer in Action
• By making the following modifications to the example UDF, the correct number
of cells will be counted and will be displayed only once by executing only on the
host process

Alternate Example UDF Execution
• Message0 executes only on node-0 in parallel, and in exactly the same way as
Message in serial, which means the same end result could have been
accomplished as shown below
• Depending on circumstances, sometimes this way would be better; other times
executing the Message statement on the host might be better
Note: as shown under “Global Reduction in Action”, ncount has the same
(correct) value on all nodes

When to Parallelize a UDF
• In the example, the UDF needed to be parallelized
because it was performing an operation that
required information located on different compute
nodes
• Operations involving summation or addition
(integration) usually need to be parallelized
• These kinds of operations are most commonly
performed in general purpose define macros such as
DEFINE_ADJUST, DEFINE_ON_DEMAND,
DEFINE_EXECUTE_AT_END, … (complete list in UDF
manual)
– Sometimes required in other macros but much less
common
• Other common examples where parallelization is
required include using the values of user-defined
scheme variables in a UDF, or using non-local values to
control boundary conditions
– Example: want to set the temperature of wall w_right as a
function of the temperature at w_left_a, but these are
located on different grid partitions
[Figure: example geometry with w_top, w_left_a, w_left_b, w_right, and w_bottom, annotated T(w_right)=f(w_left_a)]

When Parallelization is Unnecessary
• Most DEFINE_DPM_ macros require no parallelization
– Particles carry information with them as they cross mesh partitions
– Some exceptions for file writing, see UDF Manual, Section 7.4
• Many UDFs operate on one cell, or one face, at a time, such as
DEFINE_PROFILE, DEFINE_SOURCE, DEFINE_PROPERTY
– These generally do not need to be modified
Note: mostly no parallelization is needed for simple DEFINE_PROFILE UDFs
such as this. Even PRINCIPAL_FACE_P is not needed at boundary faces.

Troubleshooting
• Parallel UDFs can be subject to a number of different kinds of run-time
errors
– Correct value not calculated
– Data access violation
– Program hangs or crashes
• The first step in correcting run-time errors is generally to find the
line in the code where the error occurs
– Also, whether it occurs on a host or a node process
– Usually with the aid of Message statements
• In the remainder of the session, a few troubleshooting tips and
tricks are presented

Troubleshooting with Message Statements
The example UDF is working correctly, but if it
were not, the Message statements could indicate
the following likely errors
• If 1. is not displayed, there might be a problem
accessing the thread
• If PRF_ is not enclosed in a compiler directive,
2. would not be displayed
• If node_to_host_ is accidentally included inside
a compiler directive, 3. would not be displayed

Process Identification
• The Fluent solver variable “myid” can be used to identify on which
process a command is being executed
– The host process is number 999999; each node process has the number of its
compute node
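
A process ID of this kind can be turned into a readable label for diagnostic messages. The helper below is a plain-C sketch with hypothetical names (HOST_ID, process_label); it only encodes the numbering stated above and takes the ID as an argument instead of reading Fluent's myid.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define HOST_ID 999999 /* host process number as stated on this slide */

/* Writes "host" or "node-<id>" into buf and returns buf. */
char *process_label(int id, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    if (id == HOST_ID)
        snprintf(buf, len, "host");
    else
        snprintf(buf, len, "node-%d", id);
    return buf;
}
```

Prefixing each diagnostic message with such a label makes it clear which process produced a given line of console output.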

Real Time Output
• When a UDF crashes or hangs in parallel, sometimes output from Message
statements is not reported in the console window before the crash
– This can occur because the output from a Message statement is stored in a print
buffer and the error might happen before the buffer is displayed in the console
• Use hflush() to ensure the buffer is flushed before any other UDF commands
are executed

Performance Tips: Global Reductions
• Use global reductions (PRF_) and inter-process communication
(host_to_node_, node_to_host_) only when absolutely necessary
– While processes are waiting to synchronize with one another, they are not executing,
which is not efficient
• Do not use global reductions in loops and/or in macros such as
DEFINE_SOURCE or DEFINE_PROPERTY that are called on a cell-by-cell basis
– If these macros use variables which need to be reduced, do the reduction in a
DEFINE_ADJUST or DEFINE_ON_DEMAND macro
[Figure: code comparison from a DEFINE_ADJUST macro, labeled “Incorrect Usage” and “Correct Usage”. Callout: “This will probably also result in a runtime error - why?”]

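One likely reason for the run-time error hinted at in the callout, offered here as an assumption and not something the slide states, is that a reduction placed inside a cell loop is called a different number of times on each node, because nodes hold different numbers of cells. The plain-C model below (hypothetical function names, not Fluent code) counts the reduction calls each node would make under both placements.

```c
#include <assert.h>
#include <stddef.h>

/* Reduction calls made by one node when the reduction sits inside
   the cell loop: one synchronization point per cell visited. */
int calls_with_reduction_in_loop(int n_cells)
{
    int calls = 0;
    assert(n_cells >= 0);
    for (int c = 0; c < n_cells; ++c)
        ++calls;
    return calls;
}

/* Reduction calls made by one node when the loop only accumulates
   locally and a single reduction follows it. */
int calls_with_reduction_after_loop(int n_cells)
{
    assert(n_cells >= 0);
    return 1;
}

/* Returns 1 if every node makes the same number of reduction calls. */
int calls_match(const int *calls, size_t n_nodes)
{
    assert(calls != NULL && n_nodes > 0);
    for (size_t i = 1; i < n_nodes; ++i)
        if (calls[i] != calls[0])
            return 0;
    return 1;
}
```

Using the per-node counts from the earlier example (65, 68, 67, 68), the in-loop placement gives mismatched call counts across nodes, while the after-loop placement gives exactly one call on each node.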
Performance Tips: Looping and Directives
• For parallel, use {begin,end}_c_loop instead of {begin,end}_c_loop_int
whenever possible
– Communication between compute processes will be reduced
– Increase in communication can lead to decrease in parallel efficiency
• The same consideration applies for PRINCIPAL_FACE_P in {begin,end}_f_loop
– Obviously sometimes these need to be used; just restrict the usage to only those
times
• Also, use compiler directives judiciously
so that commands execute only where
needed
Note: in this example, initial values might
be applied to exterior cells without
{begin,end}_c_loop_int, but it does
not have any effect on the solution

Where to Find More Information
• Chapter 7 of the UDF Manual, “Parallel Considerations”, provides
detailed explanations of all aspects of parallel UDFs
– In-depth explanation
– Numerous examples
– Advanced topics
• Advanced UDF training provides a detailed, structured description
of UDF parallelization
Note: in the customer portal, select training materials
under Knowledge Resources and use filters
to narrow the search

Summary
• Understanding the basic concepts of parallelization simplifies the task
of making your UDF work in parallel
– Compiler directives
– Looping (interior and exterior cells and faces)
– Global reductions (synchronization)
– Node-to-Host and Host-to-Node data transfer
• Not all UDFs require parallelization
– If a UDF computes the sum of a quantity over the cells or faces within a
thread, or requires non-local information, or uses a user-defined scheme
variable, it will need to be parallelized
– Simple UDFs that act on a cell-by-cell or face-by-face basis often do not
require any modification to work in parallel
• Use myid and hflush() in conjunction with Message statements to
troubleshoot run-time errors in parallel UDFs

Webinar Recording:
Available in one week’s time in the
ANSYS Resource Library at
www.ansys.com/Resource+Library
