Samir Patel Advisor: Dr. Cris Cecka June 23, 2012
Abstract
Commercial fluid dynamics software is expensive and can be difficult to use for transient problems involving moving objects. While open-source codes exist to handle such problems, the documentation and structure of such codes might be difficult to navigate for researchers not well-versed in computer science or students lacking a formal background in fluid dynamics. aeroCuda was developed to provide an efficient, accurate, and open-source method for testing fluid dynamics problems involving moving objects. The solution method for the Navier-Stokes equations was the Projection Method, and the effects of objects moving in fluid were implemented via Peskin's Immersed Boundary Method. The code was first developed in serial and then parallelized via CUDA and MPI to optimize its speed. It generates and rotates a full 2-d point cloud to simulate the object's shape, and also allows the user to implement full 2-d translational and rotational motion of the object. The results obtained at Reynolds numbers of 25 and 100 matched those obtained by Saiki and Biringen as well as Peskin and Lai; the expected physical phenomena are also confirmed.
Preface
This paper was submitted for the satisfaction of the thesis requirement for the Bachelor of Science in Engineering Sciences at Harvard College on April 2, 2012.
My interest in the ﬁeld of CFD was piqued in high school, when I ﬁrst studied the Speedo LZR Racer. Since then, I have come a long way in my understanding of CFD, both in its applications and theoretical underpinnings. However, none of this would have been possible without the support of many individuals who have supported me throughout my career as a student.
I would like to thank my parents and my sister for their continued support and trust in me. They have been monumental in getting me to where I am today. I love you, Satish, Sneh, and Swati Patel!
I would like to thank my advisor, Cris Cecka, for his support in helping me bring this project to life.
There are some individuals who have supported my work as a student at Harvard without whom I could not envision being where I am today. Special thanks to Professor Robert Wood and Dr. Hiroto Tanaka for allowing me the opportunity to work on their robotics projects and learn from their dedication to the subject, which helped develop my interests and skill as a researcher. Special thanks to Professor Anette Hosoi and Ms. Lisa Burton for allowing me to begin exploring CFD under their tutelage.
I would also like to thank those who influenced me in high school: Dr. Thom Morris, Mrs. Martha DeWeese, Mrs. Kemp Hoversten, Mr. Stephen Mikell, and Mr. Patrick Fisher. Their guidance allowed me to become the individual that I am today, and without their support I would not be where I am. In addition, I would like to thank the man who helped kindle my interest in mathematics, Mr. Farhad Azar.
I would also like to thank Assistant Professor Charbel Bou-Mosleh of the Notre Dame University of Lebanon, who over the course of one summer taught me to appreciate CFD and helped me craft my beginnings as a researcher in this area.
I would like to thank Professor Charles Peskin of NYU for his support of my project (and of course, for developing its theoretical basis).
I would like to thank Karl Helfrich of Woods Hole Institute and Mattheus Ueckermann of the Massachusetts Institute of Technology for helping me navigate the world of CFD.
This project is dedicated to the memory of my grandfathers, a mechanical engineer and a physicist.
1 Motivation
  1.1 Computational Fluid Dynamics
  1.2 Moving Mesh and a Translating Cylinder
  1.3 Governing Equations and Solutions
  1.4 Why Immersed Boundary
2 Immersed Boundary Method and Solution to the Navier-Stokes equations
  2.1 Modification of the Navier-Stokes
  2.2 Developing the Forcing Term
  2.3 Relationship between the Solid and Prescribed Points
3 Goal and Design Phase
  3.1 Goal of aeroCuda
  3.2 Reasons for Evaluation
  3.3 Platforms Evaluated
    3.3.1 Comsol
    3.3.2 Ansys
    3.3.3 openFoam
  3.4 Language for the Module
4 Working with openFoam
  4.1 Mesh Generation
  4.2 Solver Utility
  4.3 Building the Code for openFoam
  4.4 Issues with openFoam
5 Development of aeroCuda
  5.1 Influences
  5.2 Structural Overview
    5.2.1 Input
    5.2.2 Pre-Processing
    5.2.3 Solver-Loop
    5.2.4 Post-Processing
  5.3 Pre-Computation: Interior Point Generation and Rotation Capabilities
    5.3.1 Motivation behind Interior Point Generation
    5.3.2 Interpolating the Surface of the Geometry
    5.3.3 Developing the Cloaking Mechanism
    5.3.4 Developing the Delaunay Mechanism
    5.3.5 Comparing the Delaunay and Cloaking Mechanisms
    5.3.6 Implementing the Rotation Algorithm
  5.4 Developing the Solver In Serial Code
    5.4.1 Implementing the Projection Method: Steps 2 and 4
    5.4.2 Implementing the Projection Method: Step 3
    5.4.3 Implementing the Interpolation Step
    5.4.4 Implementing the Forcing Field
6 Code Refinements and Optimization
  6.1 The Variable-Spring Model
    6.1.1 Motivation
    6.1.2 Underlying Principle
    6.1.3 Algorithm
  6.2 Parallelization
    6.2.1 Evaluation of MPI
    6.2.2 Evaluation of CUDA
    6.2.3 Going with CUDA
  6.3 Implementing the CUDA-optimized Structure
    6.3.1 Implementing the Interpolation
    6.3.2 Implementing the Forcing
    6.3.3 Implementing the Intermediate Velocity and Final Velocity Calculations
7 Results Obtained with aeroCuda
  7.1 The Effect of Optimization
  7.2 Numerical Confirmations
  7.3 Expected Physical Phenomena and Further Validation
  7.4 A Closer Look at the Physical Response of the Immersed Solid
  7.5 Physical Location of the Immersed Solid Points
8 Test Case: Swimmer in Glide Position
  8.1 Overview
  8.2 Simulation Details
  8.3 Simulation Results
  8.4 Reynolds Number Transition
9 Conclusion
  9.1 Numerical Improvements to aeroCuda
  9.2 Technical Improvements to aeroCuda
  9.3 Capability Enhancements to aeroCuda
  9.4 Final Remarks
10
  10.1 Resources Used
  10.2 Upgrades
11 Appendix
  11.1 Solving the Immersed Solid-influenced Navier-Stokes Equations
    11.1.1 Step 1: Force Projection
    11.1.2 Step 2: Calculating the intermediate velocity field
    11.1.3 Step 3: Calculating the Pressure Field
    11.1.4 Step 4: Calculating the Final Velocity fields
    11.1.5 Step 5: Interpolation and Velocity
Computational Fluid Dynamics
The field of computational fluid dynamics (CFD) arose gradually from a demonstrated need to evaluate aerodynamic, mechanical, biological, and environmental systems, either for design or for the study of naturally occurring phenomena like vortex shedding. However, owing to the complexity of solving the Navier-Stokes equations, the field grew to integrate three disciplines (computer science, applied mathematics, and fluid dynamics) in order to develop efficient and accurate solutions to those equations.
Most CFD simulations involve three main steps: pre-processing, simulation, and post-processing. In the pre-processing step, the problem at hand (e.g. a 2-d cylinder in a wind tunnel) is decomposed into either 2-d or 3-d geometry, depending on the dimensionality of the problem. This decomposition involves breaking the domain into a contiguous set of triangles or other simple geometrical shapes. For example, a 2-d cylinder problem would need to be partitioned into triangles to have its solution developed. Moreover, this decomposition can take a long time if the geometries are complicated: the elemental partitions must not have any overlaps, jagged edges, or displaced elements. This process covers only a steady-state problem; for transient solutions, moving meshes might be implemented. In such simulations, a mesh with a time-dependent orientation is developed, which makes the simulation computationally very expensive because the mesh has to be updated to reflect the new orientation at each timestep. To reiterate, a moving mesh is desired for an object that changes either shape or orientation as time progresses.
In the simulation step, depending on the magnitude of the Reynolds number of the problem, $Re = \rho u d / \mu$, different parameters and solution methods might need to be implemented to ensure stability of the solution. For example, in high Reynolds number problems where significant turbulence is expected, more sophisticated models might have to be applied to properly resolve the solutions. In other cases, the time-step and grid-size might have to be reduced to ensure accurate solutions. When such reductions are made, the code must be as efficient as possible so that good solutions do not require excessive runtime. In the post-processing step, the flow-fields at different times are observed, and the convergence of the force or another field variable to its steady-state level is tracked. In this case, for an object with prescribed motion that is either periodic or constant, steady-state refers to the situation in which the forces experienced are either periodic or constant. Being able to track the convergence of the forces allows us to know when the simulation can be terminated with sufficient results.
Moving Mesh and a Translating Cylinder
To illustrate the complexity of implementing a moving mesh simulation, the case of a translating cylinder is considered. For the algorithm, the r-method outlined by Tao Tang of the University of Maryland is considered. In this method, gridpoints are moved in such a way that at each timestep, a high concentration of points is located where strong changes in the variable fields (such as pressure or velocity) are expected. To support the r-method, supporting functions, such as interpolation of field variables to reflect field values at translated nodes, need to be implemented as well. In the case of the translating cylinder, suppose that the cylinder moves with a timestep of δt = 0.001 s at a velocity of u = 1 m/s, at a Reynolds number of 100. This means that 1000 iterations are needed to see the cylinder translate 1 meter. For the flow to develop properly, usually 6-10 meters are needed before the Von Karman shedding phenomenon can be observed. Therefore, the nodes and variables are translated and interpolated, respectively, at least 6000 times (6000 to 10000 over the full 6-10 meter range) to see the quantities develop. In addition, the mappings between the actual domain and a test domain needed for a finite element formulation have to be taken into account as well. Depending on the clustering of nodes around the cylinder and the accuracy desired, the number of points that need to be interpolated and updated may range from tens to thousands. The complexity of the equations at hand, as well as of the coding, would set a barrier to someone who is not well-versed in computer science and fluid mechanics. For a student just beginning to learn fluid mechanics, implementing a moving mesh simulation to observe the flow around a translating cylinder is an unrealistic task. Moreover, the initial mesh itself has to be generated, which may or may not be difficult depending on the complexity of the object. In conclusion, many steps have to be executed at each timestep. For problems like flapping wings, which depend on rapid optimization of a variety of parameters, the overall cost of running the simulations would be very high.
The immersed boundary method offers a much less expensive alternative, though at the cost of reduced accuracy (to be explained).
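The step counts quoted for the translating cylinder can be checked with a few lines of Python. This is only a sketch of the arithmetic: the helper names `steps_to_travel` and `reynolds_number` are hypothetical, and the fluid properties used for the Reynolds number are illustrative values chosen to give Re = 100, not values taken from this project.

```python
def steps_to_travel(distance_m, speed_m_s, dt_s):
    """Number of timesteps needed to translate a given distance."""
    return round(distance_m / (speed_m_s * dt_s))


def reynolds_number(rho, u, d, mu):
    """Re = rho * u * d / mu, with mu the dynamic viscosity."""
    return rho * u * d / mu


# One meter at u = 1 m/s with dt = 0.001 s, then the 6-10 m development range.
steps_per_meter = steps_to_travel(1.0, 1.0, 0.001)
steps_low = steps_to_travel(6.0, 1.0, 0.001)
steps_high = steps_to_travel(10.0, 1.0, 0.001)

# Illustrative (assumed) properties that produce Re = 100.
re = reynolds_number(rho=1.0, u=1.0, d=1.0, mu=0.01)
```

Every one of those steps would require a mesh update and a field interpolation in a moving-mesh code, which is the cost the immersed boundary approach avoids.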
Governing Equations and Solutions
In CFD simulations, two primary equations are usually solved, the momentum and mass conservation equations; collectively they are known as the Navier-Stokes equations. In two dimensions, the primary quantities dealt with are the pressure $p$ and the velocity components $u$ and $v$. The quantity $\nu$ is the kinematic viscosity and $\rho$ is the density. Let the 2-d velocity field be denoted as $\mathbf{u} = (u, v)$. The equations together are:
Momentum: $\frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u} \cdot \nabla)\mathbf{u} = -\frac{1}{\rho}\nabla p + \nu \nabla^2 \mathbf{u}$
Mass: $\nabla \cdot \mathbf{u} = 0$
Together, these equations establish the condition of incompressible flow, where the fluid does not change density during the solution phase. This approximation is critical to the formulation of the Projection Method, the algorithm used to solve the equations in this project. The idea behind the projection method is that the velocity is propagated forward in time and then corrected to account for the incompressibility condition. The steps for solving the Navier-Stokes equations in this algorithm are:
1. Solve for the intermediate velocity, $\mathbf{u}^* = \mathrm{RHS}$.
2. Solve for the pressure using the divergence of the flow field, $\nabla^2 p^{n+1} = \frac{\rho}{\delta t} \nabla \cdot \mathbf{u}^*$. Done using a Fast Fourier Transform (FFT).
3. Project the intermediate velocity to get the divergence-free final velocity, $\mathbf{u}^{n+1} = \mathbf{u}^* - \frac{\delta t}{\rho} \nabla p^{n+1}$.
Note that aeroCuda solves problems with periodic boundary conditions in both the X- and Y-directions. There is no inlet or outlet flow, only objects moving in a stationary fluid.
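The three projection steps can be sketched on a doubly periodic grid with NumPy. This is a minimal illustration under stated assumptions, not the aeroCuda implementation: the right-hand side of step 1 is assumed to be an explicit forward-Euler update with central differences, the pressure solve and the projection both use spectral derivatives so that the result is divergence-free to round-off, and the function names (`projection_step`, `divergence`) are hypothetical.

```python
import numpy as np


def _wavenumbers(shape, dx):
    """Angular wavenumber grids (KX, KY) for a periodic array of this shape."""
    ny, nx = shape
    kx = 2.0 * np.pi * np.fft.fftfreq(nx, d=dx)
    ky = 2.0 * np.pi * np.fft.fftfreq(ny, d=dx)
    return np.meshgrid(kx, ky)


def divergence(u, v, dx):
    """Spectral divergence of a periodic velocity field."""
    KX, KY = _wavenumbers(u.shape, dx)
    div_hat = 1j * KX * np.fft.fft2(u) + 1j * KY * np.fft.fft2(v)
    return np.fft.ifft2(div_hat).real


def projection_step(u, v, dt, dx, nu, rho=1.0):
    """Advance (u, v) by one timestep; arrays are indexed [y, x]."""
    def ddx(a):
        return (np.roll(a, -1, axis=1) - np.roll(a, 1, axis=1)) / (2.0 * dx)

    def ddy(a):
        return (np.roll(a, -1, axis=0) - np.roll(a, 1, axis=0)) / (2.0 * dx)

    def lap(a):
        return (np.roll(a, -1, axis=0) + np.roll(a, 1, axis=0)
                + np.roll(a, -1, axis=1) + np.roll(a, 1, axis=1)
                - 4.0 * a) / dx**2

    # Step 1: intermediate velocity from advection and diffusion only.
    u_star = u + dt * (nu * lap(u) - u * ddx(u) - v * ddy(u))
    v_star = v + dt * (nu * lap(v) - u * ddx(v) - v * ddy(v))

    # Step 2: solve lap(p) = (rho / dt) * div(u_star) with an FFT.
    KX, KY = _wavenumbers(u.shape, dx)
    k2 = KX**2 + KY**2
    k2[0, 0] = 1.0  # avoid dividing by zero for the mean mode
    u_hat, v_hat = np.fft.fft2(u_star), np.fft.fft2(v_star)
    div_hat = 1j * KX * u_hat + 1j * KY * v_hat
    p_hat = -(rho / dt) * div_hat / k2
    p_hat[0, 0] = 0.0  # pressure is defined up to a constant

    # Step 3: subtract the pressure gradient to remove the divergence.
    u_new = np.fft.ifft2(u_hat - (dt / rho) * 1j * KX * p_hat).real
    v_new = np.fft.ifft2(v_hat - (dt / rho) * 1j * KY * p_hat).real
    return u_new, v_new, np.fft.ifft2(p_hat).real


# Usage: one step on a smooth periodic test field.
n = 32
dx = 2.0 * np.pi / n
grid = np.arange(n) * dx
X, Y = np.meshgrid(grid, grid)
u0, v0 = np.sin(X) * np.cos(Y), -np.cos(X) * np.sin(Y)
u1, v1, p1 = projection_step(u0, v0, dt=0.01, dx=dx, nu=0.1)
```

The periodic boundary conditions are what make the FFT-based pressure solve possible here: in Fourier space the Laplacian becomes a multiplication by the squared wavenumber, so step 2 reduces to one forward and one inverse transform.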
Why Immersed Boundary
Being able to modify the geometry and run simulations without having to recreate a mesh would be a great step forward in efficiency. Similarly, being able to change parameters and rerun simulations with short runtimes would be very advantageous, especially for optimization. As an added benefit, if a decomposition of the domain is not needed to develop solutions, then simpler solvers can be implemented with very high efficiency. The immersed boundary method developed by Peskin allows us to do exactly this. In Peskin's formulation, an extra forcing term is added in an attempt to enforce the desired boundary conditions in the fluid simulation (i.e. flow around the cylinder surface should match its prescribed velocity). Since the forcing terms are applied at gridpoints, a Cartesian mesh can be used with simplified solver routines. This is one of the foundations of this design project. Such a routine can be implemented and optimized while retaining accuracy, making immersed boundary an attractive choice.
Figure 1: Point Decomposition
Figure 2: Mesh of a Similar Disk
For example, in Figure 1 the points within the boundary are marked as those that provide forcing throughout the simulation (they follow prescribed motion as discussed later). However, in Figure 2 the nodes that compose the mesh of the disk would provide the Dirichlet or Neumann boundary conditions, depending on the type of simulation being run. Immersed boundary does not require the regeneration of a mesh at every new timestep in the simulation, which would require a mesh similar to that in Figure 2 to be translated (including all of the nodes) at every timestep. Given the avoidance of this task by immersed boundary, a signiﬁcant speedup in runtime is observed and forms part of the motivation behind building a code to run immersed boundary simulations.
Immersed Boundary Method and Solution to the Navier-Stokes equations
Professor Charles Peskin of NYU originated the immersed boundary method, a method of solving the Navier-Stokes equations in a complicated or smooth domain with a structured grid. Peskin originally developed the immersed boundary formulation to model the fluid flow in the heart; however, it has been widely adapted to many flow problems. His formulation is outlined in the following subsections. For this specific project, it has been coupled with the projection method outlined by Tryggvason. The full scope of the problem is now addressed.
Modiﬁcation of the Navier-Stokes
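In his formulation, Peskin modifies the momentum equation so that an extra forcing term, $\mathbf{f}$, is included. For the equations used, Tryggvason's ρ-normalized equation is adopted, where $p$ is the ρ-normalized pressure term.
$\frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u} \cdot \nabla)\mathbf{u} = -\nabla p + \nu \nabla^2 \mathbf{u} + \mathbf{f}$
$\nabla \cdot \mathbf{u} = 0$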
The addition of the forcing term in the Navier-Stokes equation allows the fluid around a certain point (with prescribed velocity) to be forced such that the prescribed velocity is observed by the fluid field [6]. Ordinarily in the Navier-Stokes equations, either a no-slip or a slip boundary condition would be prescribed via a Dirichlet or Neumann boundary condition on the object in the flow field. However, in a problem involving a moving boundary, the location of these conditions would be dependent on the orientation and location of the mesh. By using the forcing term, the boundary conditions are implicit in the formulation but do not need to have their locations respecified, as the locations of the aforementioned points provide that capability.
Developing the Forcing Term
In Peskin's original formulation of the immersed boundary method, the forcing term had a magnitude of κ|dx_b|τ, where κ is the membranous force constant, the derivative represents the curvature of the membrane, and τ represents the tangential vector. In doing so, Peskin allows for fluid-structure interaction to take place (fluid forcing the boundary as well as vice versa) [6]. However, in the case of immersed solids, both boundary and interior points matter. Therefore, a modification of Peskin's implementation as a network of springs is applied. In this alternative formulation, Peskin simulates the object via springs; this method was used by Peskin and Lai to simulate the flow around a stationary cylinder with great accuracy. The force is then given by $\kappa(x_p - x_b)$, where $(x_p, y_p)$ are the points prescribed by the user and $(x_b, y_b)$ are those that move with and force the flow field.
While this setup is useful for simple geometries and motions, higher Reynolds number flow problems require a way of ensuring that immersed boundary points do not oscillate spuriously. The harmonic oscillator-forcing mechanism implemented by Saiki and Biringen in their study of the flow around a cylinder, $f = \kappa(x_p - x_s) + \beta(v_s - v_p)$, is used, where $x_s$ and $v_s$ are the position and velocity of the solid point and $x_p$ and $v_p$ those of its prescribed counterpart. It behaves much like a damped harmonic oscillator and helps obtain convergence of the velocity while dissipating the energy exhibited by strongly oscillating particles, as can be seen in the force plots discussed later. Peskin's implementation of the forward Euler method is used to compute the integral in the code, where the position is advanced as $x_b^{n+1} = x_b^n + u_b^n \delta t$. To introduce the forcing terms into the Navier-Stokes equations, Peskin uses the Dirac delta function to transfer the boundary point's force to an area of gridpoints via a stencil of coefficients. In addition, to get the velocity of the boundary points, Peskin interpolates from the surrounding fluid velocity points via the same delta function stencil. Henceforth, the immersed boundary shall be referred to as an immersed solid.
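The spring-damper forcing and the forward Euler update can be written compactly with NumPy arrays holding one row per solid point. This is a sketch only: the function names are hypothetical, the formula and its sign convention are copied from the expression as printed above, and the values of κ and β are illustrative assumptions rather than the constants used in aeroCuda.

```python
import numpy as np


def spring_damper_force(x_p, x_s, v_p, v_s, kappa, beta):
    """f = kappa * (x_p - x_s) + beta * (v_s - v_p), evaluated per point."""
    return kappa * (x_p - x_s) + beta * (v_s - v_p)


def forward_euler(x, velocity, dt):
    """x^{n+1} = x^n + u^n * dt."""
    return x + velocity * dt


# One solid point lagging behind its prescribed point (assumed values).
x_p = np.array([[1.0, 0.0]])   # prescribed position
x_s = np.array([[0.5, 0.0]])   # solid position
v_p = np.array([[0.0, 0.0]])   # prescribed velocity
v_s = np.array([[0.0, 2.0]])   # solid velocity interpolated from the fluid

force = spring_damper_force(x_p, x_s, v_p, v_s, kappa=10.0, beta=-1.0)
x_s_next = forward_euler(x_s, v_s, dt=0.1)
```

The spring term pulls the solid point back toward its prescribed position, while the velocity term acts on the mismatch between the solid and prescribed velocities.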
Relationship between the Solid and Prescribed Points
To reiterate, there are two sets of points: the solid points, $(x_b, y_b)$, and the prescribed points, $(x_p, y_p)$. There is a one-to-one correspondence between the solid and prescribed points; each solid point tracks its prescribed point as the latter moves based on the motion specified by the user. The solid point derives its velocity from that of the fluid points surrounding it. In Peskin's formulation, each solid point receives velocity from and projects force to all gridpoints within a radius of 2 gridspaces. The velocity of a specific solid point is calculated by means of the Dirac delta function. This process is done twice: initially to calculate the damper force in the forcing equation and finally to advance the solid points. The force is projected by obtaining the velocity of each solid point and the distance between the solid point and its prescribed counterpart. These are provided as inputs into the forcing equation and a single value is obtained for each pair of solid and prescribed points. These forces are then transferred to the grid via the same Dirac delta function, except in this instance the value is spread to the surrounding fluid points to influence their motion. In sum:
Force Projection
• Obtain solid point velocities from surrounding fluid via Dirac delta function.
• Calculate forces via forcing equation with solid point velocities, prescribed point velocities, and distances between each solid point and prescribed counterpart.
• Spread force to fluid points surrounding the solid point via Dirac delta function.
Point Update
• Obtain solid point velocities from surrounding fluid via Dirac delta function.
• Use forward Euler to progress solid points by respective interpolated velocities and prescribed points by specified functional velocity.
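The interpolation and spreading operations summarized above can be sketched for a single solid point on a periodic square grid. The text does not state which discrete delta function aeroCuda uses, so this example assumes Peskin's cosine kernel, $\phi(r) = \frac{1}{4}(1 + \cos(\pi r / 2))$ for $|r| < 2$, whose support matches the 2-gridspace radius described. All function names are hypothetical.

```python
import numpy as np


def phi(r):
    """One-dimensional cosine kernel with a support of two grid spaces."""
    r = np.abs(r)
    return np.where(r < 2.0, 0.25 * (1.0 + np.cos(0.5 * np.pi * r)), 0.0)


def stencil(xs, ys, dx, n):
    """Wrapped indices and weights of the 4x4 gridpoints around (xs, ys)."""
    i0 = int(np.floor(xs / dx)) - 1
    j0 = int(np.floor(ys / dx)) - 1
    ii = np.arange(i0, i0 + 4)
    jj = np.arange(j0, j0 + 4)
    weights = np.outer(phi(ys / dx - jj), phi(xs / dx - ii))
    return ii % n, jj % n, weights  # modulo applies the periodic wrap


def interpolate(field, xs, ys, dx):
    """Velocity at a solid point from the surrounding gridpoints."""
    ii, jj, w = stencil(xs, ys, dx, field.shape[0])
    return float(np.sum(field[np.ix_(jj, ii)] * w))


def spread(force, xs, ys, dx, n):
    """Force density on the grid produced by one solid-point force."""
    grid = np.zeros((n, n))
    ii, jj, w = stencil(xs, ys, dx, n)
    grid[np.ix_(jj, ii)] += force * w / dx**2
    return grid


# Usage: a uniform field interpolates to itself, and spreading conserves
# the total force once the grid density is integrated over cell areas.
n, dx = 16, 0.1
uniform = np.full((n, n), 3.0)
u_solid = interpolate(uniform, 0.437, 1.02, dx)
f_grid = spread(2.5, 0.01, 1.55, dx, n)
total_force = f_grid.sum() * dx**2
```

Using the same stencil for both directions means the force that leaves the solid point equals the force that arrives on the grid, and a solid point sitting in a uniform flow recovers exactly that flow velocity.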
Goal and Design Phase
Goal of aeroCuda
The goal of developing aeroCuda is to design either an add-on component to an existing CFD software or a standalone CFD code that is capable of handling immersed solid implementations for transient Navier-Stokes problems. Given that the scale and types of problems could range very widely, certain targets for both user inputs and specifications were set. While the final design did not match all of these, it did satisfy the design expectations that were initially set. These are outlined in the following tables.

Table 1: Specifications
Specification | Initial | Final
Dimensionality | 3-d | 2-d
Parallelization | MPI/CUDA | 9:1 CUDA:MPI
Numerical Accuracy | > 4th-order | 1st- and 2nd-Order
Object Discretization | User-Specified | Internally-Generated
Movement of Solid Points | Specify Positions for all Time | Prescribed Motion
Object Type | Deformable | Rigid

The specifications were set to allow users to efficiently calculate solutions to problems involving rigid bodies. The efficiency comes from introducing parallelization into the code, whereby tasks are divided amongst multiple processing units rather than run on one processor. CUDA was chosen over MPI for the bulk of the parallelization as it allowed for massive parallelization of very basic arithmetic operations. Concerning the restriction to rigid bodies, deformable implementations were the initial goal; however, the concentration of points in important regions could decrease if the object expanded, leading to forcing problems (discussed in later sections). In addition, for this project it was simpler to prescribe consistent motion for the entire body; prescribing motion for all internal points would result in a drastic loss of efficiency and introduce a very complicated structure in point-dependent functions. Lastly, 2 dimensions were chosen instead of 3 because the grid sizing and execution in 3 dimensions would lead to memory limitations and slow-downs in runtime. The 3-d case can be developed if necessary.

Table 2: Solver Input and Output
Input | Output
Nodes/Connectivity | Full Variable Fields
Situational Parameters | Solver Timings
Problem Parameters | Point Locations
Functional Motion | Total Force

Part of the motivation behind this project was to place as much control as possible in the hands of the user. To this end, the user can input any 2-d surface and leave it to the software to generate the internal points. In addition, the motion can be prescribed through lambda functions, which are anonymous functions that do not require formal declarations. Of importance to the user is the CFL condition, specifically making sure that enough time- and space-refinement is used to ensure convergence and accuracy of the solution. In terms of output, almost all calculated variables and analytics are outputted either at a certain frequency or every cycle. More in-depth analysis of the software will be provided in upcoming sections.
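As an illustration of prescribing motion through lambda functions and of checking the CFL condition before a run, consider the following sketch. The dictionary layout, the function names, and the threshold of 1 are assumptions made for this example; the actual aeroCuda input interface is not shown in this section.

```python
import math

# Hypothetical motion specification: velocity components as functions of time.
motion = {
    "vx": lambda t: 1.0,                                  # steady translation
    "vy": lambda t: 0.5 * math.sin(2.0 * math.pi * t),    # periodic heave
}


def advance_prescribed(x, y, t, dt, motion):
    """Forward Euler update of a prescribed point using the lambda velocities."""
    return x + motion["vx"](t) * dt, y + motion["vy"](t) * dt


def cfl_number(u_max, dt, dx):
    """Advective CFL number u * dt / dx."""
    return u_max * dt / dx


# Usage with assumed values for the timestep and grid spacing.
dt, dx = 0.001, 0.01
x1, y1 = advance_prescribed(0.0, 0.0, t=0.25, dt=dt, motion=motion)
cfl = cfl_number(u_max=1.0, dt=dt, dx=dx)
is_refined_enough = cfl <= 1.0  # common rule of thumb for explicit schemes
```

Because the motion is just a pair of callables, changing the trajectory only requires swapping the lambdas; the time-stepping code stays unchanged.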
Reasons for Evaluation
The final structure of aeroCuda, as well as the decision to construct a CFD code from scratch, were both decided upon after evaluating and working with a number of existing CFD platforms. The initial stage of the project focused exclusively on identifying a platform on which to implement the immersed solid method and a coding language in which to develop the module. The target criterion for a platform was software whose solver routines could be directly interfaced with via external code.
Platforms Evaluated
3.3.1 Comsol
COMSOL is a widely-used industrial solver utility. It has modules available for all disciplines of engineering, including a fluid-dynamics module. It interfaces directly with Matlab, which has an extensive library of tools that would provide good support to the user. However, COMSOL would require a new solve at each timestep in order to update the locations of points and forces. In addition, COMSOL did not allow specific quantities to be placed on the field, which complicated force and point placement.
3.3.2 Ansys
Ansys is the industry standard for fluid-dynamics problems. It comes with everything from a strong CAD capability to mesh generation and a great CFD solver routine in Fluent. However, the user interface can be very complicated to operate, even for very basic test cases. The CAD interface allows for the construction of detailed geometries, but operating ICEM-CFD (the mesher) and Fluent requires a high level of understanding. The Ansys suite provides an immersed solid functionality, which allows motion to be prescribed to an object that won't deform, but has no capability for a deformable object. While it was ultimately chosen to pursue a rigid object, Ansys did not appeal due to the difficulty of engaging Fluent and working with it directly (as is needed for the immersed solid implementation).
3.3.3 openFoam
openFoam is an open-source CFD library available as a set of C++ modules that can run any type of problem. The motivation behind using openFoam is that all of the solvers are already coded, so one can go straight to implementing the immersed solid method. Additionally, the solver allows for the output of fields every certain number of cycles so that a visual analysis of the field can take place. It gave control at a very low level, which meant implementing the immersed solid method would be considerably easier with this software than with Ansys and COMSOL. Moreover, its open-source license ensured that no copyright or license violations would be incurred in modifying the software.
Language for the Module
For developing the code infrastructure to support the project, Python was the language of choice. More than just being object-oriented, Python is quite easy to code in and can easily interface with a host of packages, from visualization to parallelization. Among those useful for this project were:
• Numpy: A Python math library that allows for the development and usage of arrays, with a host of functions that can work with these arrays. It also has an FFT package embedded within.
• Matplotlib: A Python plotting library that can very quickly generate contours, vector fields, and other plots.
• Pickle/H5: These two libraries allow for quick and efficient outputting of data. In the case of pickle, Python variables output directly to a file. H5 allows for great data compression and its files (known as cubes) are very quick to write to and read from.
Moreover, many of the underlying functions used in Python libraries have already been optimized using C and C++. These two languages were also considered for the project, but interfacing them with other packages and developing visualization would have been difficult.
Working with openFoam
Mesh Generation
To generate the mesh, the blockMesh utility is called from within openFoam from the central directory. The blockMeshDict file contains all of the mesh information; the blockMesh utility finds this file and outputs the mesh files. Of the files that are outputted by blockMesh, all but the boundary file are needed to obtain a discretization of the structured grid that will be worked with. The boundary file itself is where the boundary conditions that need to be applied are specified.
Solver Utility
Once these files are produced, the solver utility testFoam is called to run the code. testFoam simply needs to be called without any command-line parameters, as it will read and output files so long as they follow the openFoam file structure. At each iteration, testFoam outputs a time file with the results of that iteration, until the total runtime of the problem is completed.
Building the Code for openFoam
Figure 3: Structure of openFoam Immersed Solid Code
The openFoam immersed solid code that was developed follows the structure shown in Figure 3. All of the modules were produced via Python. The program works as follows:
1. Reading in the User's Object: The user provides a node file and a connectivity file. This allows for point placement on the grid.
2. Parsing the Mesh File: After the blockMesh utility is called to generate the desired cartesian mesh, a parser is run to read in the mesh data. This consists of four files: the nodes, faces, cells, and neighbors. A parser runs on each file and stores the data, which is outputted via the pickle module to a data folder. Each specific value is stored as a key in a dictionary, and the relevant info (node coordinates, connectivity) as the items belonging to that dictionary key.
3. Placing the Points on the Grid: Once the object data is obtained, a series of modules is run to determine which faces are closest to the object's points. This is done by iterating through all of the faces and finding the one whose centroid has the lowest distance to the object point. Once the centroids and, consequently, the cells are specified, a boundary condition file is generated.
4. Developing the Boundary Condition File: Depending on what the user specified for motion, a file with all of the boundary conditions (patch, cell number, value) is outputted. The file has all of the boundary conditions listed in sequential order with the relevant data and can be parsed by the csv.reader function in Python.
5. Parsing Input File: The user should have a file with a list of all of the boundary conditions that need to be specified in the program; this would consist of a list of patches (or face boundary conditions) which would have the information. In addition, the file should also contain specifications of the mesh (grid size, spacing) and solver (time step, fluid specifications). These files will be parsed by an input module and the data will then be generated into a boundary condition file.
6. Generating the Initial Field: Once the boundary conditions are read, the mesh data is used to generate the initial field, which represents the problem at the initial time of 0, with the boundary conditions reflected in the same manner.
7. Run the Solver Loop: The solver loop is executed for 1 iteration. At the next iteration, the same process takes place.
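Step 3 above, matching each object point to the face whose centroid is nearest, can be sketched with Numpy. This is a minimal illustration rather than the code used in the project: the function name `nearest_faces` and the array layouts are assumptions, and a brute-force distance matrix stands in for the face-by-face iteration described in the list.

```python
import numpy as np

def nearest_faces(object_points, face_centroids):
    """Return, for each object point, the index of the closest face centroid.

    object_points  : (n, 2) array of x, y coordinates (assumed layout)
    face_centroids : (m, 2) array of face-centroid coordinates (assumed layout)
    """
    # Pairwise differences, shape (n, m, 2), via broadcasting.
    diff = object_points[:, None, :] - face_centroids[None, :, :]
    # Squared distances are enough to rank the candidates.
    dist2 = np.sum(diff**2, axis=2)
    return np.argmin(dist2, axis=1)

# Hypothetical data: four face centroids and two object points.
centroids = np.array([[0.5, 0.0], [1.0, 0.5], [0.5, 1.0], [0.0, 0.5]])
points = np.array([[0.45, 0.1], [0.9, 0.55]])
face_ids = nearest_faces(points, centroids)
```

The distance matrix costs memory proportional to the number of points times the number of faces, so for large meshes a spatial search structure would likely be preferable.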
Issues with openFoam
The full immersed solid formulation was not implemented in openFoam, as it became apparent upon running a first iteration of the code that the software was not a suitable choice. Multiple issues arose:
1. openFoam had its own file structure formats, and consistently developing the input files was not only costly (at every iteration) but also prone to errors (if even one letter was off or there was an errant space, the program would crash).
2. openFoam required a structured mesh with the above implementation. Because the forcing term would be implemented via patches (face boundary conditions), it soon became apparent that every face would have to be specified with a certain value; linking all of the faces together was not possible unless the mesh was structured in this way.
3. The patch method was very quick to generate continuity errors; while these might have been resolved, the amount of time available for this project made it infeasible to pursue this issue further.
4. The openFoam data fields were structured in the software's specific file format, and while openFoam had a graphical user interface known as ParaView to do post-processing, it simply wasn't feasible to use this for all analyses, as a faster development loop was desired.
Development of aeroCuda
Having evaluated the issues with openFoam, it seemed the best decision was to develop a code from scratch that would be malleable, effective, and efficient for users of all backgrounds. The final structure of the code drew its influence from the codes developed by Peskin and from openFoam. From Peskin's code, the structure of the solver routine as well as the force projection, interpolation, and advancement of the solid points were incorporated into the final version. From openFoam, the data input/output structure was adapted. Given that these codes were robust and, in the case of openFoam, well-established, they would serve as good templates. A notable part of the final structure is the solving of the problem on an Nvidia GPU, which allows for very significant parallelization and provides immense speedups in the solution phase. On his website, Peskin provided Matlab code that simulated the problem of an immersed elastic membrane forcing fluid via a tensile force (forcing is proportional to $\kappa \left| \frac{d^2 x}{d b^2} \right|$). The code served as
a template for how the immersed solid software might be structured. Since it was written in Matlab, the code was translated to Python to get a feel for which Python functions and modules would play a critical role in the CFD package. Among those that proved useful were Pylab, Numpy, and Scipy, which provided vectors for handling the data as well as array-wise operations. Of particular note were the pointer-referencing issues that arose in Python but not in Matlab. In Matlab, when a variable is set to take on the value of another variable, it receives the value by copying, not by direct memory reference. In Python, however, assignment passes a reference to the same memory unless a copy function is called to create a duplicate of the value. Therefore, in certain cases involving function calls and variable storage, the code needed to be modified to ensure that the original variable was not altered during the update process. From the Python version of Peskin's code, a few important points emerged that would figure in the development of an immersed solid code:
• Peskin's implementation was for an immersed membrane, but this project's goal was to simulate immersed solid bodies. A key difference is that the fluid inside the body surface does not move if the delta stencils of the boundary points do not cover the full interior.
• Peskin's code used for-loops and other runtime-expensive mechanisms, which led to high runtimes for large grids and/or a large number of immersed solid points.
• Peskin's code provided a template where reconfiguration and adaptation (e.g. for array operations) could provide serious optimization. The areas with the greatest potential for optimization were the choice of solution algorithm and the parallelization of the code.
• Spurious oscillations occurred within the code when different situations were implemented, e.g. a wider membrane radius.
With respect to the last item, a paper by Saiki and Biringen noted that spectrally discretized flow solvers tend to produce spurious oscillations. In Peskin's code, Fourier transforms were widely used to solve the equations. This claim by Saiki and Biringen motivated the usage of Tryggvason's formulation of the projection method.
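The referencing issue encountered when translating from Matlab can be demonstrated in a few lines. The variable names below are illustrative only; the point is that plain assignment of a Numpy array binds a second name to the same memory, whereas an explicit copy creates independent storage.

```python
import numpy as np

u = np.zeros(3)

alias = u            # no data is copied: both names refer to the same array
alias[0] = 1.0       # so this assignment also changes u

snapshot = u.copy()  # explicit copy: independent storage
snapshot[1] = 5.0    # u is unaffected by this assignment

shares_memory = np.shares_memory(u, alias)
```

In an update loop this means that a "previous step" variable should be stored with a copy, otherwise it silently tracks the array being updated.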
Figure 4: Structure of Input
For the input, there are 3 main components (outlined in Figure 4). First, nodes that define the surface of the object are needed. These will serve as a portion of the prescribed points (more might be needed, as explained in the following section). Second, the connectivities of these points are required as well, to help in guiding appropriate distance-checking and interpolation between consecutive nodes. The nodes are inputted as an n x 2 array (x-coordinates in one column, y-coordinates in the other) and the connectivities are also n x 2, where the ith row has the IDs of the 2 points that the ith node attaches to. Lastly, the parameters of the solve need to be provided. These range from the constants to grid-spacing, as well as the specifications for the GPU (thread configuration per block, block configuration per grid).
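As an illustration of the input layout just described, the nodes and connectivities of a unit square might be written as follows. This is a sketch under assumptions: zero-based node IDs and the ordering "previous neighbour, next neighbour" within each connectivity row are choices made here, not conventions confirmed by the text.

```python
import numpy as np

# Four surface nodes of a unit square: column 0 is x, column 1 is y.
nodes = np.array([[0.0, 0.0],
                  [1.0, 0.0],
                  [1.0, 1.0],
                  [0.0, 1.0]])

# Row i holds the IDs of the two nodes that node i attaches to.
n = len(nodes)
ids = np.arange(n)
connectivity = np.column_stack(((ids - 1) % n, (ids + 1) % n))

# Length of the segment from each node to its "next" neighbour, which is the
# quantity a spacing check would compare against the tolerance.
segment_lengths = np.linalg.norm(nodes[connectivity[:, 1]] - nodes, axis=1)
```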
Figure 5: Structure of Pre-Processing
The pre-processing phase is broken into multiple steps, as shown in Figure 5. First, the nodes and connectivities are checked for spacing (with a tolerance prescribed by the user). This alerts the user to problematic spacing. Second, the nodes and connectivities are passed to the "Complete" module, and wherever the spacing between two connected nodes is greater than the actual grid spacing, enough points are generated between the two nodes via interpolation until the gap is sufficiently small. Once the surface is closed, points inside the bounding surface are generated. Regardless of whether the user wants to rotate the orientation of the object, the rotation module is run to retrieve the angles and radii (relative to the specified origin of motion) of the points. These are important if any angular velocity is prescribed. Lastly, each point is given its own spring constant to keep it as close as possible to the prescribed point's position; the reason is that external points have fewer points to rely on for additional forcing and therefore need a higher spring constant.
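The gap-closing step of pre-processing can be sketched as below. This is an illustrative version, not the "Complete" module itself: the function name `fill_gaps` is hypothetical, each segment is divided into equal pieces no longer than the tolerance `h` (a slight variation on placing a point exactly every `h` units), and only the second connectivity column is followed so that each segment is visited once.

```python
import numpy as np

def fill_gaps(nodes, connectivity, h):
    """Insert points along any segment longer than h.

    nodes        : (n, 2) array of surface coordinates
    connectivity : (n, 2) integer array; column 1 is taken to be the "next"
                   node of each row (an assumption of this sketch)
    h            : maximum allowed spacing between consecutive points
    """
    filled = []
    for i, start in enumerate(nodes):
        end = nodes[connectivity[i, 1]]
        length = np.linalg.norm(end - start)
        # Number of equal sub-segments so that none is longer than h.
        pieces = max(int(np.ceil(length / h)), 1)
        # Parameter values along the segment, excluding the end point, which
        # is emitted as the start of the segment that begins there.
        t = np.arange(pieces) / pieces
        filled.append(start + t[:, None] * (end - start))
    return np.vstack(filled)

square = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
conn = np.array([[3, 1], [0, 2], [1, 3], [2, 0]])
dense = fill_gaps(square, conn, h=0.25)
```

For the unit square with `h = 0.25`, each side is split into four pieces, giving 16 surface points in total.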
Figure 6: Structure of Solver Loop
At the conclusion of pre-processing, the solver loop is engaged. It is a repeat of 6 steps that feed data in and out, all shown in Figure 6. In the first step, the velocities of the solid points are obtained via delta stencil interpolation from the variable fields. Once obtained, the forces on each solid point are calculated via the forcing equation and projected to the grid. Next, the equations are solved via the projection method: the intermediate velocity, pressure field, and velocity correction. Lastly, the final velocities are obtained via interpolation, and both the prescribed and boundary solid points are translated by their respective velocities. Note that all calculations take place on the GPU to optimize their runtimes.
The post-processing takes place as the solver loop executes, as detailed in Figure 7. There are two types of outputs. Those of type 'Transient' are produced with each execution cycle. Those of type 'Frequency' are produced after a certain (user-specified) number of cycles has executed. Frequency outputs tend to contain a lot of data and therefore should only be written after a large number of cycles; otherwise a slowdown in runtime and massive memory consumption will result. The idea of Frequency outputs was taken from openFoam, as it seemed the most logical way to view variables without incurring the aforementioned costs.
Figure 7: Structure of Post-Processing
Pre-Computation: Interior Point Generation and Rotation Capabilities
Motivation behind Interior Point Generation
In the immersed solid formulation, interior points need to be specified inside the 2-d or 3-d geometry to force the fluid internally, as suggested by Peskin in correspondence. To put this in perspective, if a circle is moving at a velocity u, it should have to force the fluid on its outside only; the interior points should be moving at the same velocity u. If interior points are not specified, then the velocity in the interior of the circle will not be at u, as no force will be present to move the fluid at the velocity u; this is the case with a moving membrane, which is not the focus of this project. To make the task easier for the user, the code requires a 2-d surface to be passed in and develops the interior points afterwards.
5.3.2 Interpolating the Surface of the Geometry
Since the immersed solid relies on points forcing the fluid, it is important that points completely enclose the object at hand to prevent fluid from penetrating the intended boundary. To handle this issue, an interpolation module is implemented to close gaps in the surface. It takes in a list of nodes and connectivities to generate the surface of the object. Once completed, each point has its connectivity checked to ensure that the distance between two nodes is less than a certain amount (for best results, this should be smaller than the grid-spacing). If the distance between two nodes is too large, a linear interpolation scheme is implemented by traversing a vector between the nodes and placing a point every h units, where h is a tolerance defined by the user. In doing so, it is ensured that the object has no compromising gaps.
Figure 8: Cloaking from Different Directions
5.3.3 Developing the Cloaking Mechanism
Cloaking is a mechanism developed to help construct a point cloud that most closely resembles the object's geometry. The mechanism is illustrated in the following figure. The principle behind cloaking is to isolate all points that lie within a boundary by using normal vectors from all 4 sides. The nodes are mapped to locations on the grid via the prescribed spacings, with a magnitude of 1 (the gridpoints holding a node are initialized to 1, all others to 0). Cumulative sums are then executed from all 4 sides of the grid, using the Numpy.cumsum function. Therefore, any point which lies in the normal vector direction from a boundary point will have a value greater than 0, due to the cumulative sum. At the end, the grids are examined for those points with a nonzero value in all four runs; these points form the point cloud that composes the object. The drawback to cloaking is that if the tolerance for cloaking is less than the interpolation spacing, then there will be gaps in the solid, which may reduce the effectiveness of the mechanism. This process is detailed in Figure 8. The dark blue portions of the figures represent those points where the sum is 0; where there is color (ranging from blue to red), the value of the sum is greater than 0 (the closer to red, the higher the sum). All points with nonzero values in all 4 cases are taken to form the body of the object.
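The four-sided cumulative sum can be sketched as follows. This is a minimal illustration under assumptions: the function name `cloak` is hypothetical, the surface points are taken to be already mapped to integer (row, column) grid indices, and the boundary gridpoints are marked with 1 on a grid of zeros.

```python
import numpy as np

def cloak(boundary_ij, shape):
    """Return a boolean mask of the gridpoints enclosed by the boundary.

    boundary_ij : (n, 2) integer array of (row, column) indices of the
                  surface points on the grid
    shape       : (rows, columns) of the grid
    """
    marks = np.zeros(shape, dtype=int)
    marks[boundary_ij[:, 0], boundary_ij[:, 1]] = 1

    # Cumulative sums swept from each of the four sides of the grid.
    from_left = np.cumsum(marks, axis=1)
    from_right = np.cumsum(marks[:, ::-1], axis=1)[:, ::-1]
    from_top = np.cumsum(marks, axis=0)
    from_bottom = np.cumsum(marks[::-1, :], axis=0)[::-1, :]

    # A point belongs to the body only if every sweep has passed a boundary
    # point by the time it reaches that point.
    return (from_left > 0) & (from_right > 0) & (from_top > 0) & (from_bottom > 0)

# Boundary of a square with corners (2, 2) and (6, 6) on a 9 x 9 grid.
k = np.arange(2, 7)
ring = np.concatenate([
    np.column_stack((np.full(5, 2), k)),   # top edge
    np.column_stack((np.full(5, 6), k)),   # bottom edge
    np.column_stack((k, np.full(5, 2))),   # left edge
    np.column_stack((k, np.full(5, 6))),   # right edge
])
body = cloak(ring, (9, 9))
```

For this square the mask selects the boundary together with its interior. For non-convex shapes the same test can also mark points in concave pockets, which is consistent with the later observation that cloaked points may cross the boundary.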
Developing the Delaunay Mechanism
An alternative method to the cloaking mechanism is the Delaunay Triangulation Method. While this was originally developed to help form meshes, it has been adapted here to develop interior points for an arbitrary 2-d geometry. As adapted and modified as necessary from the notes of Tautges, the algorithm is as follows:
1. Identify an interior point (find the average (x, y) coordinate).
2. Initialize arrays to keep track of point ids, (x, y) locations, and those checked for neighbors.
3. Starting with the central point, check to see if any interior/boundary points exist in the up, down, left, and right directions based on a radius r. If so, create a new point entry and log its x, y coordinates along with a checked status of 0 (empty).
4. Repeat the previous step until no new points have been added after a certain number of executions.
The benefit of using the Delaunay method is that the generated points can very quickly conform to the boundary of the object without distorting its actual surface just to fill the interior. In addition, the tolerance can be adjusted to help ensure that the boundary is matched quite nicely. While the implemented algorithm does not involve any adaptive point generation, such a capability can be implemented in the code and would allow for more robust results.
5.3.5 Comparing the Delaunay and Cloaking Mechanisms
The following figure best depicts the effects of both mechanisms using different spacings on a NACA 6716 airfoil. At first glance, the Cloaking mechanism appears to provide more than enough points for the interior, but some of them do not stay inside the shape, that is, they cross the boundary (though the violation is not too apparent). In the case of the Delaunay method, fewer points are provided but they remain inside the boundary. While grid- and point-spacing certainly affect the outcome of the immersed boundary, conforming to the body of the object is important in CFD, regardless of the problem being solved. However, in the immersed boundary method, it is important that the points defining the boundary be supported by interior points. In essence, since the points are moving in the same general direction at the same general speed, their force contributions will be split across the surrounding boundary points. Therefore, the point generation mechanism must be able to place points very closely to the boundary. Since the Cloaking mechanism does this more efficiently, it is used to generate the point clouds for the following simulations.
Figure 9: Interior Point Generation using Both Mechanisms
5.3.6 Implementing the Rotation Algorithm
To allow the user to test different angles of attack or orientations, a rotation module was implemented to provide the geometry with a certain angular orientation. The general structure behind the rotation algorithm is as follows:
1. Calculate where the central point of the geometry lies.
2. Shift the entire object to be centered over the origin.
3. Get the distance from the origin to all points.
4. Get the angles of all points relative to the origin by converting them to complex vectors and using the angle function in Python.
5. Add the desired theta to all of the angles.
6. Use the r(cosθ, sinθ) formulation to regenerate the points.
7. Shift them back to the original central point.
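The seven steps above translate almost directly into Numpy. The sketch below is illustrative: the function name `rotate` is hypothetical, and the "central point" is taken to be the mean of the point coordinates, which is an assumption (a user-specified origin of motion could be substituted).

```python
import numpy as np

def rotate(points, theta):
    """Rotate an (n, 2) point cloud by theta radians about its central point."""
    centre = points.mean(axis=0)                 # step 1 (assumed: the mean)
    shifted = points - centre                    # step 2
    z = shifted[:, 0] + 1j * shifted[:, 1]       # complex form of each point
    r = np.abs(z)                                # step 3
    angles = np.angle(z) + theta                 # steps 4 and 5
    rotated = np.column_stack((r * np.cos(angles),
                               r * np.sin(angles)))  # step 6
    return rotated + centre                      # step 7

square = np.array([[0.0, 0.0], [2.0, 0.0], [2.0, 2.0], [0.0, 2.0]])
turned = rotate(square, np.pi / 2)
```

Rotating the square by a quarter turn maps each corner onto the next one counter-clockwise, which gives a quick sanity check of the sign convention.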
Developing the Solver In Serial Code
The projection method has 5 steps that need to be solved. The algorithm presented here summarizes the full solution detailed in the Solver-Loop subsection of the Structural Overview section:
1. An interpolation of velocities from the field and a projection of the calculated forces to the field
2. An explicit solve for the intermediate velocity
3. An implicit solve for the pressure field to correct the intermediate velocity
4. An explicit solve for the final velocity via pressure correction
5. An interpolation of velocities from the field and an update of the prescribed and solid locations
5.4.1 Implementing the Projection Method: Steps 2 and 4
Steps 2 and 4 are the easiest to implement since they are explicit and involve shifting operations. For the simulations of this project, it is important to note that periodic boundary conditions are enforced, so over a domain of size [0, L] × [0, L], the conditions x(0) = x(L) and y(0) = y(L) hold for all variables and their derivatives. It is therefore important to make sure that the cells on the boundaries read their data from those on the opposite side if the applied operator requires a cell past the boundary. In Python, such an operation can be implemented via the Numpy.roll function. This function allows for the shifting of an array of n dimensions along a specific axis and by a certain magnitude. Therefore, the second step of the algorithm was laid out as follows.
• u_n = field velocity (u_x, u_y) at step n, f = force field, u_s = intermediate velocity (u_s, v_s), δx = x-spacing, δy = y-spacing, ρ = density, δt = timestep, ν = viscosity
• Define the function partial-first(variable, spacing, magnitude, axis): (roll(variable, -1, axis) - roll(variable, 1, axis))/(2*spacing)
• Define the function partial-second(variable, spacing, magnitude, axis): (roll(variable, -1, axis) - 2*variable + roll(variable, 1, axis))/pow(spacing, 2)
• u_s = u_n + δt*(-1*(partial-first(u_n, δx, 1, 2)*u_x + partial-first(u_n, δy, 0, 2)*u_y) + ν*(partial-second(u_n, δx, 1, 2) + partial-second(u_n, δy, 0, 2)) + f/ρ)
Likewise, for the fourth step of the algorithm:
• u_{n+1} = field velocity at step n + 1, p = pressure
• u_{n+1} = u_s - δt*partial-first(p, ***, 1, 2), where *** denotes the spacing for the relevant axis (x-axis: δx, y-axis: δy)
In utilizing the Numpy.roll function, two benefits are gained. First, because the Numpy functions are implemented in compiled code and operate array-wise, the cost of iterating through the array via looping is avoided. Secondly, the roll function implicitly accommodates periodic boundary conditions, helping to avoid conditional statements to ensure the nodes on the boundary and interior are treated properly.
5.4.2 Implementing the Projection Method: Step 3
In the description of the algorithm used, it was outlined that the FFT was used to solve the Poisson equation. However, this was only arrived at after considering the implementation of the matrix solution method. The Poisson equation, given as
$$\frac{\partial^2 p}{\partial x^2} + \frac{\partial^2 p}{\partial y^2} = f,$$
takes the following form when discretized via finite differences:
$$\frac{p_{i,j+1} - 2p_{i,j} + p_{i,j-1}}{(\Delta x)^2} + \frac{p_{i+1,j} - 2p_{i,j} + p_{i-1,j}}{(\Delta y)^2} = f_{i,j}.$$
A matrix method like BICGSTAB can be utilized to solve this equation. The coefficient matrix would have five bands, since there are five variables involved in each equation, shown in Figure 10. From a computational perspective, this means that for every point on the computational grid, there are 5 values to be stored in the matrix. Since the smallest grid used is of size (512,512),
about 10 MB is allocated for the coefficient matrix. While a method like BICGSTAB can indeed work with a coefficient matrix of this size, it would require many iterations in addition to ensuring that memory allocation is not a problem (creating the matrix outlined resulted in a MemoryError being raised by Numpy). Since a speedy and accurate CFD solution is desired, and one that does not require massive amounts of cores to run, implementing a spectral solution to the Poisson equation is an efficient way of obtaining a good solution to the equation.
Figure 10: Coefficient Matrix Structure for Poisson Equation on 8 Node x 8 Node grid
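The roll-based stencils of Steps 2 and 4 and a spectral solve for Step 3 can be sketched together on a periodic grid. This is an illustration under assumptions, not the project's solver: the function names are hypothetical, the `magnitude` argument of the original stencil functions is dropped, arrays are laid out as (y, x) so that axis 1 is the x-direction, and the Fourier symbol of the five-point stencil is used so that the FFT result satisfies the finite-difference equation exactly. The text does not state which spectral form was actually used.

```python
import numpy as np

def partial_first(var, spacing, axis):
    """Second-order central first derivative on a periodic grid."""
    return (np.roll(var, -1, axis) - np.roll(var, 1, axis)) / (2 * spacing)

def partial_second(var, spacing, axis):
    """Second-order central second derivative on a periodic grid."""
    return (np.roll(var, -1, axis) - 2 * var + np.roll(var, 1, axis)) / spacing**2

def solve_poisson_fft(f, dx, dy):
    """Solve the five-point discrete Poisson equation with periodic boundaries."""
    ny, nx = f.shape
    kx = 2 * np.pi * np.fft.fftfreq(nx)      # radians per grid cell in x
    ky = 2 * np.pi * np.fft.fftfreq(ny)      # radians per grid cell in y
    # Fourier symbol of the five-point Laplacian stencil.
    symbol = (((2 * np.cos(kx) - 2) / dx**2)[None, :]
              + ((2 * np.cos(ky) - 2) / dy**2)[:, None])
    symbol[0, 0] = 1.0                       # avoid dividing by zero
    p_hat = np.fft.fft2(f) / symbol
    p_hat[0, 0] = 0.0                        # fix the mean of p to zero
    return np.real(np.fft.ifft2(p_hat))

n = 64
dx = dy = 1.0 / n
y, x = np.meshgrid(np.arange(n) * dy, np.arange(n) * dx, indexing="ij")
f = np.sin(2 * np.pi * x) * np.cos(4 * np.pi * y)    # zero-mean source term
p = solve_poisson_fft(f, dx, dy)

# Residual of the discrete Poisson equation.
residual = partial_second(p, dx, 1) + partial_second(p, dy, 0) - f
max_residual = np.abs(residual).max()
```

Because the periodic problem only determines the pressure up to a constant, the zero-frequency coefficient is set to zero; the source term must have zero mean for a solution to exist.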
Implementing the Interpolation Step
The delta function stencil is a 4x4 stencil, but it has uniform x and y values which are multiplied together with the field values. In his code, Peskin conducted this interpolation as follows:
1. Calculate location of point, radius, and other necessary parameters
2. Iterate through all of the points
3. Multiply stencil by field values and get the total sum
For a quantity of 1000 points, the time to execute such a loop would be very large. Therefore, it is important to vectorize these calculations and avoid looping to produce quick iterations. To do this, the stencil should be examined: it is a combination of 16 coefficients multiplied by 16 corresponding values from the field. For each immersed solid point there are 4 unique delta values in the x-dimension and 4 unique delta values in the y-dimension. Therefore, for each immersed solid point two 4 × 4 arrays are generated, one with the x-values uniform across the rows and the other with the y-values uniform across the columns. Once obtained, the x-value arrays are stacked on top of each other and the y-value arrays aligned next to each other using the Numpy.column_stack and Numpy.row_stack functions, respectively. This gives two 4n × 4 size matrices. To get the full delta values, another set of matrices, one that has the x-values and y-values of the corresponding points, is generated. Now the Numpy.flatten function is used on the arrays to convert them all to 1-dimensional vectors (i.e. flatten(2 × 16 vector) = 1 × 32 vector). Numpy lends motivation to this idea, as an array of values can be yielded by some variable if a 1-dimensional array or list (multiple dimensions are not supported) is passed as the index. The delta values become relatively easy to work with, as the list of relevant field values is multiplied by the x- and y-delta vectors. The resulting values are then converted back to n × 16 matrices using the Numpy.reshape function, and the Numpy.sum function is executed across the 1 axis (horizontal, or row-wise) to add up all of the values and return the relevant u and v velocities for each solid point.
5.4.4 Implementing the Forcing Field
The projection of the forcing field onto the grid is similar to the interpolation step, except in this case values are passed to the grid instead of taken from it. Assuming that the forces have been calculated for all solid points, the force value at each solid point needs to be projected to the surrounding points using the delta function. In addition, this property is additive, meaning that other points in the vicinity might be affecting the same gridpoint and so the forces will need to be added together. First, a force variable of the grid's size is initialized to 0 and converted to a 1-d vector via the Numpy.flatten command. The same delta-stencil and global location arrays are implemented as in the interpolation step. However, instead of retrieving values from the grid, similarly-sized field value arrays are created by repeating the force values in sets of 16; therefore, if the array is [1,2,3...] the new array will have the 0-15th indices corresponding to 1, the 16-31st indices corresponding to 2, and so forth. These are multiplied by the delta-matrices, yielding the stencil values. The global location values are then used to initialize a defaultdict dictionary pointing to a list. A defaultdict is an object in Python that allows one to place values under certain keys based on an object type, like a list or a float. This suits the need well, since grouping by location is desired. A Python generator (which takes much less time than a for-loop since it does not create the object in memory) is therefore used to iterate through the global location vector and place the stencil values with their appropriate locations. The stencil values are then summed at each point using another defaultdict, except one that is initialized to a float (think of this as the reduce portion of map-reduce). Since the global locations are stored as the dictionary keys and the force magnitudes as their values, passing these to the force grid is an easy process. Both keys and values can be isolated as lists, and the relevant gridpoints can be augmented by passing the keys directly to the force grid (Force-grid[keys]) and adding the values (Force-grid[keys] += values). The reshape function is then used to reshape force-grid to the size of the domain.
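The vectorized interpolation and force projection can be sketched together, since both share the same 16-entry stencil per solid point. Several things here are assumptions made for illustration: the function names are hypothetical, the cosine form of a four-point delta function is used (the form used in aeroCuda may differ), the grid is square and periodic with fields stored as (y, x), and `numpy.bincount` is swapped in for the defaultdict grouping described above because it performs the same group-and-sum on flat indices.

```python
import numpy as np

def phi(r):
    """Cosine form of a four-point discrete delta function (assumed form)."""
    r = np.abs(r)
    return np.where(r < 2.0, 0.25 * (1.0 + np.cos(0.5 * np.pi * r)), 0.0)

def stencil(points, h, n):
    """Flat grid indices and weights of the 4 x 4 stencil of each point.

    points : (m, 2) array of x, y positions on a periodic n x n grid
    Both returned arrays have shape (m, 16).
    """
    s = points / h                                  # positions in grid units
    base = np.floor(s).astype(int) - 1              # lowest stencil index
    offsets = np.arange(4)
    ix = base[:, 0, None] + offsets                 # (m, 4) column indices
    iy = base[:, 1, None] + offsets                 # (m, 4) row indices
    wx = phi(s[:, 0, None] - ix)                    # (m, 4) weights in x
    wy = phi(s[:, 1, None] - iy)                    # (m, 4) weights in y
    m = len(points)
    weights = (wy[:, :, None] * wx[:, None, :]).reshape(m, 16)
    flat = ((iy % n)[:, :, None] * n + (ix % n)[:, None, :]).reshape(m, 16)
    return flat, weights

def interpolate(field, points, h):
    """Interpolate a grid field to the solid points."""
    flat, weights = stencil(points, h, field.shape[0])
    return np.sum(field.ravel()[flat] * weights, axis=1)

def spread(forces, points, h, n):
    """Spread point forces to the grid as a force density."""
    flat, weights = stencil(points, h, n)
    values = forces[:, None] * weights / h**2
    # bincount sums every contribution that shares a flat grid index.
    grid = np.bincount(flat.ravel(), weights=values.ravel(), minlength=n * n)
    return grid.reshape(n, n)

n, h = 32, 1.0 / 32
pts = np.array([[0.31, 0.52], [0.33, 0.50], [0.99, 0.01]])
u_at_pts = interpolate(np.full((n, n), 3.0), pts, h)
density = spread(np.array([1.0, 2.0, -0.5]), pts, h, n)
```

Two properties make convenient checks: a constant field interpolates to the same constant, and the spread force density integrates (sum times h squared) to the total applied force, even when stencils overlap or wrap around the periodic boundary.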
Code Reﬁnements and Optimization
The Variable-Spring Model
In the immersed solid method, the outermost layer of solid points is responsible for breaking the flow as the object moves. Consequently, these points are also the ones that shift positions the most (due to fluid forces) and are thereby most likely to begin a chain of displacement within the layers of surrounding points. The easy solution would be to raise all the spring constants to massive levels; however, this is not feasible since the object (at the beginning of its motion) would be destroyed by a massive spring force from the initial motion. However, by raising the stiffness of those points in areas with fewer solid points, more force can be exerted by those points to compensate for the compounding effect of having multiple points forcing the same gridpoint. Raising stiffness also ensures that the solid points will closely follow the prescribed points, with higher forces being the penalty for widening distances. Therefore, the variable spring model is proposed.
6.1.2 Underlying Principle
In the variable-spring model, spring constants are inversely related to the number of surrounding points. The reason traces back to Peskin's delta function. Since it is a 4x4 stencil, neighboring points are more than likely to overlap on the same gridpoints; as a result, their forces will compound, applying a much stronger spring force than an individual point alone. However, if a point is rather secluded in the geometry (on the surface or at the pointed end of an airfoil), that point must have its spring constant raised to compensate both for the fewer surrounding points and for having to deal with the boundary layer.
In the variable-spring model, the algorithm implemented is as follows:
• Produce the distance vector of one point to all of the points in the object.
• Run a logic statement to find those within a specified radius.
• Sum up the logic "1" values to find the total.
• Repeat the above for all points in the solid.
• The user prescribes a slope to apply based on the number of surrounding points and an initial κ_o.
• Let Max denote the largest number of solid points within the specified radius of a solid point and Surr denote the number of solid points within the specified radius of the specific solid point being dealt with. Let m denote a slope constant prescribed by the user to specify how much the spring constant should be raised for every point lying in the vicinity.
• Once the maximum number of surrounding points is identified, assign the spring constant: $\kappa_i = \kappa_o \left(1 + m\,(\mathrm{Max} - \mathrm{Surr} + 1)\right)$
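The steps above can be sketched in a vectorized form. This is a minimal illustration with a hypothetical function name; it builds the full distance matrix at once instead of repeating the distance vector point by point, and each point counts itself as a neighbour because its distance to itself is zero (a choice made in this sketch).

```python
import numpy as np

def variable_spring_constants(points, radius, kappa_0, m):
    """Assign a spring constant to each solid point from its neighbour count.

    points  : (n, 2) array of solid-point coordinates
    radius  : search radius for surrounding points
    kappa_0 : base spring constant
    m       : user-prescribed slope
    """
    # Full distance matrix; row i is the distance vector of point i.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt(np.sum(diff**2, axis=2))
    # Logical test followed by a sum of the "1" values.
    surr = np.sum(dist <= radius, axis=1)
    # kappa_i = kappa_0 * (1 + m * (Max - Surr + 1))
    return kappa_0 * (1.0 + m * (surr.max() - surr + 1))

# Three clustered points and one isolated point.
cloud = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
kappa = variable_spring_constants(cloud, radius=0.5, kappa_0=100.0, m=0.5)
```

In this example the clustered points each see three points within the radius and receive a constant of 150, while the isolated point sees only itself and receives 250.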
While the algorithm itself is not optimized for speed, it is easily parallelizable, mainly because the operations employed in the solution are basic arithmetic steps that involve data from multiple points. Since the algorithm has to be repeated for all points, the process can be executed by n processors if the points are split up into n groups. Each processor will then work on its group and return the values. Things are made easier by the MPI scatter and gather functions, which allow for the groups to be sent to their respective processors and the same groups to be returned in the right order, respectively. Therefore, there is no issue with synchronization and the order of retrieval. The values are simply passed out, the function executed, and the outputs gathered and concatenated into a 1-d vector with length equal to the number of immersed solid points. The algorithm above was initially implemented in a serial code. However, it took considerably long to run, even for the most basic cases. Therefore, the focus shifted to optimizing the code via parallel processing. To this end, 2 options existed: using MPI or Nvidia's CUDA GPU computing platform.
Evaluation of MPI
If MPI were used, the structure of the program would be as follows. For the interpolation scheme, the immersed solid points are scattered (broken up into n arrays for n processors to work on) amongst the different processors and the velocities are gathered (collecting the processors' computed values) back. For the force projection, the force grid (as an array of zeros) would be broadcast (the same copy sent) to all processors and each processor would add its projections to the grid. The outputs would be gathered and added up to obtain the grid values. The calculation of the intermediate velocity could be implemented via a domain decomposition method with ghost cell transfers. The most difficult step would be the Poisson equation, as this would require an MPI version of BICGSTAB to be implemented. The spectral Poisson solution would be a waste to implement via MPI, as the FFT is essentially a global operation. An implementation of an FFT algorithm might involve some sort of master-slave algorithm where one processor serves as the distributor of data that needs to be processed. As the other processors execute jobs, the central processor retrieves the data from the completed processors and provides new data to be processed. This process continues until the full operation is complete. Once done, the central processor would have to transpose the matrix and then pass out new arrays to have the FFT run. This would result in a lot of code to implement which might not even offer a speedup. Given the goals of this project, this would detract from the malleability of the code and also prevent it from running even faster.
6.2.2 Evaluation of CUDA
If CUDA were used, the structure would be as follows. CUDA grants control of an individual thread, of which there are millions on a GPU, enabling fine-grained control of each grid point value. Therefore, the code can be parallelized at a level which would not be possible with MPI (or would be possible, but would require a vast amount of resources and code). For the interpolation code, one point could be assigned to each thread, whose job would entail computing the full stencil for that specific point and returning the velocity, eliminating the need for vectorization. For the forcing implementation, the same process would be used, but the values would be stored in an array with a corresponding global id array, so that a group of threads could run the reduction very efficiently. The intermediate velocity calculation could also be run very quickly, as each thread needs to read the values from the cells surrounding it and execute two lines of operations. The Poisson equation can be solved using FFT libraries that exist with Python bindings to CUDA. The velocity correction could be executed in a manner similar to the intermediate velocity calculation step. The source code required for CUDA (though more complicated) would be concise, but it would also help in another way. With CUDA, gpuarrays (pointer references to arrays) are allocated and left in device memory, avoiding the necessity of having to pass memory back and forth between the host and device (this can be avoided in MPI but would take much longer to implement and be much more complicated than the CUDA code).
6.2.3 Going with CUDA
Having thought about both approaches, the CUDA implementation appeared to be more feasible. It would be cleaner and more effective by allowing a much lower-level approach than MPI. While it would not allow functions like numpy.roll to be used, it would provide a greater speedup by allowing thread-based approaches.
Figure 11: Technical Structure of CUDA
The technical structure of an Nvidia GPU worked quite well with the solution method employed by aeroCuda to solve the Navier-Stokes equations. The structure is detailed in Figure 11. Each Nvidia GPU contains 3 levels of operation: the grid, the block, and the thread. The hierarchy, as shown in the figure, functions as follows:
1. Thread: This is the lowest level of the hierarchy. It functions as a worker for executing the functions and can access local, shared, or global memory. Local memory is private to each thread.
2. Block: This is a group of threads that function together. Blocks are important because shared memory can be accessed by all threads in a block, and it is quicker to read from and write to than global memory. A block can accommodate 32 × 32 threads.
3. Grid: This is a group of blocks that forms the basis of the computational grid. Only global memory exists on this level.
It is also important to recognize that the GPU is separate from the CPU or computing platform, so memory needs to be allocated on the GPU to hold the computed data. The PyCUDA package developed by Klöckner does exactly that and more. Klöckner's gpuarray module allows for the creation of arrays on the GPU that have properties similar to Numpy arrays but also allow operations between arrays to be conducted on the GPU, providing a further speedup. The PyCUDA package allows CUDA to be engaged from a very high level while still using functions optimized for the necessary operations.
The projection method with the immersed solid formulation has 4 explicit steps and 1 implicit step. For the explicit steps, CUDA kernels (functions) can be written to execute them. For the implicit step, FFTs are needed. Nvidia developed the cuFFT package to run FFTs using the CUDA programming structure; to adapt this in Python, the pyFFT package developed by Bogdan Opanchuk creates a binding with PyCUDA to pass gpuarray objects to cuFFT. The following sections describe the programming scheme. It is worth noting that CUDA takes an n-dimensional variable and decomposes it into a 1-d vector, whereby the indexing is carried through at the block and thread level. Therefore, in the following outlines of the CUDA algorithm, all global variables and quantities, while they might be 2-dimensional, are actually 1-dimensional when transferred to the GPU.
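The flattening of a 2-dimensional quantity into a 1-d vector can be sketched in plain Python as follows. This is an illustrative emulation rather than aeroCuda's kernel code: the function name `flat_index`, the tuples `block_dim` and `grid_dim`, and the row-major ordering are assumptions chosen to mirror CUDA's blockIdx, blockDim, and threadIdx.

```python
def flat_index(block_idx, thread_idx, block_dim, n_cols):
    """Map a (block, thread) pair to a 1-d offset into a row-major 2-d field.

    block_idx, thread_idx and block_dim are (x, y) tuples that mirror CUDA's
    blockIdx, threadIdx and blockDim. n_cols is the grid width in points.
    """
    col = block_idx[0] * block_dim[0] + thread_idx[0]
    row = block_idx[1] * block_dim[1] + thread_idx[1]
    return row * n_cols + col


# Emulate launching a 2 x 2 grid of 4 x 4 blocks over an 8 x 8 field.
block_dim = (4, 4)
grid_dim = (2, 2)
n_cols = block_dim[0] * grid_dim[0]

# Collect the 1-d offset that every emulated thread would touch.
visited = sorted(
    flat_index((bx, by), (tx, ty), block_dim, n_cols)
    for by in range(grid_dim[1])
    for bx in range(grid_dim[0])
    for ty in range(block_dim[1])
    for tx in range(block_dim[0])
)

# Every offset of the flattened field should be covered exactly once.
print(visited == list(range(n_cols * n_cols)))
```

With this mapping each thread owns exactly one entry of the flattened field, which is the property the explicit steps rely on.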
Implementing the CUDA-optimized Structure
Implementing the Interpolation
In the case of interpolation, the vectorizing process is avoided entirely. Since n immersed solid points exist, n threads can carry out the interpolation scheme, one thread per point. The parameters for each point (xr, yr, rx, ry, etc.) are calculated in a similar fashion. For the stencils, however, each thread runs a double for-loop that iterates through all of the possible indices. In each iteration, a new φ(x)φ(y) is calculated and multiplied by the relevant grid value, which the thread reads from the field variable (this is stored in global memory, since it must be available to all threads). Once the threads have completed, they write the interpolated values to an n-length vector.
6.3.2 Implementing the Forcing
Implementing the forcing is slightly more complicated than implementing the interpolation. Multiple arrays are needed. In one array, the global IDs of the force projections are stored (the mapping of global IDs is shown in Figure 12). For n points, this array has to contain 16n elements, to ensure that each projection is written to a different location. In addition, two arrays of the same 16n length are created to store the magnitudes of the corresponding forces. In the fourth and fifth arrays, the full force grids are assembled to store the total forcing at each grid point (if a point is not forced, its value is simply 0).
Figure 12: Thread-to-Point Mapping Diagram
In the first step, a double for-loop is run for each immersed solid point. For the ith solid point, the loop writes to the [16i, 16i + 15] indices of the global ID vector and of the corresponding force vectors. Therefore, each thread requires 16 total iterations to store all of its projected forces. The issue now becomes writing to the grid. In CUDA, a common issue is that of thread racing, whereby multiple threads try to write to the same global or shared memory location. If the accesses are not properly sequenced, multiple threads can read or write at the same time, resulting in wrong values being read or written. Therefore, the threads cannot all simply write to the grid at once. Recall, however, that the stencil has 16 unique points; therefore, in the global ID vector, every consecutive group of 16 values, counting from the beginning, refers to 16 different grid locations. Therefore, if 16 threads execute a for-loop of n iterations, then in iteration k the threads [0, 15] read from the [16k, 16k + 15] indices of the corresponding force vector. They then take the [16k, 16k + 15] values of the global ID vector and augment the respective locations on the full forcing grid. The threads are synchronized using the __syncthreads() command to ensure that no thread begins the next iteration early, since it might interfere with the reads and/or writes of the threads still completing the previous iteration.
6.3.3 Implementing the Intermediate Velocity and Final Velocity Calculations
Since both steps involve explicit ﬁnite diﬀerencing, the task is fairly straightforward. Referring to the following ﬁgure depicting the layout of the threads and the computational grid, so long as the number of gridpoints does not exceed the number of threads, every point will have a unique thread assigned to compute its value. Since the values are being stored in new (intermediate or ﬁnal velocity) arrays, there is no issue with race-conditions between threads. Therefore, the crux of the task at hand is to compute the proper ids of those points needed for computing the relevant center point’s value. Since the dimensions of the grids and blocks are set by the user (in addition to the solver parameters), the index can be calculated either blockwise (as done in this code) or row/column-wise; it depends on the user’s preference.
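The id computation described above can be sketched in plain Python for the final velocity step, with each loop iteration standing in for one GPU thread. This is only a sketch under stated assumptions: the names `neighbor_ids` and `correct_velocity` are hypothetical, and the row-major layout and periodic wrap-around at the edges are assumptions rather than details confirmed for aeroCuda's kernels.

```python
def neighbor_ids(idx, n_rows, n_cols):
    """Return flat ids of the left, right, down and up neighbors of idx.

    Assumes a row-major layout and periodic wrap-around at the edges.
    """
    row, col = divmod(idx, n_cols)
    left = row * n_cols + (col - 1) % n_cols
    right = row * n_cols + (col + 1) % n_cols
    down = ((row - 1) % n_rows) * n_cols + col
    up = ((row + 1) % n_rows) * n_cols + col
    return left, right, down, up


def correct_velocity(u_star, v_star, p, n_rows, n_cols, dx, dy, dt):
    """Apply u = u* - dt * dp/dx and v = v* - dt * dp/dy at every grid point."""
    # Results go to new arrays, so no "thread" overwrites data another reads.
    u_new = [0.0] * len(p)
    v_new = [0.0] * len(p)
    for idx in range(len(p)):  # each iteration stands in for one GPU thread
        left, right, down, up = neighbor_ids(idx, n_rows, n_cols)
        u_new[idx] = u_star[idx] - dt * (p[right] - p[left]) / (2.0 * dx)
        v_new[idx] = v_star[idx] - dt * (p[up] - p[down]) / (2.0 * dy)
    return u_new, v_new


# Small check: a pressure field that is linear in x with slope 3.
n_rows, n_cols, dx = 4, 8, 0.25
p = [3.0 * (i % n_cols) * dx for i in range(n_rows * n_cols)]
zeros = [0.0] * len(p)
u_new, v_new = correct_velocity(zeros, zeros, p, n_rows, n_cols, dx, dx, 0.5)
```

Away from the wrap-around columns, the corrected u-velocity equals minus the time step times the pressure slope, and the v-velocity is unchanged because the pressure does not vary in y.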
Results Obtained with aeroCuda
The Eﬀect of Optimization
Loading the code onto the GPU removes a considerable portion of the runtime. The speedups are especially noticeable in the 1st, 2nd, and 4th steps, as shown in Table 3. In the 1st step, substituting the thread-based force projection for the vectorized projection appears to have provided the bulk of the speedup, since the 5th step, where only the interpolation takes place, shows a much smaller speedup. The finite-differencing steps (2 and 4) show a very high speedup as well, especially the 4th step. The discrepancy between these two steps might be due to the total number of global memory reads that must be made: since the 2nd step requires many more variables than the 4th step, it is possible that the variable reads form something of a bottleneck. The actual time and speedup quantities for the simulations are listed in Table 3. For the serial code, the 1st and 2nd steps took the longest, while for the GPU the 1st and 3rd steps took the longest. For the serial code, the cause could be that the 2nd step needs multiple roll functions to evaluate the partial derivatives, resulting in a slowdown. For the 1st step of the serial code, the forcing function was difficult to optimize beyond the vectorization that was done. For aeroCuda, the runtime for the 1st step was large because it involved the for-loop iteration necessary to place all the forces at their respective points on the grid. The Poisson equation, the 3rd step, took the second-longest to execute, yet still provided a good speedup over the 3rd step of the serial code. As an improvement to aeroCuda, a more robust algorithm for transferring forces to the grid would help shave some time off the 1st step.
Table 3: Simulation Speedup for Re 100 Case
Simulation   1st      2nd      3rd      4th      5th
Serial100    0.87s    0.67s    0.28s    0.33s    0.035s
GPU100       0.018s   0.008s   0.014s   0.004s   0.011s
Speedup      48.2     81.1     19.9     89.1     3.13
The results obtained are of the expected magnitude, though they vary slightly from those obtained in other papers. The drag coefficients are shown in Table 4. Multiple sources are used to confirm the tests conducted in this paper. In particular, Henderson's work on the drag around a cylinder shows a graph of the drag coefficient as a function of Re. All of the values fall within the expected regions according to that graph. For numerical confirmation, in the case of the Re 100 cylinder the coefficient of drag obtained by Peskin and Lai is very closely matched by the Re 100 case. In the Re 25 case, the coefficient of drag is on the higher end of the numerical studies presented by Saiki and Biringen, but is backed by other studies.
Table 4: Coefficient of Drag Mean and Standard Deviation
Simulation   Mean    Std     Previous Work
GPU 1000     1.53    0.35    1.5
GPU 100      1.4     0.19    1.44-1.54
GPU 25       2.24    1.07    1.54-2.26
The cylinders were run at the same conditions except velocity and timestep (details are given in Table 5). For the computational parameters, the spacings were $\delta x = \delta y = \frac{1}{128}$, while the density was $\rho = 1$. The following table outlines the time-stepping, kinematic viscosity, and velocity parameters for the different simulations:
Table 5: Simulation Parameters
Simulation   ux      δt       ν
Re 1000      1       0.0001   0.0003
Re 100       1       0.001    0.003
Re 25        0.25    0.001    0.003
As outlined in the notes of Tryggvason, for the projection method implemented the CFL condition was
$$\delta t < \frac{2\nu}{|u|^2}.$$
Given these constraints, the Re 25 and 100 cases can be executed at the same parameters. For the Re 1000 case, since the kinematic viscosity is 10x lower than in the Re 25 and 100 cases, the time step must be lowered significantly in order to get the best result. A timestep of $10^{-4}$ was satisfactory to get a good coefficient of drag.
In a conversation with Karl Helfrich, a CFD scientist at Woods Hole Oceanographic Institution, two issues with the present code were noted. First, the method used to solve these problems is known as a direct numerical simulation; in effect, no approximations are used and the refinement of the spacing (spatial and temporal) is used to obtain solutions. Not only is this costly, but it cannot be used for all problems. Second, the projection method employed is purely first-order in time, and a more accurate advancement of the solution would most likely help both stability and accuracy.
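The time-step constraint quoted from Tryggvason's notes can be checked against the entries of Table 5. The following is a minimal sketch, taking the constraint to read δt < 2ν/|u|^2; the dictionary layout and the helper name `dt_limit` are illustrative choices and are not part of aeroCuda.

```python
# (u_x, dt, nu) for each cylinder case, copied from Table 5.
cases = {
    "Re 1000": (1.0, 0.0001, 0.0003),
    "Re 100": (1.0, 0.001, 0.003),
    "Re 25": (0.25, 0.001, 0.003),
}


def dt_limit(nu, speed):
    """Upper bound on the time step from dt < 2 * nu / |u|^2."""
    return 2.0 * nu / speed ** 2


# Largest admissible time step for each case, and whether Table 5 respects it.
limits = {name: dt_limit(nu, u) for name, (u, dt, nu) in cases.items()}
satisfied = {name: cases[name][1] < limits[name] for name in cases}

for name in cases:
    print(f"{name}: dt = {cases[name][1]:g}, limit = {limits[name]:.4g}, "
          f"ok = {satisfied[name]}")

# The Re 100 time step would violate the bound if reused for Re 1000.
reuse_ok = cases["Re 100"][1] < limits["Re 1000"]
print("Re 100 time step usable at Re 1000:", reuse_ok)
```

Under this reading of the constraint, all three cases satisfy the bound, and the check also shows why the Re 1000 case could not reuse the larger time step of the other two.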
Expected Physical Phenomena and Further Validation
In figure 13, there is no vortex shedding occurring in the case of the cylinder at Re 25. This is because the phenomenon, known as Von Karman shedding, does not occur until about Re 50. The contours obtained here are also present in the paper by Saiki and Biringen, where the authors also simulate a Re 25 flow around a cylinder. In the cases of the cylinders at Re 100 and Re 1000 (figures 9 and 10, respectively), Von Karman shedding takes place. In particular, the vortices obtained in the Re 100 case take on a 'pointy' shape similar to that of the vortices in Peskin and Lai's paper for Re 150. The immersed solid method has the points traveling at the prescribed 1 m/s in both figures, as the velocity magnitude matches that specified by the color bar on the right-hand side of both plots. The velocity magnitude was taken to be: $|u_{mag}| = (u^2 + v^2)^{1/2}$.
A Closer Look at the Physical Response of the Immersed Solid
Looking at the drag forces in Figure 15, the oscillatory motion of the immersed solid points is evidenced by the nature of the drag forces. It should be noted that the forces listed here are forces per unit length, as the solution is 2-dimensional and not 3-dimensional. While this graph focuses on the converged portion of the drag force, initially one can expect to see a damped harmonic oscillator response from the system, in which both the drag and lift forces gradually converge to their steady-state value(s). To measure the oscillations' magnitudes, the means and the standard deviations are computed to provide a better understanding of the steady-state behavior. In the case of the cylinder at Re 25, the force can be taken to lie in the 0.02-0.025 range, while for the Re 100 and 1000 cases the force lies in the 0.2-0.24 range. The forces should not be very different for these last two cases, as their coefficients of drag are relatively close to each other.
Physical Location of the Immersed Solid Points
Looking at Figure 16, the colorbars indicate the magnitude of the displacement of the solid points from their prescribed counterparts. In the diagrams, the point dispersion goes from best to worst in the order Re 25, 1000, and 100. This should be expected, as the Re 25 case faces the lowest velocity, while in the Re 1000 case a very high spring constant is used. In the case of the Re 100 cylinder, on the upper and lower surfaces of the cylinder that break the flow (right side of the cylinder), the points have shifted more than in the left half of the cylinder. While they might be moving along at the proper velocity, in moving out of position they might have slightly affected the expected drag value by applying forces to the fluid from their shifted positions. Since the coefficient of drag for the Re 100 case matched that achieved in other papers, the effect was negligible. Looking at all cases, no point was more than 1/2 of a gridspace width away from its intended position. This provides more confirmation that the flow was properly matched and that the object's structure held throughout the simulations, preventing a distortion of the flow around its surface.
Test Case: Swimmer in Glide Position
One of the underlying motivations behind developing this code was to apply it to problems involving biological motion. In the study conducted by Von Loebbecke et al., the authors analyze the flow around a swimmer performing the dolphin kick in 3 dimensions. The study is adapted to fit the present capabilities of this solver: the case of a swimmer in the glide position at constant velocity in 2 dimensions. The geometrical figure of the swimmer was obtained from the paper via an image-tracing mechanism in MATLAB. The outline was then provided to the aeroCuda code along with the simulation parameters.
The swimmer obtained from the paper was scaled to about 1.7m in length. The max width of the swimmer was 0.23m. The outline consisted of 7998 points when it was taken from the paper. After being submitted to the cloaking module with a spacing of 1/128, the final point cloud is shown in Figure 18. The grid size was set to 512 x 4096, with the spacing in both dimensions set to 1/128. The kinematic viscosity was set to $3 \times 10^{-4}$ with the density set to $\rho = 1000$. The timestep was $10^{-4}$ s, well within the CFL condition range. The general spring constant was set to $5 \times 10^{7}$, with a slope of 0.5 for the variable spring model. Figure 18 details the spring constants at all of the points around the body. Concerning the motion of the swimmer, the velocity was set to 1 m/s in the positive x-direction.
The flow around the body of the swimmer is very similar to that around an airfoil, perhaps due to the streamlined nature of the body; the Reynolds number of the simulation was 5666.7. Two major confirmations of the solution are presented, the plots for which can be found in Figures 19-21. First, the shift visible in the immersed solid points is minimal; the largest separation between a solid point and its prescribed point is less than one gridspace. Second, the forces felt by the swimmer oscillate at steady levels; the drag force is concentrated around 60-70, while the lift force shows sinusoidal oscillations with a steady period. If the points have not shifted much, then the integrity of the body is preserved and the flow around it can be deemed rather accurate. The force diagrams confirm this, since if the points had shifted dramatically the forces would not stabilize around a mean value. Looking at the flow around the swimmer, the shedding of vortices is continuous at this stage, showing that steady state has been achieved.
Reynolds Number Transition
To further expand the analysis of the swimmer, the flow patterns in different Reynolds number regimes are examined (found in Figure 22). These simulations are done by varying the kinematic viscosity of the problem at hand, the values of which are listed below. However, with higher Reynolds number problems, the κ of the springs as well as the timesteps must be changed to accommodate the changing nature of the problem. Increasing κ ensures that points do not shift at the higher Reynolds number flows, while decreasing the timestep allows for much more accuracy than larger timesteps would, in addition to satisfying the CFL condition.
Table 6: Swimmer at Different Reynolds Numbers
Flow     Reynolds Number   ν         δt       κ
Low      566.7             3x10^-3   10^-3    10^7
Medium   5666.7            3x10^-4   10^-4    5x10^7
High     56666.7           3x10^-5   10^-5    10^8
At a low Reynolds number, the boundary layer should remain intact, which it does in the simulation run. At the medium Reynolds number, however, the shedding of vortices and a thinner boundary layer are to be expected, as the fluid is not as viscous. In the high Reynolds number case, the boundary layer separates and vortices are shed not just at the feet of the swimmer (as in the medium Reynolds number case) but also along the body. These diagrams were obtained at 5s for the low and medium cases, and at 4s for the high case. Due to cluster compute time issues, it was difficult to run long simulations, as holding a position for longer than 4 hours resulted in dismissal from the cluster.
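The Reynolds numbers listed in Table 6 can be reproduced from the parameters of the swimmer test case. The sketch below assumes that the Reynolds number is formed as Re = U L / ν with the 1.7 m body length as the length scale and the 1 m/s glide velocity as the speed; that choice of length scale is an inference from the quoted values, not something stated explicitly in the text.

```python
body_length = 1.7  # m, swimmer length quoted in the test-case description
speed = 1.0        # m/s, prescribed glide velocity

# Kinematic viscosities for the low, medium and high cases in Table 6.
viscosities = {"Low": 3e-3, "Medium": 3e-4, "High": 3e-5}

# Re = U * L / nu for each case.
reynolds = {name: speed * body_length / nu for name, nu in viscosities.items()}

for name, value in reynolds.items():
    print(f"{name}: Re = {value:.1f}")
```

Each tenfold reduction in viscosity raises the Reynolds number tenfold, which is the only parameter change that distinguishes the three flow regimes before κ and the timestep are adjusted.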
Numerical Improvements to aeroCuda
While the aeroCuda code provides good accuracy and efficiency, it can be optimized in a few critical areas that can unlock its potential as a strong CFD code. The first of these is the numerical methods, including the algorithms behind the solutions and the governing principles of fluid dynamics. The first numerical improvement is the updating of the projection method implemented here to a more numerically accurate projection method. In particular, the algorithm here is first-order in time and second-order in space; however, in his notes Tryggvason presents a fourth-order (in time) Runge-Kutta method that was developed by Weinan E. The difference between the methods should be noticeable at high Reynolds numbers, where the temporal discretization makes a difference. To improve the spatial accuracy, implementing higher-order finite-difference approximations would be a good start. A potential issue with higher-order numerical methods might be their stability; when the projection method was first implemented, higher-order approximations were tried for both the first and second derivatives. However, implementing anything greater than second-order accurate methods resulted in a deterioration of the solution. Expanding the delta function stencil might help, and would also provide more accurate forcing of the fluid around the immersed solid points. The delta function itself can also be improved in order of accuracy, which should help develop a more accurate forcing function.
Technical Improvements to aeroCuda
The numerical improvements above may prove difficult to code if the efficiency and optimization of aeroCuda's runtime are taken into consideration. For example, in order to increase numerical accuracy more gridpoints would have to be worked with; this means that the number of global memory reads would increase significantly. In the force projection portion of step one, a large for-loop is executed by 16 threads to add forces to their respective points. These are just some areas which result in slowdowns to the code; if their execution times can be reduced by 30-40
Capability Enhancements to aeroCuda
There are two areas which should be the next focus for aeroCuda development: expansion to 3-d and video processing. In the case of the expansion to 3-d, two issues manifest: memory transfer and thread execution. In 3-d, there will be a massive increase in the amount of memory consumed, simply because the domain is being extended to another dimension. Therefore, the transfer times of large chunks of memory would be very high (a cubic grid with 1000 gridpoints per dimension holds 10^9 values, so 8 GB would be needed for each array of 8-byte floats). Moreover, for 1 billion threads to execute, there would be a noticeable increase in time; large quantities of threads do take more time to execute. Perhaps a combination of MPI and CUDA could be used to execute the problem; however, data would have to be transferred to and from the GPU at every timestep. For video processing, there are probably multiple ways to implement this, the most direct being output of variables at every timestep. However, this would result in a massive and unrealistic memory requirement, especially with a large grid. If the GPU could be worked with at a deeper level, perhaps video processing could be made part of the solver process; if it results in a slower runtime, however, then perhaps it would not be wise to pursue this feature.
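The memory figure quoted for the 3-d case follows from simple arithmetic, sketched below. The helper name `field_bytes` is hypothetical, and the assumption that each value occupies 8 bytes (a 64-bit float) is what reproduces the 8 GB figure; with 4-byte values the requirement per field would halve.

```python
def field_bytes(shape, bytes_per_value):
    """Bytes needed to store one field with the given grid shape."""
    count = 1
    for extent in shape:
        count *= extent
    return count * bytes_per_value


GB = 10 ** 9

# One field on a cubic grid with 1000 gridpoints per dimension.
cubic_double = field_bytes((1000, 1000, 1000), 8) / GB
cubic_single = field_bytes((1000, 1000, 1000), 4) / GB

# For comparison, one field on the 512 x 4096 grid of the 2-d swimmer case.
swimmer_2d = field_bytes((512, 4096), 8) / GB

print(f"3-d, 8-byte values: {cubic_double:.1f} GB per field")
print(f"3-d, 4-byte values: {cubic_single:.1f} GB per field")
print(f"2-d swimmer grid:   {swimmer_2d:.4f} GB per field")
```

Since the solver holds several fields at once (velocities, pressure, and forcing), the total requirement would be a multiple of the per-field figure.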
The immersed solid implementation developed in this paper proves to be reliable for the cases demonstrated here. It confirmed the observations made in other papers; for situations at high Reynolds numbers, large force fluctuations were observed, but the model itself appears to have worked well. This theory developed by Peskin can truly unlock the potential for efficient CFD simulations of transient flow problems; it has been demonstrated in others' works just as it has in this one. With an improvement to the numerical method and technical specifications of aeroCuda, it is hoped that this code could be of great use to researchers and students alike. Computational fluid dynamics is a difficult field, but one which still holds many secrets to be unlocked. Hopefully, aeroCuda will help shed light on some of them in the future.
To develop aeroCuda, the Enthought Python distribution was used. This license is free for students and academic organizations but requires a fee for those in industry to obtain. The Nvidia GPU used for these computations was a Tesla C2070, which retails for 2111.85 dollars at SabrePC. This GPU was accessed through the Resonance cluster at the Harvard School of Engineering and Applied Sciences. Therefore, the total budget of the project was 0 dollars.
With an expansion to a 3-d case, the memory transfer times will drastically increase. As a result, GPUs that are capable of transferring larger amounts of data at very low runtime cost should be sought out. The author's knowledge of the GPU market is limited, but more technical users might be able to target an optimal GPU for aeroCuda to operate on.
Solving the Immersed Solid-inﬂuenced Navier-Stokes Equations
Step 1: Force Projection 
Let us begin by defining our prescribed points as $(x_p, y_p)$. These points are given some analytical function $(u_p(t), v_p(t))$ by the user that defines their motion through space. We then define the Eulerian co-points as $(x_b, y_b)$. These points retrieve their motion $(u_b, v_b)$ from the velocities that are calculated on the grid. To obtain these velocities, Peskin's delta function method is used. In the delta function stencil, a reference point must be chosen for each boundary point. On the stencil, the reference point is located at the [0, 0] location, with stencil indices in both the x- and y-directions defined over the range [-1, 2]. Assuming the spacing is the same along the x-axis and y-axis, the reference gridpoint's actual grid position is obtained by:
$$x_r = \left\lfloor \frac{x_b}{\delta x} \right\rfloor, \qquad y_r = \left\lfloor \frac{y_b}{\delta y} \right\rfloor$$
We then obtain the displacements between $(x_r, y_r)$ and $(x_b, y_b)$:
$$r_x = \frac{x_b}{\delta x} - x_r, \qquad r_y = \frac{y_b}{\delta y} - y_r$$
The delta function stencil is 4 × 4 in size. Each position on it is a function of $(r_x, r_y)$:
$$\delta(x, y) = \phi(r_x)\phi(r_y)$$
The purpose of the delta function is to blend the value of the force to the surrounding gridpoints or to derive the velocity at a certain point from the surrounding grid point velocities. This is integral to the formulation, as it allows information (such as force and velocity) to be transferred between the immersed solid points and the gridpoints. The phi function is taken to be:
• $\phi(r + 1) = \frac{6 - 4r}{8} - \frac{3 - 2r + \sqrt{1 + 4r - 4r^2}}{8}$
• $\phi(r) = \frac{3 - 2r + \sqrt{1 + 4r - 4r^2}}{8}$
• $\phi(r - 1) = \frac{4r - 2}{8} + \frac{3 - 2r + \sqrt{1 + 4r - 4r^2}}{8}$
• $\phi(r - 2) = \frac{1}{2} - \frac{3 - 2r + \sqrt{1 + 4r - 4r^2}}{8}$
The stencil is given in Table 7. Define $xR = \phi(r_x)$ and $yR = \phi(r_y)$. The top row of indices is j and the leftmost column of indices is i.
Table 7: Delta Stencil
Index | j = -1 | j = 0 | j = 1 | j = 2
i = -1 | $(\frac{6-4r_y}{8} - yR)(\frac{6-4r_x}{8} - xR)$ | $(\frac{6-4r_y}{8} - yR)\,xR$ | $(\frac{6-4r_y}{8} - yR)(\frac{4r_x-2}{8} + xR)$ | $(\frac{6-4r_y}{8} - yR)(0.5 - xR)$
i = 0 | $(\frac{6-4r_x}{8} - xR)\,yR$ | $xR\,yR$ | $yR\,(\frac{4r_x-2}{8} + xR)$ | $yR\,(0.5 - xR)$
i = 1 | $(\frac{6-4r_x}{8} - xR)(\frac{4r_y-2}{8} + yR)$ | $xR\,(\frac{4r_y-2}{8} + yR)$ | $(\frac{4r_x-2}{8} + xR)(\frac{4r_y-2}{8} + yR)$ | $(\frac{4r_y-2}{8} + yR)(0.5 - xR)$
i = 2 | $(\frac{6-4r_x}{8} - xR)(0.5 - yR)$ | $xR\,(0.5 - yR)$ | $(\frac{4r_x-2}{8} + xR)(0.5 - yR)$ | $(0.5 - yR)(0.5 - xR)$
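The entries of the Delta Stencil table can be generated programmatically, since each entry is the product of a factor in $r_y$ (selected by the row index i) and a factor in $r_x$ (selected by the column index j). The following is a minimal sketch; the function names `weights_1d` and `delta_stencil` are illustrative and are not taken from aeroCuda.

```python
import math


def weights_1d(r):
    """1-d weights for stencil offsets -1, 0, 1, 2, given 0 <= r < 1."""
    phi_r = (3.0 - 2.0 * r + math.sqrt(1.0 + 4.0 * r - 4.0 * r * r)) / 8.0
    return [
        (6.0 - 4.0 * r) / 8.0 - phi_r,  # offset -1
        phi_r,                          # offset  0
        (4.0 * r - 2.0) / 8.0 + phi_r,  # offset  1
        0.5 - phi_r,                    # offset  2
    ]


def delta_stencil(rx, ry):
    """4 x 4 table s[i + 1][j + 1]; rows follow i (y), columns follow j (x)."""
    wx = weights_1d(rx)
    wy = weights_1d(ry)
    return [[wy_i * wx_j for wx_j in wx] for wy_i in wy]


stencil = delta_stencil(0.25, 0.6)

# The 16 coefficients sum to one, so interpolating a constant field
# returns that constant and a projected force is conserved on the grid.
total = sum(sum(row) for row in stencil)
print(abs(total - 1.0) < 1e-12)
```

Because the four 1-d weights always sum to one, the full table does as well, which is a convenient sanity check when transcribing the entries.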
We then obtain the interpolated values for $u_b, v_b$ using the stencil coefficients and the $u, v$ of the surrounding gridpoints. Denote the above stencil as the function $s(i, j, r_x, r_y)$:
$$u_b = \sum_{i=-1}^{2} \sum_{j=-1}^{2} u_{(y_r + i,\, x_r + j)}\, s(i, j, r_x, r_y)$$
$$v_b = \sum_{i=-1}^{2} \sum_{j=-1}^{2} v_{(y_r + i,\, x_r + j)}\, s(i, j, r_x, r_y)$$
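The interpolation sums can be sketched as a double loop over the stencil indices. This is an illustrative sketch only: the names `reference_point` and `interpolate` are hypothetical, the field is stored as a list of rows indexed as `field[row][col]`, the floor operation used to locate the reference point is an assumption, and no handling of points near the domain boundary is included. The example stencil is the one for $r_x = r_y = 0$, whose 1-d weights are 0.25, 0.5, 0.25, and 0.

```python
import math


def reference_point(xb, dx):
    """Return the reference index x_r and the offset r_x in [0, 1)."""
    xr = math.floor(xb / dx)
    return xr, xb / dx - xr


def interpolate(field, stencil, xr, yr):
    """Sum field[yr + i][xr + j] * s(i, j) over i, j = -1..2."""
    total = 0.0
    for i in range(-1, 3):
        for j in range(-1, 3):
            total += field[yr + i][xr + j] * stencil[i + 1][j + 1]
    return total


# A field that varies linearly with the column and row indices.
field = [[2.0 * col + 3.0 * row + 1.0 for col in range(8)] for row in range(8)]

# Stencil for a solid point that sits exactly on a gridpoint (r_x = r_y = 0).
w = [0.25, 0.5, 0.25, 0.0]
stencil = [[wy * wx for wx in w] for wy in w]

xr, rx = reference_point(1.5, 0.5)
yr, ry = reference_point(2.0, 0.5)
u_b = interpolate(field, stencil, xr, yr)
print(xr, yr, u_b)
```

For a point sitting on a gridpoint, the interpolated value of a linear field equals the field value at that gridpoint, since the weights on either side balance.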
Once we have the velocities, we can calculate the total force. In the harmonic oscillator function developed by Saiki and Biringen, κ is the spring constant and β is the damping coefficient. The motion $u_p, v_p$ is known to us, as we prescribed it. Therefore, we have for the kth boundary point:
$$F_{x,k} = \kappa(x_{p,k} - x_{b,k}) - \beta(u_{b,k} - u_{p,k})$$
$$F_{y,k} = \kappa(y_{p,k} - y_{b,k}) - \beta(v_{b,k} - v_{p,k})$$
With the force per point calculated, we now need to project it to the surrounding gridpoints. This operation is done in a manner similar to the interpolation step. Instead of aggregating the values through the summation step, though, the values are added to their respective locations on a grid. Therefore, assume that $fx, fy$ represent the force field terms in both dimensions in the modified Navier-Stokes equations. We initialize them to 0 at each iteration, and then do the following for the kth boundary point, for each stencil index pair $(i, j)$:
$$fx_{y_{r,k} + i,\, x_{r,k} + j} \mathrel{+}= F_{x,k}\, s(i, j, r_x, r_y)$$
$$fy_{y_{r,k} + i,\, x_{r,k} + j} \mathrel{+}= F_{y,k}\, s(i, j, r_x, r_y)$$
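The force evaluation and its projection onto the grid can be sketched together as follows. The function names `spring_force` and `project_force` are hypothetical, the grid is stored as a list of rows, and the sample values of κ, β, positions, and velocities are made up purely for illustration. The stencil used is the one for $r_x = r_y = 0$, and two points with overlapping stencils are projected to show why the contributions are accumulated rather than assigned.

```python
def spring_force(kappa, beta, x_p, x_b, u_p, u_b):
    """One force component: kappa * (x_p - x_b) - beta * (u_b - u_p)."""
    return kappa * (x_p - x_b) - beta * (u_b - u_p)


def project_force(force_grid, stencil, xr, yr, force):
    """Add force * s(i, j) to force_grid[yr + i][xr + j] for i, j = -1..2."""
    for i in range(-1, 3):
        for j in range(-1, 3):
            force_grid[yr + i][xr + j] += force * stencil[i + 1][j + 1]


# Stencil for r_x = r_y = 0 (1-d weights 0.25, 0.5, 0.25, 0).
w = [0.25, 0.5, 0.25, 0.0]
stencil = [[wy * wx for wx in w] for wy in w]

# Illustrative (made-up) values for two neighboring boundary points.
f1 = spring_force(10.0, 2.0, 1.0, 0.9, 0.5, 0.4)
f2 = spring_force(10.0, 2.0, 1.5, 1.45, 0.5, 0.5)

# The force grid is reset to zero before the projections are accumulated.
fx = [[0.0] * 8 for _ in range(8)]
project_force(fx, stencil, 3, 4, f1)
project_force(fx, stencil, 4, 4, f2)  # overlaps the first point's stencil

grid_total = sum(sum(row) for row in fx)
print(f1, f2, grid_total)
```

Because the stencil coefficients sum to one, the total force deposited on the grid equals the sum of the point forces, even where the stencils overlap.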
Now we move to solving the equations via the projection method.
11.1.2 Step 2: Calculating the intermediate velocity field
Previously, we established the force fields interacting with the main equations as $fx, fy$. Therefore, we now need to solve those equations. Let us define the primary field quantities:
• $\mathbf{u}^n = \langle u^n, v^n \rangle$ = Primary Velocity Fields at time n
• $\mathbf{u}^* = \langle u^*, v^* \rangle$ = Intermediate Velocity Fields
• $\mathbf{u}^{n+1} = \langle u^{n+1}, v^{n+1} \rangle$ = Final Velocity Fields
• $p$ = Pressure Field
• $\mathbf{f} = \langle fx, fy \rangle$ = Forcing Fields
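The fully-modified Navier-Stokes equations are given by:
• $\frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u}^n \cdot \nabla)\mathbf{u}^n = -\nabla p + \nu \nabla^2 \mathbf{u}^n + \mathbf{f}$
• $\nabla \cdot \mathbf{u}^n = 0$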
We begin by decomposing the equation via finite differences. The time derivative is represented through forward Euler, and all other derivatives are obtained through a second-order central-difference scheme. The first and second derivatives, when evaluated via centered differencing, are given as:
$$\frac{\partial q_{i,j}}{\partial x} = \frac{q_{i,j+1} - q_{i,j-1}}{2\delta x}, \qquad \frac{\partial^2 q_{i,j}}{\partial x^2} = \frac{q_{i,j+1} - 2q_{i,j} + q_{i,j-1}}{(\delta x)^2}$$
Note that the same procedure follows for the y-axis derivatives, with a change in the axis of differencing and the magnitude of spacing. Applying these operators to the modified momentum equation, we obtain the following breakdown of the terms:
• Time Derivative: $\frac{u^{n+1}_{i,j} - u^n_{i,j}}{\delta t}$
• Viscous Derivative: $\nu\left(\frac{u^n_{i,j+1} - 2u^n_{i,j} + u^n_{i,j-1}}{\delta x^2} + \frac{u^n_{i+1,j} - 2u^n_{i,j} + u^n_{i-1,j}}{\delta y^2}\right)$
One term remains: the convective derivative. With a basic centered-differencing scheme it would be given as:
$$u^n_{i,j}\, \frac{u^n_{i,j+1} - u^n_{i,j-1}}{2\delta x} + v^n_{i,j}\, \frac{u^n_{i+1,j} - u^n_{i-1,j}}{2\delta y}$$
The above equation also applies for the v-field. Mattheus Ueckermann of MIT was consulted on the oscillations observed when using the above centered difference scheme in the code. His explanation of the issue was that in the advection equation, a centered difference scheme does not allow information in the direction of the flow to be transmitted properly. For example, if the flow is negative, we need the value of the flux between cells j and j+1, as opposed to cells j-1 and j+1, which do not necessarily average to the proper value. Therefore, the centered differencing operation was replaced by the following one-sided scheme:
$$u^n_{i,j}\, \frac{u^n_{i,j+1} - u^n_{i,j}}{\delta x}, \quad u^n_{i,j} < 0$$
$$u^n_{i,j}\, \frac{u^n_{i,j} - u^n_{i,j-1}}{\delta x}, \quad u^n_{i,j} > 0$$
The above equations also apply for the v-field. The idea is to look upwind if positive advection and downwind if negative advection. However, because this is a first-order approximation, the accuracy is not very good. To improve upon this, the CFD-Wiki website was consulted for the QUICK (Quadratic Interpolation for Convective Kinematics) formulation. The idea behind this implementation is that, instead of relying on two points to find the derivative as in the centered differencing operations, 4 are used. If positive advection, 2 upwind points and 1 downwind point are used along with the center point; if negative advection, 2 downwind points and 1 upwind point are used along with the center point. By applying the QUICK algorithm, the following formula results for the convective derivative:
Convective Derivative:
$$u^n_{i,j}\, \frac{0.375u^n_{i,j+1} + 0.375u^n_{i,j} - 0.875u^n_{i,j-1} + 0.125u^n_{i,j-2}}{\delta x}, \quad u^n_{i,j} > 0$$
$$u^n_{i,j}\, \frac{-0.125u^n_{i,j+2} + 0.875u^n_{i,j+1} - 0.375u^n_{i,j} - 0.375u^n_{i,j-1}}{\delta x}, \quad u^n_{i,j} < 0$$
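The QUICK formula for the convective derivative can be sketched for a single row of the u-field as follows. This is a minimal illustration with a hypothetical function name `quick_convective`; it operates on a plain list, performs no boundary handling, and treats a zero velocity like the negative branch (where the result is zero in either case).

```python
def quick_convective(u, j, dx):
    """Return u_j * du/dx at index j using the direction-dependent formula."""
    if u[j] > 0:
        dudx = (0.375 * u[j + 1] + 0.375 * u[j]
                - 0.875 * u[j - 1] + 0.125 * u[j - 2]) / dx
    else:
        dudx = (-0.125 * u[j + 2] + 0.875 * u[j + 1]
                - 0.375 * u[j] - 0.375 * u[j - 1]) / dx
    return u[j] * dudx


# Sample a quadratic profile u = x^2 on a grid with spacing 0.5.
dx = 0.5
u_pos = [(k * dx) ** 2 for k in range(6)]
u_neg = [-value for value in u_pos]

# At x = 1.5 the exact value of u * du/dx is 2.25 * 3 = 6.75 for both signs.
conv_pos = quick_convective(u_pos, 3, dx)
conv_neg = quick_convective(u_neg, 3, dx)
print(conv_pos, conv_neg)
```

Both branches differentiate a quadratic profile exactly, which reflects the quadratic interpolation underlying the scheme and is why it improves on the first-order one-sided difference.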
The above equations also apply for the v-field. By incorporating more points into the analysis, a more accurate and stable solution is obtained. Therefore, the QUICK formulation was used for evaluating the convective term. In the projection method, an intermediate velocity is inserted into the time derivative to isolate the pressure term in its own equation. Therefore, two equations are developed:
$$u^*_{i,j} = u^n_{i,j} + \delta t\,(-\text{Convective Derivative} + \text{Viscous Derivative} + fx)$$
$$\frac{u^{n+1}_{i,j} - u^*_{i,j}}{\delta t} = -\left(\frac{\partial p}{\partial x}\right)_{i,j}$$
The same equations exist for the v-velocity field. The first equation is purely explicit and can be solved by decomposition through finite differences. The second equation is implicit and will yield two more equations to be solved in the subsequent steps.
11.1.3 Step 3: Calculating the Pressure Field
To solve the linking equation,
$$\frac{\mathbf{u}^{n+1} - \mathbf{u}^*}{\delta t} = -\nabla p$$
the continuity equation is introduced and used to generate a pressure field that imposes the divergence-free condition. The divergence operator is applied to the linking equation:
$$\nabla \cdot \left(\frac{\mathbf{u}^{n+1} - \mathbf{u}^*}{\delta t}\right) = -\nabla^2 p$$
By the divergence condition, $\nabla \cdot \mathbf{u}^{n+1} = 0$. Therefore, the following Poisson equation is obtained:
$$\nabla^2 p = \frac{\nabla \cdot \mathbf{u}^*}{\delta t}$$
We now need to solve for the pressure, p. The right hand side may be computed explicitly; call it U. Express p and U in terms of their Fourier transforms:
$$p(\theta, \phi) = \sum_{n,m} p_{n,m}\, e^{\imath(n\theta + m\phi)}, \qquad U(\theta, \phi) = \sum_{n,m} U_{n,m}\, e^{\imath(n\theta + m\phi)}$$
where $\theta = \frac{2\pi x}{L}$ and $\phi = \frac{2\pi y}{L}$. Then, taking the second derivative with respect to θ and φ,
$$\sum_{n,m} (-n^2 - m^2)\, p_{n,m}\, e^{\imath(n\theta + m\phi)}$$
and equating Fourier modes with $U_{n,m}$ yields (for L = 1)
$$4\pi^2(-n^2 - m^2)\, p_{n,m} = U_{n,m} \quad\Longrightarrow\quad p_{n,m} = \frac{-U_{n,m}}{4\pi^2(n^2 + m^2)}$$
and the pressure p is given by the inverse Fourier transform of the right hand side. Thus, we simply need to compute the 2-d FFT of U, divide by the matrix of corresponding coefficients, and compute the 2-d iFFT (inverse FFT) to get the matrix for p.
11.1.4 Step 4: Calculating the Final Velocity fields
Now that the pressure field p and velocity field $\mathbf{u}^*$ have been calculated, the final velocity, as given by Tryggvason, can be obtained:

$$\mathbf{u}^{n+1} = \mathbf{u}^* - \delta t\, \nabla p$$
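The FFT-based pressure solve of Step 3 and the velocity correction above can be sketched together in NumPy. This is an illustrative sketch under stated assumptions, not the aeroCuda code: the function names are hypothetical, the domain is taken to be a periodic square of side `L`, and centered differences are used for both the divergence and the pressure gradient.

```python
import numpy as np

def divergence(u, v, dx, dy):
    """Centered-difference divergence; axis 1 (j) is x, axis 0 (i) is y."""
    dudx = (np.roll(u, -1, axis=1) - np.roll(u, 1, axis=1)) / (2.0 * dx)
    dvdy = (np.roll(v, -1, axis=0) - np.roll(v, 1, axis=0)) / (2.0 * dy)
    return dudx + dvdy

def solve_pressure_fft(U, L):
    """Solve laplacian(p) = U on a periodic L x L square with the 2-d FFT."""
    ny, nx = U.shape
    n = np.fft.fftfreq(nx, d=1.0 / nx)  # integer mode numbers along x
    m = np.fft.fftfreq(ny, d=1.0 / ny)  # integer mode numbers along y
    N, M = np.meshgrid(n, m)            # both of shape (ny, nx)
    coeff = (2.0 * np.pi / L) ** 2 * (N**2 + M**2)
    coeff[0, 0] = 1.0                   # avoid 0/0; this mode is reset below
    p_hat = -np.fft.fft2(U) / coeff
    p_hat[0, 0] = 0.0                   # mean pressure is arbitrary: use zero
    return np.real(np.fft.ifft2(p_hat))

def correct_velocity(u_star, v_star, p, dt, dx, dy):
    """u^{n+1} = u* - dt * grad(p), with centered differences for grad(p)."""
    dpdx = (np.roll(p, -1, axis=1) - np.roll(p, 1, axis=1)) / (2.0 * dx)
    dpdy = (np.roll(p, -1, axis=0) - np.roll(p, 1, axis=0)) / (2.0 * dy)
    return u_star - dt * dpdx, v_star - dt * dpdy
```

A typical call sequence would be `p = solve_pressure_fft(divergence(u_star, v_star, dx, dy) / dt, L)` followed by `correct_velocity(u_star, v_star, p, dt, dx, dy)`. Because the Laplacian is inverted spectrally while the divergence and gradient use centered differences, the corrected field in this sketch is divergence-free only up to the discretization error of those differences.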
In discrete form, the prior equation is decomposed via centered finite differences, as in Step 2:

$$u^{n+1}_{i,j} = u^*_{i,j} - \delta t\, \frac{p_{i,j+1} - p_{i,j-1}}{2\,\delta x}$$

$$v^{n+1}_{i,j} = v^*_{i,j} - \delta t\, \frac{p_{i+1,j} - p_{i-1,j}}{2\,\delta y}$$

11.1.5 Step 5: Interpolation and Velocity
We follow the same interpolation procedure used in Step One to obtain the velocities.
$$u_b = \sum_{i}\sum_{j} u_{(y_r+i,\,x_r+j)}\; s(i, j, r_x, r_y)$$

$$v_b = \sum_{i}\sum_{j} v_{(y_r+i,\,x_r+j)}\; s(i, j, r_x, r_y)$$
Using the forward Euler method, following Peskin, we update the positions of the solid and the prescribed points:
• $x_b^{n+1} = x_b^n + u_b^n\, \delta t$
• $y_b^{n+1} = y_b^n + v_b^n\, \delta t$
• $x_p^{n+1} = x_p^n + u_p^n\, \delta t$
• $y_p^{n+1} = y_p^n + v_p^n\, \delta t$
We now have the new locations of the points and can proceed to the next iteration of our solution.
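The interpolation and forward Euler update of Step 5 can be sketched as below. Several things here are assumptions made for illustration: $(x_r, y_r)$ is read as the base grid index of a point and $(r_x, r_y)$ as its fractional offset within the cell, a simple bilinear weight stands in for the weight function s (which is not the kernel used in the thesis), and the grid is treated as periodic. All function names are hypothetical.

```python
import numpy as np

def bilinear_weight(i, j, rx, ry):
    """Stand-in for s(i, j, rx, ry): bilinear weights for offsets i, j in {0, 1}."""
    wx = rx if j == 1 else 1.0 - rx  # j offsets the x index
    wy = ry if i == 1 else 1.0 - ry  # i offsets the y index
    return wx * wy

def interpolate(field, x, y, dx, dy, s=bilinear_weight, offsets=(0, 1)):
    """Weighted sum of the grid values surrounding the point (x, y)."""
    xr, yr = int(np.floor(x / dx)), int(np.floor(y / dy))  # base grid indices
    rx, ry = x / dx - xr, y / dy - yr                      # fractional offsets
    ny, nx = field.shape
    total = 0.0
    for i in offsets:
        for j in offsets:
            # Periodic wrap-around of indices is an assumption of this sketch.
            total += field[(yr + i) % ny, (xr + j) % nx] * s(i, j, rx, ry)
    return total

def advance_points(x, y, u, v, dt):
    """Forward Euler: move each point with its current velocity for one step."""
    return x + u * dt, y + v * dt
```

The same `advance_points` call applies to both the solid points and the prescribed points; only the velocities passed in differ.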
Alfred von Loebbecke, Rajat Mittal, Russell Mark, and James Hahn. A computational method for analysis of underwater dolphin kick hydrodynamics in human swimming. Sports Biomechanics Journal, 8(1):60–77, March 2009.
CFD-Wiki. Linear schemes - structured grids.
Nvidia Corporation. Nvidia CUDA C programming guide, November 2011.
Ronald Henderson. Details of the drag curve near the onset of vortex shedding.
Andreas Klöckner, Nicolas Pinto, Yunsup Lee, Bryan C. Catanzaro, Paul Ivanov, and Ahmed Fasih. PyCUDA: GPU run-time code generation for high-performance computing. CoRR, abs/0911.3456, 2009.
Ming-Chih Lai and Charles S. Peskin. An immersed boundary method with formal second-order accuracy and reduced numerical viscosity. Journal of Computational Physics, 160(2):705–719, 2000.
Prof. Charles Peskin. The immersed boundary method in a simple special case.
Prof. Charles Peskin. tar file of matlab programs.
E.M. Saiki and S. Biringen. Numerical simulation of a cylinder in uniform flow: application of a virtual boundary method. Journal of Computational Physics, 123(2):450–465, 1996.
Tao Tang. Moving mesh methods for computational fluid dynamics.
Prof. Timothy Tautges. Mesh generation.
Prof. Gretar Tryggvason. Solving the Navier-Stokes in primitive variables I, Spring 2010.
Figure 13: Vorticity Contours at Different Reynolds Numbers
Figure 14: Velocity Magnitude at Different Reynolds Numbers
Figure 15: Forces at Different Reynolds Numbers
Figure 16: Immersed Solid Point Dispersion at Different Reynolds Numbers
Figure 17: Discretization of the Swimmer
Figure 18: Variable Spring Model of the Swimmer
Figure 19: Point Shift of the Swimmer
Figure 20: Forces on the Swimmer
Figure 21: Flow Around the Swimmer at T= 25s
Figure 22: Flow Transition Dependent on Reynolds Number