Dramatically Improve Compute-Intense Applications in the Supercomputing Cloud

June 2011

White Paper

A Practical Guide to High-Performance Computing with COMSOL Multiphysics and Microsoft HPC
By Konrad Juethner

J2Methods

Contents
The Supercomputing Buzz
Microsoft Clusters Are Here
The Status Quo
Three Big Questions
Workstation to Cluster Transition
Home Computer Network
COWs as a Windows HPC Feature
Easy COMSOL Installation
Taking a Test Drive
Budget-Cluster Performance
Microsoft Field Support Cluster
Embarrassingly Parallel Computations
Field Cluster Performance
Taming Any Problem Size
Going Large Scale with Azure
The Verdict

The Supercomputing Buzz
By now, we have all heard the buzz about parallel computing, clusters, and the cloud. But what does it all mean? In this white paper, I will share with you my experience with setting up and running cluster simulations with Windows HPC Server. While the benefits of large and collaborative computer networks have been obvious for many years, clusters have been available only to a select few in well-funded environments such as government research and big industry. The technology has now become so mature and flexible that it is possible to configure very small experimental clusters ad hoc and to excel at much grander scales, reaching previously unseen performance levels. COMSOL’s collaboration with Microsoft has led to a tight integration of COMSOL Multiphysics® with Windows® HPC Server. This software combination reduces the upfront investment in equipment and technical expertise so dramatically that high-performance computing is now ready for the mainstream. To enable this exploration and the production of real data, COMSOL and Microsoft provided direct support, the necessary trial software, and a brief intro course to Windows HPC Server. Within a few days, they said, I would be up and running. This I had to see, so I took the bait.
Microsoft Clusters Are Here
During the last year, Microsoft has increased its high-performance computing staff from 50 to 500 and is bringing software and compute clusters to customers in many markets. My first big surprise was Microsoft’s claim to be offering the most economical HPC solution. How could it be cheaper than a free and open source operating system such as Linux? The answer, of course, is labor! Although the flexibility of Linux is attractive, its administration can be a handful. This is reflected in commercial production environments where Windows administrative support is much more readily available and arguably cheaper. My perception of the single workstation world is similar. Where die-hard numerical experts typically gravitate to Linux, the broad mass of engineering practitioners favors the Windows platform due to their generally greater familiarity with its GUI-driven operation. With COMSOL Multiphysics running on all relevant operating systems available today, the choice appears to be simply one of user preference. Personally, I tend to think Linux in single workstation settings. However, when reviewing Windows HPC Server, I experienced a paradigm shift in realizing that it is exactly this GUI familiarity with the Windows OS that makes this product highly relevant in the cluster world and establishes a distinct competitive advantage. I followed this notion and decided to explore whether or not the combination of COMSOL Multiphysics and Windows HPC Server would make practical sense.

© 2011, COMSOL. All rights reserved. This White Paper is published by COMSOL, Inc. and its associated companies. COMSOL and COMSOL Multiphysics are registered trademarks of COMSOL AB. Capture the Concept, COMSOL Desktop, and LiveLink are trademarks of COMSOL AB. Other product or brand names are trademarks or registered trademarks of their respective holders.

The Status Quo
To increase computational performance, we have typically bought larger and larger workstations, increased utilization by centralizing servers, justified internally managed clusters in industry and government, and have negotiated or purchased access to them. The last resort has been to simply declare a computational problem too large or too time-consuming to solve. So, where would this product combination take us?

Three Big Questions
To convince myself of the value of the COMSOL Multiphysics/Windows HPC Server combination, three questions had to be answered:
1. Can an engineer with moderate IT skills configure a low-budget cluster easily, thus supporting the ease of use and low-cost claims?
2. Do parametric sweeps scale favorably, providing a distinct advantage over single workstations or servers?
3. Can contiguous memory blocks, originally required to solve very large problems, be segmented across cluster nodes?

Workstation to Cluster Transition
To address Big Question 1, how can we define “easy” and “low cost”? I thought a good way was to enter an engineer’s home and see if it was possible to build a Microsoft HPC Cluster by only using what we can find. So, my home was about to be declared the test lab for this experiment. Right off the bat, I ran into a relatively tall order for home networks, and that was Microsoft Active Directory. It is a hard requirement for Windows HPC Server, and you really need to set aside one computer to become the domain controller. For those of you who are unfamiliar with network domains and Active Directory, let me just say that it elevates what you are used to seeing in Control Panel -> User Accounts on a single computer to a domain-level user/computer administration tool that spans a computer network. A second, though less significant, hurdle surfaced in the installation requirements of Windows HPC Server 2008 R2 [3]. Head nodes, broker nodes, and compute nodes only run on 64-bit architectures. While this could be a show stopper for folks who are still holding on to their 32-bit tractors, I convinced myself by visiting electronic stores in the Boston area that new systems for residential computing are practically all 64-bit. What eases potential migration concerns even further is the recently introduced workstation node feature [3], which empowers us to integrate x86-based processors (i.e. 32-bit) via Windows 7 Professional or Enterprise to form clusters of workstations (COWs).

Home Computer Network
An inventory of my home surfaced two relatively late model 64-bit machines. To keep things separate from my existing network, I reactivated an old router which had been collecting dust and established a separate cluster computing subnet. As I later learned, this turned out to be a good move since the domain controller (DC) was much happier with its own DNS server.

The choice for the DC became clear during a coffee break. HEADNODE (AMD Athlon™ II X2 250u Processor 1.60 GHz, 4.00 GB Installed Memory), our all-in-one kitchen computer, was only serving up electronic recipes and latest photographs, and could be reconfigured temporarily. I obtained domestic approval for such a drastic change by providing the ironclad guarantee that HEADNODE could be returned to its current operating state at a moment’s notice. This meant swapping out its hard drive for this cluster project.

The next steps were cookie cutter: Download and install Microsoft’s standard 180 day trial for Windows HPC Server 2008 R2 and assign to it DC as well as DNS server roles. In a production environment, you would typically separate the DC role from DNS server and computational head node. However, I was all about getting by with the least possible amount of hardware resources and therefore ended up with Windows HPC Server detecting Network Topology 5 as shown in Figure 1. This is a Microsoft definition and means that all nodes reside on the enterprise network. Since this topology routes all cluster traffic through the general enterprise network, it is not recommended for performance. However, it is the lowest-cost option and therefore useful for our low-budget testing purposes.

Finally, the installation of HPC Pack 2008 R2 (which is part of Microsoft’s 180 day trial offer) completes the configuration of HEADNODE and provides access to the cluster manager from where you administer the cluster and configure a variety of cluster node types such as compute, broker, and workstation.

For convenient administration, Windows HPC Server enables you to deploy cluster nodes directly from the head node across the network from “bare metal” via the specification of *.iso operating system images. However, I did not have any computers to spare that could assume the roles of dedicated compute nodes. I was looking for another way and found it in one of the latest features of Windows HPC Server 2008 R2 Enterprise Edition.

Figure 1: Microsoft Network Topology 5 can be deployed immediately in any simple network and is great for testing. That’s why I used it for this low-budget cluster. In a production environment, you would want to isolate cluster traffic from your main (i.e. enterprise) network and optimize it via Network Topology 3.

Figure 2: The low-budget cluster is shown in the home network context. Although a router is used to isolate all nodes from the home network, they still share the same subnet and therefore represent Network Topology 5.

COWs as a Windows HPC Feature
According to the Windows HPC Server system requirements [3], 64-bit and 32-bit workstations running Windows 7 Professional can assume workstation node status and join as well as collaborate with the pool of compute nodes. This was great news and the perfect option for my main Windows 7 Professional workhorse named WORKERNODE (Intel® Core™ i7 CPU Q820 @ 1.73 GHz, 8.00 GB Installed Memory). Following this concept, I moved WORKERNODE from the home network to the cluster subnet as shown in Figure 2, assigned it to the cluster domain, and carried out the brief client installation of HPC Pack 2008 R2. The cluster manager could now be accessed from anywhere on the network and WORKERNODE utilized as a computational resource via the workstation node deployment method as shown in Figure 3. As a result, WORKERNODE was now listed in the cluster manager in Figure 4 and ready for work. The integration was seamless and, to my amazement, even supported cluster computations while I was logged into WORKERNODE as a user. It quickly got even better.

Figure 4: The HPC Cluster Manager provides you with true mission control. In this status view, both HEADNODE and WORKERNODE are reported as online and ready for computational tasks.

Figure 3: I was wowed by the variety and ease of Windows HPC node deployment methods. For instance, you can make any networked workstation deployable within a minute by installing the tiny Windows HPC Pack. When adding a node to your cluster, you are asked to make a simple choice as shown.

Easy COMSOL Installation
Out of the box, COMSOL Multiphysics can connect to a Windows HPC Cluster via the addition of a COMSOL cluster computing node in the model builder tree of any computational model as illustrated in Figure 5. Since COMSOL recommends one single physical installation on the head node and sharing it via UNC path, i.e. \\HEADNODE\comsol42, the installation was as straightforward as on a standalone workstation. While I chose a minimalistic setup and neglected all performance-enhancing recommendations such as role separation and parallel subnets to handle communication and application data, I was taken aback by the flexibility of this software system and its ability for reconfiguration on the fly. In this context, it should be noted that the HPCS2008 Pack comes free with HPC Server and enables cluster access from any domain workstation. This option would be typical in fast LAN environments and preferable if end users require interaction with COMSOL Multiphysics high-end graphics capabilities for modeling or report generation purposes.

Figure 5: In this COMSOL Multiphysics GUI view, the Vented Loudspeaker Enclosure model from the built-in Model Library is shown after the Cluster Computing node was added. This node establishes the connection to the desired compute cluster and can be added to any model. It is a nifty design that makes switching gears between workstation and cluster computing very convenient.

Taking a Test Drive
To test this cluster, I loaded the Vented Loudspeaker Enclosure shown in Figure 5. Similar to most COMSOL Multiphysics models, it is based on a simple user choice of the appropriate physics in an intuitive graphical user interface. In this case, the built-in acoustic-structure interaction formulation describes how an acoustic wave communicates physically with a structure, which is what a loudspeaker membrane and its surrounding air pressure field do. COMSOL Multiphysics evaluates this formulation on each of the tetrahedral subdivisions shown in the computational mesh of Figure 6 and finds a piecewise continuous solution that spans the entire domain via the finite element method. Within each element the solution is continuous and characterized by polynomial coefficients which represent the unknown variables or degrees of freedom (DOF). The DOF grow with increasing mesh density, a fact we will later use to increase problem size. Among the infinitely many ways to illustrate the results of this computation, one could present a slice plot of the sound pressure field illustrated in Figure 7 and the mechanical displacement field of the speaker membrane in Figure 8.
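To make the link between mesh density and DOF more tangible, the following Python sketch counts the DOF of a deliberately simplified mesh. It assumes a cube divided into a structured grid of Lagrange elements, which is an illustration only: the loudspeaker model uses an unstructured tetrahedral mesh, so its DOF count will not follow this formula. The function name and its parameters are hypothetical.

```python
def structured_dof_count(elements_per_edge: int, order: int = 2, fields: int = 1) -> int:
    """Count DOF for a cube split into elements_per_edge**3 Lagrange elements.

    Assumption for illustration: a structured grid in which neighboring
    elements share the nodes on their common faces, edges, and corners.
    """
    if elements_per_edge < 1 or order < 1 or fields < 1:
        raise ValueError("all arguments must be positive integers")
    # Each edge of the cube carries order nodes per element plus one closing node.
    nodes_per_edge = elements_per_edge * order + 1
    return fields * nodes_per_edge ** 3


if __name__ == "__main__":
    coarse = structured_dof_count(10)
    fine = structured_dof_count(20)  # element size halved in every direction
    print(f"coarse: {coarse} DOF, fine: {fine} DOF, ratio: {fine / coarse:.2f}")
```

In this idealized setting, halving the element size multiplies the DOF by a factor approaching eight. Real models with size limits that only bind in parts of the geometry can grow more slowly.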

Figure 6: Computational mesh of the Vented Loudspeaker Enclosure

Figure 7: Illustration of the qualitative sound pressure level field sliced along the geometry’s plane of symmetry

Figure 8: Illustration of the qualitative displacement field of the moving speaker components.

Budget-Cluster Performance
When used one at a time, HEADNODE and WORKERNODE carried out baseline sweeps of 32 frequencies at 135,540 DOF in 2,711 and 1,420 seconds, respectively. When HEADNODE was instructed to utilize WORKERNODE as a workstation node in Figure 9, the same computation took 1,729 seconds.

While this was faster than what HEADNODE could accomplish by itself, the low-budget cluster was slower than WORKERNODE alone. This is probably due to cluster network traffic that WORKERNODE does not encounter by itself, my disregard for performance recommendations, and my choice of the least desirable Topology 5. After all, this low-budget cluster was not intended to perform but to verify ease of configuration and use in the context of humble hardware resources. Looking back, the configuration of this low-budget cluster took less than a day. And, now that I know what to do, I could probably do it again within one morning while comfortably sipping a cup of coffee. To reach greater performance would mean investment in additional computing and networking hardware. And this is what we did in the old days.
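The comparison between the three timings can be expressed as simple speedup ratios. The short Python sketch below uses only the run times quoted in this section; the function and constant names are hypothetical, chosen for illustration.

```python
def speedup(baseline_seconds: float, cluster_seconds: float) -> float:
    """Return how many times faster the cluster run is than the baseline run."""
    if baseline_seconds <= 0 or cluster_seconds <= 0:
        raise ValueError("run times must be positive")
    return baseline_seconds / cluster_seconds


# Run times in seconds as quoted in the text (32 frequencies, 135,540 DOF).
HEADNODE_ALONE = 2711.0
WORKERNODE_ALONE = 1420.0
BUDGET_CLUSTER = 1729.0

if __name__ == "__main__":
    print(f"vs HEADNODE alone:   {speedup(HEADNODE_ALONE, BUDGET_CLUSTER):.2f}x")
    print(f"vs WORKERNODE alone: {speedup(WORKERNODE_ALONE, BUDGET_CLUSTER):.2f}x")
```

A ratio above 1 means the cluster run was faster than the baseline and a ratio below 1 means it was slower, which is the case relative to WORKERNODE alone.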

Microsoft Field Support Cluster
Today, we are seeing an onslaught of hosted cluster solutions which are deployed to support high-performance computing applications such as data warehousing, transaction processing, and engineering simulation. Many big Internet and software companies have begun to offer such services. To take this investigation to the next level and answer Big Question 2 about favorable scaling of embarrassingly parallel computations, none other than Microsoft came to the rescue by configuring the bigger and better field support cluster shown in Figure 10.

Figure 9: View of the HPC Cluster Manager during low-budget cluster testing; note that the request for one node fires up one node in the Heat Map.

Figure 10: Network flow chart of Microsoft EEC Field Support Cluster #2; note how compute nodes are isolated on separate private and application networks to represent Network Topology 3 (see also Figure 11)

Embarrassingly Parallel Computations
The most trivial need for parallel computing arises when the goal is to carry out many similar computations. Think of the thousands of customers of an investment bank whose portfolio performance needs to be predicted regularly based on ever-changing investment tactics. Since investment decisions are time-sensitive, it is easy to see that the edge goes to those brokers who can evaluate client portfolios the fastest. Instead of figuring out one client at a time, the idea is to compute all portfolio predictions at once, i.e. in parallel. You can take this further and even fan out the computations for each individual stock. The happy medium will be anywhere between a feasible hardware price tag and ROI.

With Microsoft Excel at the forefront of computations in many industries, it comes as no surprise that Windows HPC Server support for parallel Microsoft Excel was largely driven by the financial industry.

The analogous engineering problem is called parametric and exemplified in the vented loudspeaker enclosure of the previous section. The parameter investigated in this case is the excitation frequency of the speaker, which affects both the membrane deformation and the surrounding air pressure. Unlike computing one frequency at a time as done earlier, we will utilize Windows HPC Server to solve as many parameters as possible at any given moment in time.

It should be noted that the following computations were carried out ad hoc and in the absence of a highly controlled benchmarking environment. This choice was quite deliberate to reflect realistic working conditions of the typical engineer who has to produce results regardless of circumstance. But then, I was being spoiled with the Microsoft cluster, which was configured using the optimal cluster network Topology 3 as detected by the cluster manager. As Figure 11 shows, traffic from the enterprise network is routed through the head node and cluster traffic confined to its dedicated private and application networks.

Figure 11: Microsoft Network Topology 3

I accessed this Microsoft cluster by sequentially VPN’ing into a Microsoft gateway machine and the head node \\node000. While we expected Microsoft’s VPN service to be fast and reliable, I will admit that I have never seen anything faster. Such VPN connections are favorably light on WAN traffic and remarkable in their efficiency. However, there is a trade-off in graphics performance, which is inferior to running the cluster from a local workstation as discussed earlier.

Field Cluster Performance
The configuration of the cluster computing node to communicate with the Field Cluster is analogous to that of the low-budget cluster in Figure 9. However, now we have the ability to request 16 nodes.

Shortly after invoking the Solve command in COMSOL, the heat map in the cluster manager in Figure 12 lights up and shows all 16 compute nodes at nearly full CPU capacity.
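A sweep over excitation frequencies is embarrassingly parallel because each frequency can be solved without any data from the others. As a rough illustration of the pattern, and not of how COMSOL or Windows HPC Server dispatch work internally, the Python sketch below maps independent solves onto a pool of workers. The pool of local threads merely stands in for cluster nodes, and `solve_one_frequency` is a hypothetical placeholder (a damped single-DOF oscillator with made-up parameters) for a full finite element solve.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def solve_one_frequency(freq_hz: float) -> float:
    """Placeholder for one frequency-domain solve.

    Returns the response amplitude of a damped oscillator under unit
    forcing. The parameter values are hypothetical.
    """
    mass, damping, stiffness = 0.01, 0.5, 4000.0
    omega = 2.0 * math.pi * freq_hz
    return 1.0 / math.hypot(stiffness - mass * omega ** 2, damping * omega)


def parametric_sweep(frequencies, max_workers=4):
    """Solve every frequency independently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(solve_one_frequency, frequencies))


if __name__ == "__main__":
    # 32 frequencies, matching the count of the baseline sweeps; the values are made up.
    freqs = [50.0 * (i + 1) for i in range(32)]
    amplitudes = parametric_sweep(freqs)
    peak_amplitude, peak_freq = max(zip(amplitudes, freqs))
    print(f"largest response at {peak_freq:.0f} Hz")
```

Because no task waits on another, adding workers shortens the sweep until overhead or the number of parameters becomes the limit. In this toy version the threads show the dispatch pattern only and are not expected to deliver a real speedup.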

Figure 12: View of the HPC Cluster Manager during Field Cluster testing; note that the request for 16 nodes fires up 16 nodes in the Heat Map.

With this configuration, I was now able to measure computation time with respect to number of compute nodes assigned.

At zero compute nodes, head node node000 did all the work and finished in about 1,000 seconds or 18 minutes as shown in Figure 13. Already faster than WORKERNODE, the same figure indicates that this Microsoft test cluster achieved a speedup of a factor of 6 down to 200 seconds when using all 16 nodes.

Of course, there are many dependencies such as number of parameters and problem size. To get a feeling for their significance, I divided the minimum and maximum element size requirements by 2, which increased the DOF from 135,540 to 727,788 and unleashed economies of scale of our cluster solution. With all 16 nodes engaged, the maximum speedup jumped from less than 6x for 135,540 DOF to more than 11x for 727,788 DOF as presented in Figures 13 and 14, respectively. Given that engineering computations are routinely measured in days or weeks, an improved turnaround of a factor of 11 is commercially viable.

When running this last set, I noticed that the consumed amount of memory ranged around 15 GB, which made me curious whether or not this larger problem would still run on the low-budget cluster. It did not, which I interpreted as an out-of-memory issue and a perfect entry point for Big Question 3 about taming problem size via memory segregation or decomposition.

Figure 13: Field Cluster performance for embarrassingly parallel computations using a finite element model with 135,540 DOF. At 16 compute nodes, the speedup nearly reaches 6x.

Figure 14: Field Cluster performance for embarrassingly parallel computations using a finite element model with 727,788 DOF. At 16 compute nodes, the speedup exceeds 11x.
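One way to read the two speedup figures is to convert them into parallel efficiency and an estimated serial fraction. The Python sketch below does this with the Karp-Flatt metric, a standard diagnostic that is not part of the original measurements. It assumes round values of 6x and 11x (the captions say "nearly" and "exceeds") and counts exactly 16 workers, ignoring any contribution from the head node. The function names are hypothetical.

```python
def parallel_efficiency(speedup: float, nodes: int) -> float:
    """Fraction of ideal linear scaling achieved: speedup divided by node count."""
    if speedup <= 0 or nodes < 1:
        raise ValueError("speedup must be positive and nodes at least 1")
    return speedup / nodes


def karp_flatt_serial_fraction(speedup: float, nodes: int) -> float:
    """Experimentally determined serial fraction (Karp-Flatt metric)."""
    if speedup <= 0 or nodes < 2:
        raise ValueError("speedup must be positive and nodes at least 2")
    return (1.0 / speedup - 1.0 / nodes) / (1.0 - 1.0 / nodes)


if __name__ == "__main__":
    # (DOF, approximate speedup at 16 nodes) taken from Figures 13 and 14.
    for dof, observed in ((135_540, 6.0), (727_788, 11.0)):
        eff = parallel_efficiency(observed, 16)
        serial = karp_flatt_serial_fraction(observed, 16)
        print(f"{dof} DOF: efficiency {eff:.1%}, serial fraction {serial:.1%}")
```

Under these assumptions the larger model shows a markedly smaller serial fraction, which is consistent with the observation that the bigger problem scales better.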

Taming Any Problem Size
As the size of the numerical problem increases, we can instruct COMSOL Multiphysics to distribute the assembled matrices across all available nodes via Windows HPC Server. With this technique, any out-of-reach problem can be brought back into scope by simply adding compute nodes to the cluster.

Since the low-budget cluster did not have sufficient memory, I decided to upgrade my notebook from Windows 7 Home to Windows 7 Professional, install HPC Pack 2008 R2, and add NOTEBOOKNODE (Intel® ATOM™ CPU @ 1.60 GHz, 3.00 GB Installed Memory) to increase the computational capacity of the low-budget cluster as depicted in Figures 15 and 16. And this did the trick.

Without a doubt, a large amount of memory is consumed in this particular model by storing all solutions in the model tree. Although it would be much more efficient to only store selected points or derived values such as average, maximum, or integral, it was my deliberate intent to produce an out-of-memory situation and solve it by adding cluster hardware.

Resorting to brute force in this fashion has never been easier and can make a lot of business sense. This becomes particularly attractive when tapping your organization’s creative genius is expensive or model simplification is just not possible. What a trick to have up your sleeve to simply add a few nodes to increment collective cluster computing power. I stood no chance to beat the Microsoft field cluster in performance, but was able to solve a problem that would have been out of scope without the integrating Microsoft HPC platform. Note how different this is from out-of-core strategies that utilize the hard drive. Instead of off-loading excess demand to slow hard drives, all computations are firmly kept in the much faster memory context, making the scaling of hardware a real and not only theoretical solution proposition.

Figure 15: Adding further memory capacity to the low-budget cluster which allows you to tackle larger problems via COMSOL Multiphysics memory segregation algorithms.

Figure 16: After adding NOTEBOOKNODE to the cluster, its contribution can be monitored via the Heat Map. It is intriguing to think of linking up the combined resources of departments or even companies to address ad-hoc and temporary supercomputing needs in this way.
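The memory argument can be checked with back-of-the-envelope arithmetic. The Python sketch below sums the installed memory quoted for the three machines and compares it with the roughly 15 GB the larger model was observed to consume. This is a crude proxy under stated assumptions: it ignores operating system overhead and any data that must be replicated on every node. The helper names are hypothetical.

```python
# Installed memory in GB as quoted for each machine in the text.
NODE_MEMORY_GB = {
    "HEADNODE": 4.0,
    "WORKERNODE": 8.0,
    "NOTEBOOKNODE": 3.0,
}

# Approximate memory consumption observed for the 727,788 DOF sweep.
REQUIRED_GB = 15.0


def aggregate_memory_gb(nodes) -> float:
    """Total installed memory across the named cluster nodes."""
    return sum(NODE_MEMORY_GB[name] for name in nodes)


def fits_in_cluster(nodes, required_gb: float = REQUIRED_GB) -> bool:
    """Crude check: does the combined installed memory reach the requirement?"""
    return aggregate_memory_gb(nodes) >= required_gb


if __name__ == "__main__":
    before = ["HEADNODE", "WORKERNODE"]
    after = before + ["NOTEBOOKNODE"]
    for label, nodes in (("before", before), ("after", after)):
        total = aggregate_memory_gb(nodes)
        print(f"{label}: {total:.0f} GB installed, fits: {fits_in_cluster(nodes)}")
```

On this rough accounting, the two-node cluster falls short while the three-node cluster just reaches the observed demand, which matches the outcome reported above.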

Going Large Scale with Azure
Since there are only so many computers to salvage in any environment, we would eventually incur the capital expense of additional hardware. This sounds expensive and highly committed. However, with the increasing number of hosted cluster solutions coming online daily, the idea of adding temporary resources becomes increasingly tangible. One of the most recent and exciting propositions in this context comes again from Microsoft. Its Windows Azure platform interfaces seamlessly with Windows HPC Server as pointed out by David Chappell in September of 2010 [1]. COMSOL Multiphysics support for Azure is expected to be available in the near future. What this means is that, similarly to adding workstation nodes like we did in the low-budget cluster section, you can augment your existing On-Premises Compute Cluster with any number of Azure compute nodes as shown in Figure 17. Importantly, you can do so temporarily and at the flip of a switch in situations where on-premises capacity is exceeded. Furthermore, there is nothing to stop you from offloading all compute nodes to the Windows Azure Data Center as shown in Figure 18 and only retaining the head node on-premises. The idea is to rent clusters of any size, at any time, and for any time frame. Revolutionary from a business perspective is, according to Chappell [1], that “this allows tilting HPC costs away from capital expense and toward operating expense, something that’s attractive to many organizations.”

Figure 18: Replacement of On-Premises Compute Cluster with Windows Azure Data Center [1]

Figure 17: Augmentation of On-Premises Compute Cluster with Windows Azure Data Center [1]

The Verdict
Playing guinea pig as an engineer with elementary IT skills, I was able to understand the available network topologies and configure the low-budget cluster. In fact, the experience was quite enjoyable. A welcome surprise was Windows HPC Server’s configuration flexibility and viability in very small networks like my own. Workstation node integration on the fly enables standard business computers as compute nodes and the temporary metamorphosis of entire business networks into COWs that play supercomputer on nights and weekends. While the concept is neither very complicated nor new, Windows HPC Server is the first software system that has pulled this vision together feasibly for the mainstream. Out of this world is the ability to manage these changes centrally via one configuration manager without any additional hardware and physical configuration requirements.

Exploratory speedup factors of 6x and 11x in the context of embarrassingly parallel COMSOL Multiphysics computations provide a powerful business justification for Windows HPC Server. The ability to divide and conquer by distributing the memory required of any problem size allows us to draw conclusions to problems we can’t even fathom today. In addition to integrating business networks with traditional HPC Clusters, Windows Azure expands the flexible configuration concept to the domain of fast-growing cloud computing services. The blend of all three provides us with a powerful tactical toolset that enables us to conquer today’s largest and toughest technical challenges.

If you have been thinking about a COMSOL Cluster solution, there is no time to waste. COMSOL Inc. has introduced an extremely generous cluster licensing scheme which consumes only one floating network license (FNL) key per cluster. In other words, if you intend to run ten thousand nodes in parallel, you will only need one FNL key.

“High-performance computing” has entered “a new era. The enormous scale and low cost of cloud computing resources is sure to change how and where HPC applications are run. Ignoring this change isn’t an option.” [1]

References
[1] Windows HPC Server and Windows Azure: High-Performance Computing in the Cloud, David Chappell, September 2010, Sponsored by Microsoft Corporation. http://www.microsoft.com/windowsazure/Whitepapers/HPCServerAndAzure/default.aspx
[2] Windows HPC Server 2008 R2 Suite, Technical Resources. http://www.microsoft.com/hpc/en/us/technical-resources/overview.aspx
[3] Windows HPC Server 2008 R2 Suite, System Requirements. http://www.microsoft.com/hpc/en/us/product/systemrequirements.aspx
[4] COMSOL Multiphysics 4.2 Product Documentation

Konrad Juethner, J2Methods
www.j2methods.com
Konrad Juethner is a Consultant and Owner of J2Methods. As a physicist and mechanical engineer by training, he has accumulated an extensive background in simulation-driven engineering. J2Methods employs software integration solutions that deliver significant efficiency and quality gains for large engineering organizations.

J2Methods 15 Knowlton Drive Acton, MA 01720 781-354-2764 www.j2methods.com

www.comsol.com
COMSOL, Inc. 1 New England Executive Park Suite 350 Burlington, MA 01803 U. S. A. Tel: +1-781-273-3322 Fax: +1-781-273-6603
