
Performance of a Parallelized Euler Code

My field of research is Computational Fluid Dynamics (CFD). I therefore picked a computational application in which a simple two-dimensional Euler code was parallelized, as an example of what parallelization has to offer to CFD.

The Euler Code FLO52

An existing flow code, FLO52, written by Antony Jameson [1], was parallelized by Barszcz et al. [2] in March 1989. FLO52 is widely used in research and industrial applications throughout the world. It produces good results for problems in its domain of application (steady inviscid flow around two-dimensional bodies), giving solutions in which shocks are captured without oscillations. It converges rapidly to steady state and executes quickly on conventional (uniprocessor) supercomputer architectures. In addition to its widespread use, FLO52 was chosen for study by Barszcz et al. [2] because of the multigrid acceleration it employs; in particular, the efficient parallelization of the algorithm's grid hierarchy is non-trivial.

In summary, the algorithm under consideration consists of an explicit multistage relaxation method (a four-stage Runge-Kutta-like method) with local time stepping, implicit residual averaging, and multigrid to accelerate convergence.
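
As a rough illustration, a minimal Python sketch of such a multistage update with local time stepping is given below. The stage coefficients and the residual function are illustrative placeholders, not the actual coefficients or residual of FLO52.

    import numpy as np

    def multistage_relax(w, residual, local_dt, alphas=(0.25, 1.0/3.0, 0.5, 1.0)):
        # One four-stage Runge-Kutta-like relaxation sweep on the local data.
        # w        : conserved variables on this processor's subdomain
        # residual : function returning the spatial residual R(w)
        # local_dt : per-cell time step (local time stepping), same shape as w
        # alphas   : stage coefficients (illustrative values only)
        w0 = w.copy()
        for a in alphas:
            w = w0 - a * local_dt * residual(w)   # every stage restarts from w0
        return w

    # toy usage: drive a one-dimensional field toward zero with R(w) = w
    w = multistage_relax(np.ones(8), residual=lambda w: w, local_dt=np.full(8, 0.1))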

Parallelization of FLO52

Conceptually, converting the sequential FLO52 code to a parallel version is relatively simple. The logical domain is decomposed into equal-sized subdomains, which are then assigned to different processors. Each subdomain has a boundary that contains values updated by other processors. Assignments were made using a binary reflected Gray code in two dimensions.
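
The sketch below (in Python) illustrates how a binary reflected Gray code can map a 2-D array of subdomains to hypercube node numbers so that adjacent subdomains land on directly connected nodes; the exact assignment used in [2] may differ in detail.

    def gray(i):
        # binary reflected Gray code of the integer i
        return i ^ (i >> 1)

    def node_for_subdomain(ix, iy, bits_y):
        # Map subdomain (ix, iy) to a hypercube node id, where bits_y is the
        # number of bits used for the y index.  Neighbouring subdomains get
        # ids that differ in exactly one bit, i.e. they sit on directly
        # connected hypercube nodes.
        return (gray(ix) << bits_y) | gray(iy)

    # 4 x 4 decomposition on a 16-node hypercube such as a 16-node iPSC/2
    layout = [[node_for_subdomain(ix, iy, bits_y=2) for iy in range(4)]
              for ix in range(4)]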

The code structure in the main body of the computation closely resembles the sequential version, with the exception of some re-ordering of the computations to decrease the communication overhead. The algorithm is fully explicit except for an implicit residual averaging scheme. The nested loops in the explicit sections now operate on the local subdomains instead of the whole domain. Each processor updates the points in its subdomain (applying a local differencing operator), then exchanges the updated boundary values with the appropriate neighbors. If a boundary corresponds to a physical boundary, then some boundary condition may be evaluated instead.

Most of the code corresponds to a local 5- or 9-point operator, but the fourth-order dissipation operator requires a larger stencil (using information from grid points at a distance of two). This larger stencil results in additional communication, since values at points on the subdomain boundary and at points adjacent to the boundary must be sent to the neighbors.
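
To make the communication pattern concrete, here is a small single-process Python sketch, assuming a ghost-cell (halo) layout of width two around each subdomain; on the actual hypercube the copy between arrays would of course be a message to a neighboring node.

    import numpy as np

    HALO = 2   # halo width required by the fourth-order dissipation stencil

    def west_strip(u):
        # interior strip of width HALO next to the west edge: what a processor
        # would send to its west neighbour
        return u[HALO:2*HALO, HALO:-HALO].copy()

    def fill_east_halo(u, strip):
        # the neighbour's strip is stored in this processor's east halo cells
        u[-HALO:, HALO:-HALO] = strip

    # two neighbouring subdomains, each 8 x 16 cells plus a halo of width 2
    left  = np.zeros((8 + 2*HALO, 16 + 2*HALO))
    right = np.ones((8 + 2*HALO, 16 + 2*HALO))

    # 'right' sends its west interior strip; 'left' stores it in its east halo
    fill_east_halo(left, west_strip(right))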

The implicit part of the code involves solving a series of tridiagonal systems. Being implicit, most of these operations cannot be performed well in parallel, and many experiments were made with this element of the code.
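
For reference, the sketch below shows the standard (serial) Thomas algorithm for one tridiagonal system; it is not necessarily the solver used in [2], but its forward and backward recurrences make clear why line-implicit operations are hard to distribute across processors.

    import numpy as np

    def thomas(a, b, c, d):
        # Solve a tridiagonal system with sub-diagonal a, diagonal b,
        # super-diagonal c and right-hand side d (all of length n; a[0] and
        # c[-1] are unused).  Both loops are recurrences along the line, so
        # the work is inherently sequential.
        n = len(d)
        cp = np.empty(n)
        dp = np.empty(n)
        cp[0] = c[0] / b[0]
        dp[0] = d[0] / b[0]
        for i in range(1, n):
            m = b[i] - a[i] * cp[i - 1]
            cp[i] = c[i] / m
            dp[i] = (d[i] - a[i] * dp[i - 1]) / m
        x = np.empty(n)
        x[-1] = dp[-1]
        for i in range(n - 2, -1, -1):
            x[i] = dp[i] - cp[i] * x[i + 1]
        return x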

Optimization of Parallel FLO52

To quantify the complications that arise as a result of parallelization, the main modules of the algorithm were analyzed. Several modifications were then introduced to improve parallel efficiency.

Modules of the code were grouped into the categories described below:

1. Flux Calculations.

2. Dissipation.

3. Local Time Step.

4. Residual Averaging (solving the tridiagonal systems).

5. Boundary conditions.

6. Enthalpy Damping.

7. Time Advance.

8. I/O.

9. Projection and Interpolation: Multigrid operations for grid transfers.

The run time (seconds), percentage of CPU time, and efficiency for the total operations running on one node and on 16 nodes of the iPSC/2, for the initial version of the code, were:

     1 node (s)      %    |   16 nodes (s)      %      e1(16)
        25027       100   |        2767        100       .57

The efficiency was defined relative to a base number of processors as follows:

e(p) = b*T(b) / (p*T(p))

where p is the number of processors whose efficiency is calculated, b is the base number of processors, and T(n) is the run time on n processors. For more details of the results, please refer to [2].
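
Plugging the timings above into this definition reproduces the quoted figure, as the short check below shows.

    def efficiency(p, T_p, b=1, T_b=None):
        # e(p) = b*T(b) / (p*T(p)), relative to a base of b processors
        return (b * T_b) / (p * T_p)

    # initial version: 25027 s on 1 node, 2767 s on 16 nodes
    print(round(efficiency(16, 2767, b=1, T_b=25027), 2))   # prints 0.57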

The above results were obtained using a three-level sawtooth multigrid method with one Runge-Kutta pre-relaxation sweep; 30 multigrid iterations were used on each of the two coarser grids to obtain an initial approximation on the next finer grid, and 200 iterations were used on the finest grid. The algorithm was applied to a transonic flow problem with an angle of attack of 1.25 degrees and a Mach number of 0.8. The finest grid in the multigrid procedure was a 256x64 mesh.
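
The grid-sequencing schedule just described can be summarized by the small driver sketch below. The coarse-grid sizes 64x16 and 128x32 are inferred by assuming standard factor-of-two coarsening of the 256x64 mesh, and multigrid_iteration and interpolate_to_finer are hypothetical placeholders for the real routines.

    # coarsest to finest grids and the number of multigrid iterations on each
    grids = ["64x16", "128x32", "256x64"]
    iterations = [30, 30, 200]

    def run_schedule(solution, multigrid_iteration, interpolate_to_finer):
        for level, (grid, iters) in enumerate(zip(grids, iterations)):
            for _ in range(iters):
                solution = multigrid_iteration(solution, grid)
            if level < len(grids) - 1:
                # pass the result up as the initial guess on the next finer grid
                solution = interpolate_to_finer(solution, grids[level + 1])
        return solution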

We can see from the above results that the efficiency is only 57% on the 16-node system. Still, a 57% efficiency corresponds to a speedup of better than 9 on the 16-node machine.

Detailed inspection of the timings (not shown in this presentation) revealed three areas where poor machine utilization significantly degraded the overall run time. These areas correspond to the flux, dissipation, and residual averaging routines. Modifications were thus applied to the original implementations of these sub-elements.

For the flux and dissipation routines, the code was rewritten to reduce the number of messages. Unfortunately, reducing the number of messages in the dissipation routine required a significant re-ordering of the computations and additional memory.
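
The idea behind reducing the message count can be illustrated with a generic packing scheme like the one below, in which several boundary strips bound for the same neighbor are concatenated into a single buffer; the actual restructuring in [2] is more involved than this sketch.

    import numpy as np

    def pack(strips):
        # concatenate several boundary strips into one flat buffer, so one
        # message per neighbour replaces one message per strip
        return np.concatenate([s.ravel() for s in strips])

    def unpack(buffer, shapes):
        # split a received buffer back into strips of the given shapes
        strips, offset = [], 0
        for shape in shapes:
            size = int(np.prod(shape))
            strips.append(buffer[offset:offset + size].reshape(shape))
            offset += size
        return strips

    # e.g. strips of the four conserved variables sent as a single message
    shapes = [(2, 16)] * 4
    message = pack([np.zeros(s) for s in shapes])
    recovered = unpack(message, shapes)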

Despite the improvements made to the tridiagonal solvers, they were still quite costly. In particular, for large numbers of processors, the low efficiency of the tridiagonal solvers seriously affected the run time of the whole algorithm.

Based on these conclusions, the tridiagonal solvers were replaced by an explicit residual averaging scheme. On a single processor, the new algorithm takes approximately the same time to execute a single iteration as the tridiagonal scheme. However, as more processors are used, the new algorithm is faster per iteration than the tridiagonal scheme, and the residual averaging efficiency rises from 43% to 78% on 16 processors of the iPSC/2 [2].
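
An explicit residual smoothing of this kind can be sketched as a few local weighted-average sweeps, as below; this is only an illustration of the idea (each sweep needs one nearest-neighbor exchange instead of a distributed tridiagonal solve), not the exact scheme of [2].

    import numpy as np

    def explicit_residual_average(res, eps=0.5, sweeps=2):
        # Smooth the residual field with a few explicit weighted-average
        # sweeps.  Each sweep uses only immediate neighbours, so in parallel
        # it costs one halo exchange per sweep rather than a line-implicit
        # tridiagonal solve.
        r = res.copy()
        for _ in range(sweeps):
            s = r.copy()
            s[1:-1, 1:-1] = (r[1:-1, 1:-1]
                             + eps * (r[2:, 1:-1] + r[:-2, 1:-1]
                                      + r[1:-1, 2:] + r[1:-1, :-2])) / (1.0 + 4.0 * eps)
            r = s
        return r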

The final version with all the modifications was run and the following results were obtained:

     1 node (s)      %    |   16 nodes (s)      %      e1(16)
        24393       100   |        1725        100       .88

The main figure to notice is the overall improvement over the original version. The total run time on the 16-node system dropped from 2767 seconds to 1725 seconds, corresponding to a rise in efficiency from 57% to 88%. Unfortunately, this also implies that it is important to consider the computer architecture when designing and implementing even small sub-sections of an algorithm (this is the conclusion of the authors who parallelized the FLO52 code [2]).

Summary of the above presentation

Parallelization of FLO52 (based on domain decomposition) is straightforward. In fact, the main body of the code resembles the serial version, with the exception of the residual averaging. However, a substantial effort was required to produce a working implementation, and more time had to be spent optimizing the code for the machine in order to obtain efficient utilization. Nevertheless, the code gained tremendous execution speed, which leads to the conclusion that writing CFD codes for parallel computers will be a major trend in the near future.

Future and Current Status of Parallel Computational Fluid Dynamics

Parallel Computational Fluid Dynamics is a growing area. It shows great promise in dealing with future computational challenges.

These future challenges are fundamental problems in science and engineering, with broad applications, whose solution would be enabled by high-performance computing technology that could become available in the near future.

There are two such challenges in Computational Fluid Dynamics:

1. Integrated multi-disciplinary simulations and design optimizations of aerospace vehicles throughout their mission profiles.

2. Multi-disciplinary modeling and data analysis of Earth and space science physical phenomena.

In the high-performance aircraft (HPA) area, the primary interest is to develop the capability to predict the performance of next-generation fighter concepts operating in the most critical portion of their flight envelope. To achieve performance levels beyond those of present vehicles, these next-generation fighter designs must include higher levels of system integration than can be obtained with present design tools. Toward this goal, aerodynamic, propulsion-system, controls, structural, and even acoustic analysis modules will be integrated into a single software system.

The challenge posed by the development and application of such a multi-disciplinary high performance aircraft analysis can only be met through parallel computing.

References:

[1] A. Jameson, "Solution of the Euler Equations for Two Dimensional Transonic Flow by a Multigrid Method," Applied Mathematics and Computation 13:327-355 (1983).

[2] E. Barszcz, T. F. Chan, D. C. Jespersen, and R. S. Tuminaro, "Performance of an Euler Code on Hypercubes," NAS Technical Report RNR-89-005, April 1989.