CPS Homework
Performance of a Parallelized Euler Code
My field of research is CFD (Computational Fluid Dynamics).
I therefore picked a computational application in which a simple
two-dimensional Euler code has been parallelized, as an example of what
parallelization has to offer CFD.
The Euler Code FLO52
An existing flow code, FLO52, written by Antony Jameson [1], was
parallelized by Barszcz [2] in March 1989. FLO52 is widely used in
research and industrial applications throughout the world. It produces
good results for problems in its domain of application (steady inviscid
flow around two-dimensional bodies), giving solutions in which shocks are
captured without oscillations. It converges rapidly to steady state and
executes rapidly on conventional (uniprocessor) supercomputer
architectures.
In addition to its widespread use, FLO52 was chosen by
Barszcz [2] for study because of the multigrid acceleration it uses. In
particular, efficient parallelization of the algorithm's grid
hierarchy is non-trivial.
In summary, the algorithm under consideration
consists of an explicit multistage relaxation method (a 4-stage
Runge-Kutta-like method) with local time stepping, implicit residual
averaging, and multigrid to accelerate convergence.
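As an illustration of the multistage relaxation idea, the sketch below advances a state with four Runge-Kutta-like stages. The stage coefficients 1/4, 1/3, 1/2, 1 and the toy residual are illustrative assumptions and are not taken from FLO52.

    import numpy as np

    def multistage_step(w, residual, dt):
        # One explicit multistage (Runge-Kutta-like) relaxation step.
        # Each stage restarts from the initial state w0 and uses the
        # residual of the previous stage (the usual Jameson-style form).
        # The stage coefficients below are illustrative only.
        w0 = w.copy()
        for alpha in (0.25, 1.0 / 3.0, 0.5, 1.0):
            w = w0 - alpha * dt * residual(w)
        return w

    # Toy usage: a "residual" that drives w toward zero, with a local time step.
    w = np.linspace(1.0, 2.0, 8)
    dt = np.full_like(w, 0.1)
    w = multistage_step(w, residual=lambda u: u, dt=dt)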
Parallelization of FLO52
Conceptually, converting the sequential FLO52 code to a parallel version
is relatively simple. The logical domain is decomposed into equal-sized
subdomains, which are then assigned to different processors. Each
subdomain has a boundary that contains values updated by other processors.
Assignments were made using a binary reflected Gray code in two
dimensions.
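A minimal sketch of how such an assignment can be built is shown below, assuming each of the two subdomain indices is Gray-coded independently and placed in its own bit field of the hypercube node number; the exact mapping used in [2] may differ in detail.

    def gray(i):
        # Binary-reflected Gray code of the integer i.
        return i ^ (i >> 1)

    def node_for_subdomain(ix, iy, bits_x):
        # Map a 2-D subdomain index (ix, iy) to a hypercube node number by
        # Gray-coding each coordinate and packing the results into separate
        # bit fields.  Neighboring subdomains then differ in exactly one bit,
        # so they sit on directly connected hypercube nodes.
        return (gray(iy) << bits_x) | gray(ix)

    # 4 x 4 subdomains on a 16-node hypercube (2 bits per direction):
    for iy in range(4):
        print([node_for_subdomain(ix, iy, bits_x=2) for ix in range(4)])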
The code structure in the main body of the computation closely resembles
the sequential version, with the exception of some re-ordering of the
computations to decrease the communication overhead.
The algorithm is fully explicit except for the implicit residual averaging scheme.
The nested loops in the explicit sections now operate on the local
subdomains instead of the whole domain. Each processor updates points in
its subdomain (applying a local differencing operator). Then each
processor exchanges the updated boundary values with the appropriate
neighbor. If a boundary corresponds to a physical boundary, then some
boundary condition may be evaluated.
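On each processor, the per-iteration structure might look like the numpy sketch below, where the local array carries one layer of ghost cells and a Laplacian-like stencil stands in for the actual Euler flux and dissipation operators; all names here are illustrative, not taken from FLO52.

    import numpy as np

    def update_interior(w, dt):
        # Apply a local 5-point differencing operator to the interior points.
        # w carries one layer of ghost cells on each side; only w[1:-1, 1:-1]
        # is owned by this processor.  The Laplacian-like stencil is only a
        # stand-in for the real flux/dissipation operators.
        lap = (w[2:, 1:-1] + w[:-2, 1:-1] + w[1:-1, 2:] + w[1:-1, :-2]
               - 4.0 * w[1:-1, 1:-1])
        return w[1:-1, 1:-1] + dt * lap

    # One "iteration" on a local 8 x 8 subdomain (plus its ghost layer):
    w = np.random.rand(10, 10)
    w[1:-1, 1:-1] = update_interior(w, dt=0.1)
    # ...followed by exchanging the updated boundary rows/columns with the
    # neighboring processors, or applying a physical boundary condition
    # where a subdomain face lies on the physical boundary.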
The code corresponds to a local 5- or 9-point operator, but the
fourth-order dissipation operator requires a larger stencil (using
information from grid points a distance of two away). This larger stencil
results in additional communication, since information from points on the
subdomain boundary and from points adjacent to the boundary must be sent
to the neighbors.
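In other words, a two-deep halo has to be exchanged. A hedged mpi4py sketch of that exchange for a 1-D slab decomposition follows (the 2-D case repeats the same pattern in the second direction); mpi4py and an MPI launcher are assumed, and the array sizes are arbitrary.

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    left = rank - 1 if rank > 0 else MPI.PROC_NULL
    right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

    H = 2                              # halo depth needed by the 4th-order dissipation
    w = np.zeros((16 + 2 * H, 64))     # locally owned rows plus ghost rows

    # Send the two owned rows nearest each face, receive into the ghost rows.
    comm.Sendrecv(np.ascontiguousarray(w[H:2 * H]), dest=left, sendtag=0,
                  recvbuf=w[-H:], source=right, recvtag=0)
    comm.Sendrecv(np.ascontiguousarray(w[-2 * H:-H]), dest=right, sendtag=1,
                  recvbuf=w[:H], source=left, recvtag=1)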
The implicit part of the code involves solving a series of tridiagonal
systems. Being implicit, most of these operations cannot be performed well
in parallel. Many experiments were carried out with this element of the code.
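For reference, the classic (sequential) Thomas algorithm for one tridiagonal system is sketched below; both of its sweeps walk the rows in order, each row needing the result of the previous one, which is exactly the dependency that makes a distributed solve awkward.

    import numpy as np

    def thomas(a, b, c, d):
        # Solve a tridiagonal system: a = sub-, b = main-, c = super-diagonal,
        # d = right-hand side.  Note the strictly ordered forward and backward
        # sweeps -- the sequential dependency that hurts parallel efficiency.
        n = len(d)
        cp, dp, x = np.empty(n), np.empty(n), np.empty(n)
        cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
        for i in range(1, n):                      # forward elimination
            m = b[i] - a[i] * cp[i - 1]
            cp[i] = c[i] / m
            dp[i] = (d[i] - a[i] * dp[i - 1]) / m
        x[-1] = dp[-1]
        for i in range(n - 2, -1, -1):             # back substitution
            x[i] = dp[i] - cp[i] * x[i + 1]
        return x

    # Quick check against a dense solve on a small system.
    n = 5
    a = np.array([0.0, -1.0, -1.0, -1.0, -1.0])
    b = np.full(n, 4.0)
    c = np.array([-1.0, -1.0, -1.0, -1.0, 0.0])
    d = np.arange(1.0, n + 1.0)
    A = np.diag(b) + np.diag(a[1:], -1) + np.diag(c[:-1], 1)
    assert np.allclose(A @ thomas(a, b, c, d), d)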
Optimization of Parallel FLO52
To quantify the complications that arise as a result of parallelization
the main modules of the algorithm were analized. Several modifications
were then introduced to improve parallel efficiency.
Modules of the code were grouped into the categories described below:
1. Flux Calculations.
2. Dissipation.
3. Local Time Step.
4. Residual Averaging (solving the tridiagonal systems).
5. Boundary conditions.
6. Enthalpy Damping.
7. Time Advance.
8. I/O.
9. Projection and Interpolation: Multigrid operations for grid
transfers.
The run time (secs), percentage of CPU time, and efficiency for the
total operation running on one node and on 16 nodes of the iPSC/2 for the
initial version of the code were:

                 1 node          |          16 nodes
        run time (s)      %      |   run time (s)      %      e1(16)
           25027         100     |      2767          100       .57
The efficiency was defined relative to a base number of processors as
follows:
e(p) = b*T(b) / (p*T(p))
where p is the number of processors whose efficiency is calculated, b is
the base number of processors, and T(n) is the run time on n processors.
For more details of the results please refer to [2].
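For example, plugging the one-node and 16-node run times of the initial version into this definition (with b = 1) reproduces the quoted figure:

    T1, T16 = 25027.0, 2767.0       # run times in seconds from the table above
    e16 = (1 * T1) / (16 * T16)     # e(16) = b*T(b) / (p*T(p)) with b = 1
    print(round(e16, 2))            # -> 0.57, i.e. a speedup of roughly 9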
The results above were obtained using a three-level sawtooth
multigrid method with one Runge-Kutta pre-relaxation sweep; 30 multigrid
iterations were used on each of the two coarser grids to obtain an initial
approximation on the next finer grid, and 200 iterations were used on the
finest grid. The algorithm was applied to a transonic flow problem with an
angle of attack of 1.25 degrees and a Mach number of 0.8. The finest grid
in the multigrid procedure was a 256 x 64 mesh.
We can see from the above results that the efficiency is only 57% on the
16-node system. Still, a 57% efficiency corresponds to a speedup of better
than 9 on the 16-node machine.
Detailed inspection of the timings (not shown in this presentation)
revealed three areas where poor machine utilization significantly
degraded the overall run time. These areas correspond to the flux,
dissipation, and residual averaging routines. Modifications were therefore
applied to the original implementations of these sub-elements.
For the flux and dissipation routines, the code was rewritten to reduce the
number of messages. Unfortunately, reducing the number of messages in the
dissipation routine required a significant re-ordering of the computations
and additional memory.
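The report does not give the rewritten communication code, but the general idea of cutting the message count can be sketched as below: quantities that used to travel in separate messages (say, a flux row and a dissipation row) are packed into one buffer so that each face costs a single message, which matters when per-message latency dominates. The two-quantity split and the ring exchange are purely illustrative assumptions.

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    right, left = (rank + 1) % size, (rank - 1) % size

    flux_row = np.random.rand(64)
    diss_row = np.random.rand(64)

    packed = np.concatenate([flux_row, diss_row])      # one buffer instead of two
    recv = np.empty_like(packed)
    comm.Sendrecv(packed, dest=right, sendtag=0,
                  recvbuf=recv, source=left, recvtag=0)
    flux_ghost, diss_ghost = recv[:64], recv[64:]      # unpack on arrival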
Despite the improvements made to the tridiagonal solvers, they were still
quite costly. In particular, for large numbers of processors the low
efficiency of the tridiagonal solvers seriously affected the run time of
the whole algorithm.
Based on these conclusions, the tridiagonal solvers were replaced by an
explicit residual averaging scheme. On a single processor, the new
algorithm takes approximately the same time to execute a single iteration
as the tridiagonal scheme. However, as more processors are used, the new
algorithm is faster per iteration than the tridiagonal scheme, and the
residual averaging efficiency rises from 43% to 78% on 16 processors of the
iPSC/2 [2].
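The exact explicit scheme is not spelled out in this summary, but a common form of explicit residual smoothing simply applies a local weighted average a few times, as in the hedged 1-D sketch below; eps and the number of sweeps are illustrative values. Each sweep only needs a point's immediate neighbors, so on a decomposed grid it costs one halo exchange per sweep rather than a distributed tridiagonal solve.

    import numpy as np

    def explicit_residual_smoothing(r, eps=0.5, sweeps=2):
        # A few Jacobi-like passes of a local weighted average.  Only nearest
        # neighbors are needed, so the parallel version needs just one halo
        # exchange per pass instead of solving tridiagonal systems.
        for _ in range(sweeps):
            s = r.copy()
            s[1:-1] = (r[1:-1] + eps * (r[:-2] + r[2:])) / (1.0 + 2.0 * eps)
            r = s
        return r

    r = np.random.rand(16)
    r = explicit_residual_smoothing(r)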
The final version with all the modifications was run and the following
results were obtained:
                 1 node          |          16 nodes
        run time (s)      %      |   run time (s)      %      e1(16)
           24393         100     |      1725          100       .88
The main figure to notice is the overall improvement over the original
version. The total run time dropped from 2767 seconds to 1725 seconds on
the 16-node system. This corresponds to a rise in efficiency from 57% to
88%. The price of this gain, however, is that it is important to consider
the computer architecture when designing and implementing even small
sub-sections of an algorithm (this is the conclusion of the authors of the
FLO52 parallelization [2]).
Summary of the above presentation
Parallelization of FLO52 (based on domain decomposition) is
straightforward. In fact, the main body of the code resembles the serial
version, with the exception of the residual averaging. However, a
substantial effort was required to produce a working implementation, and
more time had to be spent optimizing the code for the machine to obtain
efficient utilization. Nevertheless, the code gained tremendous execution
speed, which leads to the conclusion that writing CFD codes for
parallel computers will be the main trend in the near future.
Future and Current Status of Parallel Computational Fluid Dynamics
Parallel Computational Fluid Dynamics is a growing area. It shows great
promise in dealing with future computational challenges.
These challenges are fundamental problems in science and engineering,
with broad applications, whose solutions would be enabled by
high-performance computing technology that could become
available in the near future.
Two such challenges in Computational Fluid Dynamics are:
1. Integrated multi-disciplinary simulation and design optimization of
aerospace vehicles throughout their mission profiles.
2. Multi-disciplinary modeling and data analysis of Earth and space
science physical phenomena.
In the high performance aircraft (HPA) area, the primary interest is to
develop the capability to predict the performance of next-generation
fighter concepts operating in the most critical portion of their flight
envelope. To achieve performance levels beyond those of present vehicles,
these next-generation fighter designs must include higher levels of system
integration than can be obtained with present design tools. Towards this
goal, aerodynamic, propulsion system, controls, structural, and even
acoustic analysis modules will be integrated into a single software
system.
The challenge posed by the development and application of such a
multi-disciplinary high performance aircraft analysis can only be met
through parallel computing.
References:
[1] A. Jameson, "Solution of the Euler Equations for Two-Dimensional
Transonic Flow by a Multigrid Method", Appl. Math. and Comp.
13:327-355 (1983).
[2] E. Barszcz, T. F. Chan, D. C. Jespersen, and R. S. Tuminaro,
"Performance of an Euler Code on Hypercubes", NAS Technical
Report RNR-89-005, April 1989.