V. YEAR 3 ACCOMPLISHMENTS

As has been noted above, the PET effort at the CEWES MSRC operates by providing core support to CEWES MSRC users, performing specific focused efforts designed to introduce new tools and computational technology into the CEWES MSRC, conducting training courses and workshops for CEWES MSRC users, and operating an HBCU/MI enhancement program. The major accomplishments of the CEWES MSRC PET effort in enhancing the programming environment at the CEWES MSRC are described in this section. The presentation here is organized by CTAs and technical infrastructure support areas, but there is much overlap in effort. Tools introduced into the CEWES MSRC in the course of Year 3 of the PET effort are listed in Table 4 and are described in Section VI. Specific CEWES MSRC codes impacted by the PET effort during Year 3 are listed in Table 5, and items of technology transfer into the CEWES MSRC are listed in Table 6. More detail on the Year 3 effort is given in the specific Focused Efforts described in the Appendix and, especially, in the CEWES MSRC PET Technical Reports and other publications from Year 3 that are abstracted in Section X. Training conducted during Year 3 is described in Section VII, and outreach to CEWES MSRC users is covered in Section VIII. The accomplishments in the HBCU/MI component of the CEWES MSRC PET effort are discussed in Section IX.

------------------------------------------------------------------------------

CFD: Computational Fluid Dynamics CTA (ERC-Mississippi State)

Year 3 efforts in support of the CFD CTA at CEWES MSRC were divided between core support and focused effort. Primary core support contributions were made by the On-Site Lead (Bova), with overall coordination assistance from the at-university team. One Focused Effort - Scalable and Parallel Integration of Hydraulic, Wave, and Sediment Transport Models - was funded during Year 3.

Through core support, the team provided training and user outreach via participation in Parallel Programming Workshops for Fortran Programmers (Bring Your Own Code: BYOC), MPI tutorials, the JSU Summer Institute, and direct CEWES MSRC user contacts. User support was provided to two CHSSI development teams (COBALT and OVERFLOW), as well as HPC service for individual CEWES MSRC users. In collaboration with the NRC Computational Migration Group at CEWES MSRC, support was provided to J. C. T. Wang of the Aerospace Corporation by analyzing the message flow in a section of his PVM code, resulting in a successful code port to the IBM SP system. Following PET consultation, David Medina of AF Phillips Lab implemented a graph-based reordering strategy within the MAGI solver. The objective is to improve cache performance and interprocessor data locality. Results show about a 30% reduction in execution time on two, four and sixteen processors compared to execution without reordering. Continued collaboration with Fernando Grinstein of NRL resulted in development of a parallel version of NSTURB3D capable of efficient execution on all CEWES MSRC computing platforms.

A dual-level parallel algorithm using both MPI and OpenMP was designed and implemented in the CGWAVE solver in support of Zeki Demirbilek of CEWES CHL. This resulted in a dramatic reduction in turnaround time: for the demonstration case, turnaround time was reduced from 2.1 days to 12 minutes using 256 SGI O2000 processors.
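To make the dual-level structure concrete, the sketch below (in C, using MPI for the outer level and OpenMP for the inner level) shows one common way such a scheme is organized: independent work units are distributed across MPI ranks, and the loop inside each unit is threaded with OpenMP. The decomposition, function names and problem sizes are illustrative assumptions only, not the actual CGWAVE implementation.

/* Hedged sketch of a dual-level (MPI + OpenMP) parallel structure.
 * Outer level: MPI ranks take independent work units (e.g. wave components).
 * Inner level: OpenMP threads split the loop within each unit.             */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define NCOMPONENTS 64      /* assumed number of independent work units */
#define NPOINTS     100000  /* assumed number of grid points per unit   */

static double solve_component(int comp)
{
    double *phi = malloc(NPOINTS * sizeof(double));
    double sum = 0.0;
    int i;
    if (!phi) return 0.0;

    /* Inner, shared-memory level: threads divide the grid loop. */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < NPOINTS; i++) {
        phi[i] = (double)(comp + 1) / (double)(i + 1);  /* stand-in for solver work */
        sum += phi[i];
    }
    free(phi);
    return sum;
}

int main(int argc, char **argv)
{
    int rank, size, comp;
    double local = 0.0, total = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Outer, distributed-memory level: cyclic distribution of work units. */
    for (comp = rank; comp < NCOMPONENTS; comp += size)
        local += solve_component(comp);

    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("aggregate result over %d components: %g\n", NCOMPONENTS, total);

    MPI_Finalize();
    return 0;
}

Because the two levels are independent, the same code can run MPI-only, OpenMP-only, or both, which is what makes this pattern attractive on the mix of CEWES MSRC platforms.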
This project, which also served as a testbed for the MPI_Connect Tool implemented by the Tennessee SPPT team in PET, was initiated through user interaction at one of the BYOC workshops and was selected as the "Most Effective Engineering Methodology" in the HPC Challenge at SC'98.

In Year 3, the Focused Effort - Scalable and Parallel Integration of Hydraulic, Wave and Sediment Transport - was initiated in collaboration with the PET CWO team at Ohio State. The purpose of this focused effort was to enhance national defense and security by developing a scalable, parallel implementation of a coupled wave (WAM), hydraulic (CH3D) and sediment transport (SED/COSED) simulation code. Once developed, this code will enable more accurate and efficient evaluation of DoD applications such as naval harbor access, wind/wave hazard forecasting, and coastal forecasts for amphibious operations, in addition to various civil works applications. Stronger coupling of these technologies through this project will yield a simulation capability with a more realistic representation of the interaction between surface and bottom shear stresses, improving the representation of the physical phenomena. Parallelization and HPC support will substantially reduce computational requirements for long-period simulations and allow more timely simulation of configurations of military interest.

Task responsibilities within this focused effort were distributed between the CFD and CWO teams. In Year 3, the primary contributions of the CFD team were the development of a parallel implementation of the CH3D solver, including the non-cohesive sediment model (SED), and computer science support on model integration. CEWES MSRC user constraints require that results from the parallel solver replicate results from the unmodified sequential solver to within machine accuracy, and that input and output files be compatible with the sequential versions. Consequently, the implemented parallelization strategy uses a three-stage approach: pre-processing and partitioning, parallel computation, and post-processing. During the pre-processing and partitioning phase, the grid is partitioned using a simple load-balancing technique to obtain roughly equal numbers of computational cells per processor, with separate grid files generated for each processor along with the other necessary input files. In the parallel computation phase, all processors perform the computation in parallel and exchange data at the processor boundaries. At the end of a simulation, each processor writes its part of the solution to disk. These individual files are later merged during the post-processing stage to generate output files suitable for visualization.

To verify the accuracy of the parallelization effort, simulations were performed on two different test cases: Lake Michigan and Old River. Verification of parallel CH3D/SED was done in close collaboration with the CWO team and CEWES personnel. The Lake Michigan case was tested on the SGI Origin 2000 and the Cray T3E for different numbers of processors, and the performance was analyzed. Significant reduction in execution times was observed up to 32 processors. The Old River case was tested for various numbers of processors, and a longer-duration simulation was performed. Details of the code scalability performance and verification are included in the description of this focused effort in the Appendix of this report.
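The sketch below illustrates the parallel-computation and output stages of a three-stage strategy of the kind described above: each processor advances its own slice of the grid, exchanges one layer of boundary (halo) values with its neighbors every step, and finally writes its own piece of the solution for merging in the post-processing stage. The one-dimensional slab decomposition, array names and sizes are illustrative assumptions, not the actual CH3D/SED data structures.

/* Hedged sketch: halo exchange in a 1-D slab decomposition. */
#include <mpi.h>
#include <stdio.h>

#define NLOCAL 1000   /* assumed number of interior cells per processor */
#define NSTEPS 10

int main(int argc, char **argv)
{
    double u[NLOCAL + 2];      /* interior cells plus one ghost cell per side */
    int rank, size, left, right, i, step;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    for (i = 0; i < NLOCAL + 2; i++)
        u[i] = rank;           /* stand-in for the partitioned input data */

    for (step = 0; step < NSTEPS; step++) {
        /* Exchange ghost-cell values with neighboring processors. */
        MPI_Sendrecv(&u[NLOCAL], 1, MPI_DOUBLE, right, 0,
                     &u[0],      1, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[1],          1, MPI_DOUBLE, left,  1,
                     &u[NLOCAL + 1], 1, MPI_DOUBLE, right, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Stand-in for the local hydrodynamic/sediment update. */
        for (i = 1; i <= NLOCAL; i++)
            u[i] = 0.5 * (u[i - 1] + u[i + 1]);
    }

    /* Each processor writes its own part of the solution; the pieces are
     * merged in the post-processing stage.                                 */
    printf("rank %d: u[1] = %g\n", rank, u[1]);
    MPI_Finalize();
    return 0;
}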
The impact of this work on CEWES MSRC is to significantly reduce the time required for high-resolution sediment transport simulations by making use of the scalable nature of the high performance computing facilities. This work was also essential for the development of a scalable coupled model of hydraulic, sediment, and wind-wave processes, which greatly affect military operations in the marine environment.

CSM: Computational Structural Mechanics CTA (TICAM-Texas & ERC-Mississippi State)

The principal goal of the TICAM CSM support in the CEWES MSRC PET effort is to introduce, implement and test adaptive grid techniques for analysis of CSM problems for DoD simulation of "violent events". Representative DoD codes include CTH and EPIC. This work requires consideration of the following: (1) appropriate a posteriori error estimates and corresponding indicators for these problem classes; (2) strategies for h-, p- and hp-adaption of the grid; (3) complications in implementing the adaptive schemes related to the form of "legacy code"; (4) development and incorporation of appropriate data structures for adaptive refinement and coarsening; (5) efficiency in implementation; and (6) parallel, scalable HPC needs. CSM core support and focused effort activities on the project have been targeted to these goals, and we have made significant progress on several fronts:

(1) We have developed a block adaptive strategy in collaboration with Sandia and have carried out some preliminary simulation studies. This work has been implemented as an extension of the CTH code. Results of a parallel adaptive simulation carried out on the ASCI Red supercomputer system at Sandia are shown in Figure 1.

Figure 1. Sample calculation of a 3D impact of a copper ball onto a steel plate, showing (a) active blocks and (b) materials. There are 61 million active cells in this calculation, compared to 200 million for a uniform grid without adaptive refinement.

Here the test problem corresponds to hypervelocity impact of a spherical copper projectile with a target. The number of cells in the adaptive simulation is approximately one-third that of a comparable uniform grid simulation that yields the desired fine mesh resolution. As part of the work we are also developing new error and feature indicators to guide refinement and assess computational reliability.

(2) We have developed a local simplex refinement strategy and implemented it in the test code EPIC (Elasto Plastic Impact Computation). As the simulation proceeds, elements are locally refined based on element feature and error indicators. Preliminary numerical tests have been carried out for the "Taylor Anvil" problem to assess the utility of the approach. Results are presented in Figure 2, showing the mesh after several local adaptive refinement steps. Related work on space-filling curves for mesh partitioning and on grid quality is being carried out.

Figure 2: Adapted mesh after several refinement steps

Both of the above studies are singular accomplishments since they are, we believe, the first adaptive refinement calculations for this class of applications codes. They therefore constitute a strong statement concerning the main program goal - to introduce and transfer advanced adaptive grid strategies from the university and research labs to the DoD application groups. This work has been closely coordinated with our core support activities.
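As background for the adaptive strategies described above, the conceptual sketch below shows the basic indicator-driven adaption cycle: compute an error or feature indicator for each element, flag elements above a refinement threshold or below a coarsening threshold, and then rebuild the mesh and data structures accordingly. The indicator, thresholds and data layout are illustrative assumptions and do not represent the actual CTH or EPIC implementations.

/* Hedged, conceptual sketch of indicator-driven h-adaption. */
#include <math.h>
#include <stdio.h>

#define NELEM 8

enum action { KEEP = 0, REFINE = 1, COARSEN = 2 };

/* Stand-in error indicator: the jump in a field value across an element. */
static double error_indicator(const double *field, int e)
{
    double jump = (e + 1 < NELEM) ? field[e + 1] - field[e] : 0.0;
    return fabs(jump);
}

int main(void)
{
    double field[NELEM] = { 0.0, 0.1, 0.1, 0.2, 1.5, 1.6, 1.6, 1.6 };
    enum action flag[NELEM];
    double refine_tol = 0.5, coarsen_tol = 0.05;
    int e;

    for (e = 0; e < NELEM; e++) {
        double eta = error_indicator(field, e);
        if (eta > refine_tol)
            flag[e] = REFINE;          /* large local error: subdivide element */
        else if (eta < coarsen_tol)
            flag[e] = COARSEN;         /* candidate for de-refinement */
        else
            flag[e] = KEEP;
        printf("element %d: indicator %.3f -> action %d\n", e, eta, (int)flag[e]);
    }
    /* A production code would now rebuild the mesh and data structures for
     * the flagged elements and rebalance the work across processors.        */
    return 0;
}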
We gave a related short course on Finite Element Analysis of Nonlinear Problems at the DoD HPCMP Users Group Meeting at Rice University in June 1998, interacted closely with the CSM On-Site Lead, Rick Weed, lectured on adaptive techniques and multi-scale modeling at the CSM workshop held at CEWES in November 1998, discussed results at the CEWES MSRC PET Annual Review in February 1999, and organized a workshop on Adaptive Techniques, Error Indicators and Applications held at Texas in March 1999. In addition, we have worked closely with Sandia researchers and made three working trips to Sandia to foster the collaboration and expedite the work. This core support and the focused efforts constitute a cohesive approach that includes the Texas CSM team, the Army Institute for Advanced Technology (at Texas) activities, Sandia applications and software development, and CEWES MSRC applications using CTH and EPIC. The work has a strong impact on DoD applications and CEWES MSRC since it promises to change the entire approach to solving these problems in a timely and more reliable way.

CWO: Climate/Weather/Ocean Modeling CTA (Ohio State)

The rationale for the CWO team in the CEWES MSRC PET effort, and the evolution of its effort over the last three years, continues to be based on the following three points.

First, and most importantly, DoD has an extremely urgent need for timely (forecasted) information about hazardous and mitigating conditions in its theaters of interest. The primary classes of problems encountered include: the prediction of extreme wave conditions for fleet operations both offshore and in nearshore regions; the maintenance of port and harbor navigability and associated inlet conditions; the prediction of ocean sediment storms in the North Atlantic for finding enemy submarines or providing camouflage for our submarines; improved prediction of acoustic waveguides; and integrated systems for predicting subsurface mine burial mechanics and weapons retrieval. One land-based extension of these activities is the prediction of sandstorms for desert troop operations.

Second, these problem areas have several common threads. The hazardous conditions are extreme in magnitude and infrequent in occurrence, are forced by associated atmospheric events, result from three-dimensional turbulent fluid dynamics with a very broadband spectrum of nonlinearly interacting processes, give rise to sediment resuspension and transport, and most often occur in regions of shallow water or irregular coastline geometry.

Third, the present modeling strategies which underlie the forecasting procedures for these problems consist of isolated models of the various processes which are not well linked and, therefore, not able to resolve the most extreme and, consequently, the most important information that DoD needs from the forecasting programs.

Therefore, the basis of the CWO program in PET at CEWES MSRC is to implement the new model structures and couplings necessary to allow the prediction of these extreme, rapid but infrequent events, especially in the nearshore, embayment, and continental shelf environments. The increased scope and size of these new Integrated Coastal Modeling Systems (ICOMS) cannot be achieved without the use of highly parallelized software executed on highly parallel machines.
During the first two years, we have been able to proceed on two fronts: the first being the parallelization and improvement of the individual codes that are to be integrated into ICOMS, and the second being the creation of the model physics, structural, and software improvements, as well as the message passing structures, necessary to begin full coupling of wave, circulation and sediment transport codes for these integrated predictions. The individual code improvements in Year 3 centered on the completion of the parallelization and performance benchmarking of the WAM code for shallow water wave prediction, and the SED code (in collaboration with the CFD team at MSU) for noncohesive sediment resuspension and transport calculations. These activities are complete. Coupling improvements completed in Year 3 included implementation of the unsteady current refraction and radiation stress physics necessary to couple the WAM wave and CH3D circulation models. The SED model was also fully coupled except for the bottom boundary conditions.

In Year 3, we have completed the coupling of the wave, circulation and noncohesive sediment transport codes and performed preliminary verification on Lake Michigan, for which an extensive data set exists for assessment. The coupling yielded results that are significantly improved in the nearshore zone, and the tests for an extremely challenging storm, wave and sediment plume event in Lake Michigan were quite satisfactory. Tasks remaining for Year 4 include full atmospheric model coupling, the completion of the sediment bottom boundary layer conditions, and the application of ICOMS to an area of DoD interest such as the Gulf of Kuwait.

EQM: Environmental Quality Modeling CTA (TICAM - Texas)

In Year 3, the EQM team's core and focused efforts involved continuing parallel EQM code migration, developing new algorithms and software for coupling hydrodynamic flow and transport simulators, developing new software for launching parallel codes from the Web, and providing user training through frequent personal contact, workshops, and conferences.

The EQM team continued its efforts in the parallel migration of CE-QUAL-ICM. In Year 2, parallel version 1.0 of CE-QUAL-ICM was developed. In Year 3, a number of improvements were added to the code, resulting in version 2.0. In particular, the code is now able to run on any number of processors (previously it was limited to 32 processors due to the way I/O was being handled) and the overall file I/O was improved. The mesh partitioning strategy was also improved through the use of PARMETIS, a parallel mesh partitioning code developed at the Army High Performance Computing Research Center (AHPCRC) at Minnesota. Overall, version 2.0 is roughly 25% faster than version 1.0. During Year 3, CEWES personnel Carl Cerco, Mark Noel and Barry Bunch have run 106 10-year Chesapeake Bay scenarios. In the same amount of wall clock time, they would have been able to run only 11 scenarios with the previous serial version. Thus, the parallel code has resulted in an order-of-magnitude increase in computing capability and is now being used for production simulation.

During Year 3, the EQM team migrated the finite element hydrodynamics code ADCIRC (Advanced Circulation Model) to the CEWES MSRC parallel platforms. The same general parallelization approach that was used to parallelize CE-QUAL-ICM was employed. In particular, a preprocessing code was created which splits the global finite element mesh and global data into local data sets for each processor.
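A minimal sketch of this splitting step is shown below: given a partition array assigning each global node to a processor (produced, for example, by a mesh partitioner or a space-filling-curve ordering), the preprocessor writes one local data set per processor, retaining global node numbers so that the pieces can be merged afterwards. The file names, format and sizes are illustrative assumptions, not the actual ADCIRC or CE-QUAL-ICM file layout.

/* Hedged sketch of splitting a global mesh into per-processor data sets. */
#include <stdio.h>

#define NNODES 12
#define NPROCS 3

int main(void)
{
    double x[NNODES], y[NNODES];
    int part[NNODES], n, p;

    for (n = 0; n < NNODES; n++) {        /* stand-in global mesh and partition */
        x[n] = n;
        y[n] = 2.0 * n;
        part[n] = n % NPROCS;
    }

    for (p = 0; p < NPROCS; p++) {        /* one local input file per processor */
        char name[64];
        FILE *fp;
        sprintf(name, "mesh.%03d", p);
        fp = fopen(name, "w");
        if (!fp) return 1;
        for (n = 0; n < NNODES; n++)
            if (part[n] == p)             /* keep global node number for merging */
                fprintf(fp, "%d %f %f\n", n, x[n], y[n]);
        fclose(fp);
    }
    return 0;
}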
To generate the partition itself, mesh partitioning algorithms based on space-filling curves and the PARMETIS code were investigated. The latter code greatly improved the parallel performance for large numbers of processors. The parallel code was tested on several large data sets. For one data set with 271,240 nodes, a speedup of 169 on 256 processors of the Cray T3E was obtained. Speedups on smaller numbers of processors were almost linear in the number of processors. Improved versions of ADCIRC including wetting/drying and three-dimensional effects are under development, and the EQM team continues to consult with CEWES MSRC users on the parallelization of the code.

The parallelization strategy used to migrate CE-QUAL-ICM and ADCIRC has also been applied by CEWES EQM personnel to the migration of CE-QUAL-ICM/TOXI (Mark Dortch and Terry Gerald) and FEMWATER (Fred Tracy). The EQM team provided some user support for the migration of the TOXI water quality model. Tracy was able to parallelize FEMWATER on his own after attending our Workshop on Parallel Technologies held in January of 1998, patterning his approach after ours. Tracy reported on his results at the DoD HPCMP Users Group Meeting at Rice in June 1998.

One of the EQM team's major efforts in Year 3 was the development of a projection algorithm and software for coupling hydrodynamics and water quality models. In this coupling, the major issue is providing a mass-conservative velocity field suitable for the finite volume method used in CE-QUAL-ICM. We developed and implemented such an algorithm in the code UTPROJ. Theoretical error estimates for the methodology were derived which show that the projected velocities, besides being mass conservative, are at least as accurate as the original velocities produced by the hydrodynamics code. A two-dimensional code was first developed to show proof of concept. Simultaneously, a similar two-dimensional code, PROFEM, was developed at CEWES. Comparisons between UTPROJ and PROFEM led to changes in the PROFEM solver, speeding it up by an order of magnitude. We extended UTPROJ to three dimensions during Year 3 and have tested it on some synthetic data sets. The three-dimensional code is very general, allowing for combinations of tetrahedral, prismatic and hexahedral elements. To make it a truly usable code, further testing, parallelization, and improvements to the linear solver are planned for Year 4.

A prototypical demonstration of the coupling capability between ADCIRC and CE-QUAL-ICM using UTPROJ was performed during Year 3. Circulation in Galveston Bay was first simulated using the parallel ADCIRC code. The output from this code (elevation, velocities) for each transport time step was projected onto the same grid. Movement of a contaminant was then simulated using a transport code similar to CE-QUAL-ICM that we developed. A scenario whereby a point source of contaminant is released into the bay was simulated. At present, the links between flow, projection and transport are primitive. During Year 4, we hope to vastly improve the coupling process, so that it is seamless to an EQM CEWES MSRC user.

In conjunction with our activities related to water quality modeling, the EQM team investigated and implemented second-order accurate transport schemes suitable for unstructured grids. This is motivated by the need to extend the second-order QUICKEST scheme currently used in CE-QUAL-ICM to unstructured, three-dimensional grids for planned future CEWES applications.
A two-dimensional code has been developed on triangular grids, and a number of different algorithms are being tested in this framework.

The EQM team also had a focused effort on Web launching in Year 3. The launching capability was demonstrated using the PARSSIM subsurface transport simulator, a parallel code developed at Texas. A client Java applet with a GUI (graphical user interface) and three-dimensional visualization capabilities was created that allows remote users to access the PARSSIM code and data domains on CEWES MSRC servers. The results of the computation are saved on the CEWES MSRC local disks and can be selectively retrieved from Java visualization applets. The Java applet can be instantiated from any Internet Web browser. As a first prototype of launching, we created tools appropriate for launching PARSSIM and/or a visualization server selectively on CEWES MSRC server machines. An outstanding task is to incorporate GLOBUS support for metacomputing on multiple workstations.

Other activities in user support, training and outreach occurred during the year. The EQM team gave a joint presentation with Mark Noel from CEWES at the DoD HPCMP Users Group Meeting at Rice in June 1998. A workshop on Parallel Computation and EQM Modeling was held at Jackson State in June 1998. A workshop on Parallel Computing Technologies was held at CEWES MSRC in January 1999. Moreover, we have had frequent user contact with CEWES EQM personnel and CEWES MSRC users, including Carl Cerco, Mark Noel, Barry Bunch, Charlie Berger, Mark Dortch and Terry Gerald. Finally, CEWES MSRC PET technical reports detailing our accomplishments have been written, and several conference presentations were given during the year on our work.

FMS: Forces Modeling and Simulation/C4I CTA (NPAC - Syracuse)

The Year 3 FMS effort in PET at CEWES MSRC can be divided into two distinct but related thrusts. The major effort continued the core development of our WebHLA framework, and we also developed and delivered demonstrations of its use with specific FMS applications, such as parallel CMS. Another effort used some of the tools and technologies that underlie WebHLA (i.e. WebFlow, CORBA) to provide an environment which facilitates the integration of multiple simulation applications under Web-based user interfaces.

WebHLA Activities

WebHLA follows the 3-tier architecture of our Pragmatic Object Web, with a mesh of JWORB (Java Web Object Request Broker)-based middleware servers managing back-end simulation modules and offering Web portal style interactive multi-user front-ends. JWORB is a multi-protocol server capable of managing objects conforming to various distributed object models, including CORBA, Java, COM and XML. HLA is supported via Object Web RTI (OWRTI), i.e. a Java/CORBA-based implementation of DMSO RTI 1.3, packaged as a JWORB service. Distributed objects in any of the popular commodity models can be naturally grouped within the WebHLA framework as HLA federates, and they can naturally communicate by exchanging (via JWORB-based RTI) XML-ized events or messages packaged as suitable FOM interactions. HLA-compliant M&S systems can now be integrated in WebHLA by porting legacy codes (typically written in C/C++) to suitable HPC platforms, wrapping such codes as WebHLA federates using the cross-language (Java/C++) RTICup API, and using them as plug-and-play components on the JWORB/OWRTI software bus.
For previous-generation simulations following the DIS (or ALSP) model, suitable bridges to the HLA/RTI communication domain are now also available in WebHLA, packaged as utility federates. To facilitate experiments with CPU-intensive HPC simulation modules, we developed database tools such as an event logger, an event database manager and an event playback federate that allow us to save entire simulation segments and replay them later for analysis, demonstration or review purposes. Finally, we also constructed SimVis, a commodity (DirectX on NT) graphics-based battlefield visualizer federate that offers a real-time interactive 3D front-end for typical DIS=>HLA entity level (e.g. ModSAF style) simulations.

In parallel with prototyping core WebHLA technologies such as JWORB or OWRTI, we are also analyzing selected advanced M&S modules such as the CMS (Comprehensive Mine Simulator) system developed at Ft. Belvoir VA, which simulates mines, mine fields, minefield components, standalone detection systems and countermine systems including ASTAMIDS, SMB and MMCM. The system can be viewed as a virtual T&E tool to facilitate R&D in the area of new countermine systems and detection technologies of relevance both to the Army and the Navy. We recently constructed a parallel port of the system to the Origin2000, where it was packaged so that it can be used either as a DIS node or as an HLA federate.

Based on the analysis of the sequential CMS code, we found the semi-automatic, compiler-directive-based approach to be the most practical parallelization technique to start with in our case. The most CPU-intensive inner loop of the CMS simulation runs over all mines in the system, and it is activated in response to each new entity state PDU to check whether there is a match between the current vehicle and mine coordinates that could lead to a mine detonation. Using directives such as "pragma parallel" and "pragma pfor" we managed to partition the mine-vehicle tracking workload over the available processors, and we achieved a linear speedup up to four processors. For large multiprocessor configurations, the efficiency of our pragma-based parallelization scheme deteriorates due to the NUMA memory model and increased contention on the internal network. This makes efficient use of cache critically important to obtaining scalable performance. We were unable to enforce a cache-efficient data decomposition using pragmas, probably due to the rather complex object-oriented and irregular code in the CMS inner loop. We are currently rebuilding and simplifying the inner loop code so that the associated memory layout of objects is more regular and hence predictable.

Playing the real scenario over and over again for testing and analysis is a time-consuming and tedious effort. A database of the equivalent PDU stream is a good solution for selectively playing back segments of a once-recorded scenario. For a prototype version of such a PDU database (PDUDB) we used Microsoft's Access database and Java servlets for loading as well as retrieving the data from the database using JDBC. The PDU logger servlet receives its input via an HTTP POST message in the form of XML-encoded PDU sequences. The input stream is decoded, converted to SQL and stored in the database using JDBC. The playback is done using another servlet that sends the PDUs generated from the database as a result of a query. The servlet is activated by accessing it from a Web browser.
Currently the queries are made on timestamps, which are used to determine the frequency with which PDUs are to be sent, but arbitrary queries can be made on the database to retrieve other information. The servlet can send the PDUs either in DIS mode or in HLA mode.

In our Pragmatic Object Web approach, we integrate CORBA-, Java-, COM- and WOM-based distributed object technologies. We view CORBA and Java as most suitable for the middleware and back-end, whereas COM is the leading candidate for interactive front-ends due to Microsoft's dominance of the desktop market. Of particular interest to the M&S community is the COM package DirectX, which offers a multimedia API for developing powerful graphics, sound and network play applications, based on a consistent interface to devices across different hardware platforms. Using DirectX/Direct3D technology, we constructed a real-time battlefield visualizer, SimVis, that can operate in both the DIS and HLA modes. SimVis is a Windows NT application written in Visual C++ using the DirectX/Direct3D API. The ModSAF terrain is the input for a sampler program, which provides vertices, colors, and texture information for each of the faces. After the terrain is constructed, it is added as a visual object to the scenario scene. Geometry objects and animation sets for typical battlefield entities such as armored vehicles (tanks) and visual events such as explosions were developed using the 3D Studio MAX authoring system. SimVis visual interactive controls include base navigation support, choice of rendering modes, and a variety of scene views.

The components described above have been combined into a demonstration that features an HPC parallel application (CMS) interacting with other simulation components running on geographically distributed computing resources. A typical configuration involves a parallel CMS federate running on Origin2000s at the CEWES and ARL MSRCs and other modules (ModSAF; JDIS, the Java-based DIS/HLA bridge; PDUDB; SimVis) running on Syracuse SGI and NT workstations. Our initial results are quite encouraging, and we therefore believe that WebHLA will evolve toward a powerful modeling and simulation framework, capable of addressing new challenges of DoD and commodity computing in many areas that require federation of multiple resources and collaborative Web-based access, such as Simulation Based Design and Acquisition.

WebFlow Activities

The value of high performance commodity computing (HPcc) is not limited to those applications which require HLA compatibility. It is a new type of infrastructure which gives the user access to a full range of commercial capabilities (i.e. databases, compute servers), pervasive access from all platforms, and a natural incremental path for enhancement as the computer industry juggernaut continues to deliver software systems of rapidly increasing power. The WebFlow system, which also underlies the WebHLA system described above, is an HPcc system designed to provide seamless access to remote resources through the use of CORBA-based middleware. During Year 3, Syracuse researchers have worked on using this tool to help couple together existing application codes and provide them with Web-based user interfaces in such a way that the user interface can run locally to the user, perhaps in the field, while the applications run on one or more "back-end" computational resources elsewhere on the network.
The principal example developed during the year is the Land Management System (LMS), the development of which is led by CEWES. The primary computational codes in LMS are EDYS, which simulates vegetation growth, including disturbances (such as military exercises), and CASC2D, which simulates watersheds and the effects of rainfall. They interact with a third module, the Watershed Management System (WMS), which provides input processing and output visualization services to both codes. WMS requires a variety of input information, including digital elevation models, land use data, and details of the soils and vegetation. Once the user defines the simulation boundaries, the WebFlow system is used to extract the relevant information from databases, both local and Internet-accessible databases run by, for example, the U.S. Geological Survey. WMS processes the database extracts into input files appropriate for EDYS and CASC2D, and WebFlow then launches the simulation. Because of the construction of the simulation codes, CASC2D is invoked just once for the duration of the simulation and "paused" while the vegetation simulation is being run, while EDYS is invoked anew for each segment of the simulation. This, along with the interconnection of the data streams for the two codes, is handled by WebFlow. It is also worth noting that the two codes happen to be run on separate platforms, in accord with their separate resource requirements. When the simulation is complete, WMS can be used through the Web-based interface to visualize the results. The culmination of this project was a class, conducted by Tom Haupt at the CEWES MSRC TEF, which discussed the tools and technologies used here and how they can be applied to other problems. This is yet another highly successful demonstration of the value and generality of the HPcc approach, and we look forward to helping users make further use of this software development model.

C/C: Collaboration and Communications (NPAC - Syracuse)

NPAC's efforts in the C/C area as a part of the CEWES MSRC PET effort encompass not just Collaboration and Communication, per se, but also the closely related area of tools and technologies for training and education. The NPAC team is also involved in analogous support activities in the ARL MSRC and ASC MSRC PET programs, as well as having a modest involvement in distance training and education activities in conjunction with the NAVO MSRC PET program. These additional connections provide a great deal of synergy with C/C activities sponsored by the CEWES MSRC. The focus of our work has been to understand the needs and requirements for "collaboration and training" tools in the context of PET activities, and to work towards their deployment. In this sense, we are strongly involved with both the rest of the PET team and the MSRC user community. At CEWES MSRC, our work has focused primarily on support for electronic training, and especially on synchronous tools for collaboration and training - support for "live" two-way interactions over the network. When combined with our activities for the other PET programs, this work becomes part of a unified whole, covering both synchronous and asynchronous tools for information dissemination, collaboration, and training. During Years 2 and 3, NPAC has worked extensively with PET partner Jackson State University (JSU) on experiments in network-based distance education.
Using NPAC's Tango Interactive collaboratory framework, a series of semester-length academic credit courses has been taught by Syracuse-based instructors to students at JSU. The experience gained from these efforts has been critical in understanding both the technical and sociological factors surrounding distance education. Tango Interactive is a framework that allows sharing of applications across the Internet. It includes a suite of tools useful for basic collaboration and distance learning activities: shared Web browser, chat, whiteboard, audio/video conferencing, shared text editor, etc. It also provides an application program interface (API) that allows other applications to be hooked into the Tango framework. Tools of this sort are relatively new, and even the most computer-savvy have little or no experience with them as yet. Consequently, it became clear rather early on that for these tools to gain acceptance in either collaborative or educational applications, they must be deployed in a staged fashion, starting from well-structured environments (i.e. classroom-style educational use) and working towards less structured environments (i.e. general research collaboration).

During Year 3, NPAC collaborated with the Ohio Supercomputer Center (OSC), another CEWES MSRC PET partner, to transition the tools and experience from the JSU education efforts into the PET training arena. OSC provided instructors and course content, while NPAC worked with OSC, CEWES MSRC, and other recipient centers to provide support for the delivery tools (Tango Interactive). This effort expanded our experience to include instructors previously unfamiliar with the delivery tools; while it retained the structured instructor/student relationship, the compressed delivery (over a few days rather than many weeks) introduced additional requirements on the robustness of the system. This highly successful effort has led the way to what will become widespread and fairly routine use of distance training across the PET program in the coming year. It has also led to increased interest in the use of the Tango collaboratory in less structured environments, such as some kinds of meetings and research collaborations. This has given us a number of exploratory groups to work with as we begin the staged deployment of network collaboration tools into these progressively less structured situations. This and other work at CEWES MSRC, combined with related activities at the other MSRCs in the collaboration and training area, has led us to develop an integrated picture of the current state of the art in both synchronous and asynchronous collaboration technology and of how we believe these technologies can be used effectively within the context of the PET program.

SPPT: Scalable Parallel Programming Tools (CRPC - Rice/Tennessee)

The primary goal of SPP Tools in the CEWES MSRC PET effort is to promote a uniform, high-level, easy-to-use programming environment available to all CEWES MSRC users and, ultimately, to all of DoD. We view "programming" very broadly, encompassing any means of describing a sequence of executable actions. This includes writing traditional CFD simulations as well as developing quick-and-dirty filters to scan output files. Similarly, we view "tools" as including any software that eases the task of programming.
This includes tools, such as compilers, that make programming possible at all; other tools, such as libraries, that provide easier-to-use facilities for programming; and some tools, such as performance profilers, that help users understand their programs. SPP Tools thus cuts across disciplines: if we are successful, all CTAs will benefit from the new tools. This is a long-term goal that will not be met by PET activities alone. In order to provide a complete set of tools, new resources will be needed. We therefore put great emphasis on collaboration with CEWES MSRC users (to understand the requirements for new tools), with other PET team members (to disseminate and test the tools), and with groups outside of DoD (to identify and develop the tools).

Rice

The major parts of the Rice effort for Year 3 concern work accomplished by our SPPT On-Site Lead, Clay Breshears, in organizing the DoD HPCMP Users Group Meeting held at Rice in June 1998, completing various focused projects, and giving five tutorials. Major impact came from the On-Site Lead working with users at CEWES MSRC and other DoD sites. Many users were provided with critical help on upgrading their application codes or on other programming-related issues.

The DoD HPCMP Users Group Meeting was held at Rice in June 1998, with over 350 attendees. This required a large effort from the SPP Tools Lead at Rice University and the Rice support staff. Ken Kennedy gave the keynote address at the meeting.

Year 3 project work included: Scalar Optimization for the CFD code HELIX; Use of OpenMP in HELIX; and an evaluation of CAPTools, a tool suite used for the upgrade or development of parallel application codes. These projects are summarized in the Appendix, and more detail appears in CEWES MSRC PET technical reports.

The Scalar Optimization project had significant merit. Its findings resulted in the HELIX code being at least twice as fast, with no other changes required. The code was transformed from its original form to an equivalent form for which compiler optimization is effective. The key idea is to use intermediate scalar stores. This allowed the back-end code to use internal or working registers for intermediate storage. The technique reduced the latency costs associated with storing to ordinary memory, allowing temporary stores to occur in the high-speed registers. We therefore recommend the technique to CEWES MSRC users.

The OpenMP Evaluation project had merit, but it appears as a negative result. The HELIX code was profiled to find where the time is spent; more than 94% occurs in a single routine. This routine was a likely candidate to benefit from installing OpenMP directives and using related compiler options and environment variables at run time. This is an error-prone and tedious process. Further, the reductions in running time were marginal, and similar reductions can be routinely obtained in a variety of other ways. It is possible that the library of support software for the processed OpenMP or back-end code has an error or bug. Given the current state of OpenMP, we recommend that users rely on proven technologies for parallel programming, including message passing, threads and tuple-space methods.
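The fragment below illustrates the "intermediate scalar stores" idea behind the Scalar Optimization project (shown in C for brevity; HELIX itself is a Fortran code, and this loop body is a made-up stand-in). The point is only that values reused within an iteration are copied once into local scalars, which the back end can keep in registers instead of repeatedly reloading from, and storing to, memory.

/* Original form: q[i], dq[i] and r[i] are referenced repeatedly.  Because the
 * compiler generally cannot prove the arrays do not overlap, each store to
 * q[i] forces the other references to be reloaded from memory.              */
void update_orig(int n, double *q, const double *dq, const double *r)
{
    int i;
    for (i = 0; i < n; i++) {
        q[i] = q[i] + 0.5 * dq[i];
        q[i] = q[i] * r[i] + dq[i] * dq[i];
        q[i] = q[i] - 0.25 * dq[i] * r[i];
    }
}

/* Transformed form: intermediate scalar stores let the compiler carry the
 * working values in registers and write q[i] only once per iteration.       */
void update_opt(int n, double *q, const double *dq, const double *r)
{
    int i;
    for (i = 0; i < n; i++) {
        double dqi = dq[i];
        double ri  = r[i];
        double qi  = q[i] + 0.5 * dqi;
        qi = qi * ri + dqi * dqi;
        qi = qi - 0.25 * dqi * ri;
        q[i] = qi;
    }
}

Both forms compute the same result; only the second gives the compiler the freedom to keep the intermediate values in high-speed registers, which is the effect described above.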
Programming with CAPTools involved feeding the sequential implementation of a model nonlinear 2D elliptic PDE code to the CAPTools interactive parallelization system, and guiding the source-to-source code transformation by responding to various queries about quantities available only at runtime. An important issue with this software is its significant licensing fee, which limits its availability to a few machines at CEWES MSRC. Finally, Rice contributed two sets of PowerPoint slides, given as tutorials, to the production of the CD-ROM prepared for CEWES MSRC PET by Syracuse.

Tennessee

The Tennessee Year 3 SPPT effort focused on meeting the programming tool needs of a variety of DoD application developers and CEWES MSRC users, including those associated with code migration projects, CHSSI and DoD Challenge projects, and an SC'98 HPC Challenge entry. Our approach has been to ensure that appropriate tools are installed and working properly on CEWES MSRC systems, to make information about installed tools available to as wide an audience as possible, and to provide one-on-one assistance to users as needed.

All computational technology areas require robust, easy-to-use debugging and performance analysis tools. Since many DoD users develop and/or run applications on multiple MSRC platforms, it is highly desirable to have the same tools available across platforms so that users receive the maximum benefit from the time spent learning to use a tool. Large-scale DoD applications often push tool capabilities to their limits, revealing scalability limitations and sometimes even breaking the tools. Thus, a major focus of our Year 3 effort was to provide users with reliable and effective cross-platform debugging and performance analysis tools and to work with tool developers and vendors to address scalability limitations of current tool technology.

To enable CEWES MSRC users to find out about and learn to use appropriate tools, we have put together a Web-based Scalable Parallel Programming (SPP) Tools repository at http://www.nhse.org/rib/repositories/cewes_spp_tools/. This repository contains a listing of tools being made available and/or supported as part of our PET efforts. In addition to debugging and performance analysis tools, information is available about high performance math libraries, parallel languages and compilers, and parallel I/O systems. The repository includes a software deployment matrix that provides a concise view of which tools are installed on which platforms. By clicking on an entry for a particular tool and a particular platform, the user can access site-specific usage information and tips as well as Web-based tutorials and quick-start guides. The Repository in a Box (RIB) toolkit that was used to create the SPP Tools repository is available for CTAs and other PET support areas to use to make their software and tools more accessible and useful to users.

Debuggers listed in the SPP Tools repository include the cross-platform TotalView multiprocess and multithread debugger, as well as platform-specific debuggers. Although we highly recommend TotalView, since it has excellent capabilities and is available on all CEWES MSRC platforms, information is included about platform-specific debuggers in case their special features are needed. Performance analysis tools listed include the Vampir cross-platform performance analysis tool, as well as platform-specific tools.
We have worked closely with CEWES MSRC systems staff to ensure that the tools are properly installed and tested on CEWES MSRC platforms and to report any bugs to the tool developers. When requested, we have worked one-on-one with CHSSI and DoD Challenge application developers to make effective use of performance analysis tools to improve the performance of their codes. These contacts are detailed in Table 3. For the SC'98 HPC Challenge entry from CEWES involving the CGWAVE harbor response simulation code, we worked closely with the CEWES MSRC team to develop and debug a tool called MPI_Connect for MPI intercommunication between application components running on different machines at different MSRCs, so as to achieve multiple levels of parallelism and help reduce the runtime for CGWAVE from months to days. The MPI_Connect system is being made available for general use by DoD users who have similar intercommunication and metacomputing needs.

Developers of several large-scale DoD applications, for example the Mach 3 code used in the Radio Frequency Weapons DoD Challenge project at AF Phillips Lab, have run into scalability problems when trying to use current performance analysis tool technology. Trace-based techniques can produce extremely large and unwieldy trace files which are difficult or impossible to store, transfer, and analyze. For long-running simulations it is often desirable to turn tracing on and off during execution, or to decide what further performance data to collect based on an analysis of data already collected during the same run. To address these issues, we are experimenting with dynamic instrumentation techniques and with virtual reality immersion to enable scalable performance analysis of large DoD applications. Using dynamic instrumentation, the user can monitor application execution during runtime, decide what data to collect, and turn data collection on and off as desired. We are initially using dynamic instrumentation to turn Vampir tracing on and off under user control, and to dynamically select, start, and read the values of hardware performance counters.

To address the problem of scalable visualization of large trace files, we have begun experimenting with 3-D virtual reality immersion using the Virtue system developed at the University of Illinois. By using Virtue on an ImmersaDesk or a graphics workstation, the user can view state changes and communication behavior for the tasks making up a parallel execution in the form of a 3-D time tunnel display that can be interactively manipulated and explored. The dynamic call tree for a process can be viewed as a 3-D tree, with the sizes and colors of nodes representing attributes such as duration and number of calls. The amount of data that can be visualized at one time using 3-D virtual immersion is one to two orders of magnitude more than with a 2-D display. Virtue also has multimedia tools that allow remote collaborators to view and manipulate the same Virtue display (although scaled down and with fewer navigation capabilities) from their desktop workstations. We have successfully installed Virtue on the CEWES MSRC Scientific Visualization Center systems. We have written a converter program that translates Vampirtrace files into the data format understood by Virtue, and we have begun experimenting with Virtue to display large Vampirtrace files produced by CEWES MSRC applications. Our next step will be to enable scalable real-time performance visualization by linking dynamic instrumentation with Virtue.
Performance data captured in real time will be sent to the Virtue system, where the user will be able to visualize the data. Our long-range goal is to allow the user to interact with and modify or steer a running application through the Virtue/dynamic instrumentation interface.

SV: Scientific Visualization (NCSA-Illinois & ERC-Mississippi State)

During Year 3, we continued an active dialog with CEWES MSRC users and visualization staff, provided information about emerging developments in graphics and visualization technology, assisted with visualization production, and conducted a number of training sessions. We also transferred a number of specific tools to CEWES MSRC users and provided training in their use. These tools are having direct impact on these users' ability to do their work. These activities are consistent with the 5-Year Strategic Plan developed for PET visualization, which calls for initiatives in (1) evaluating and extending coprocessing environments, (2) identifying strategies for management of very large data sets, and (3) defining user requirements for collaborative visualization.

Several application-specific tools were transferred to CEWES MSRC users. These include CbayVisGen, CbayTransport, ISTV, DIVA, and volDG. CbayVisGen is a visualization tool specially designed to support the visualization needs of Carl Cerco and his EQM team at CEWES. This group is investigating long-term phenomena in Chesapeake Bay, with results being provided to the EPA. CbayVisGen used existing visualization libraries to build a customized tool for visualizing the hydrodynamics and nutrient transport activity over 10-year and 20-year time periods. Cerco's work has moved to full production runs, and he has frequent need to share his results with his EPA science monitor. CbayVisGen was therefore customized to include easy image and movie capture; these items can be easily transferred to a Web page for sharing with his colleagues. A follow-on tool, CbayTransport, experimented with alternatives for visualizing the transport flux data. Cerco had no mechanism for viewing this part of his data, so this tool has added new and needed capability.

The DIVA and ISTV visualization tools, developed at MSU's ERC, were used to support visualization for the CWO user community of CEWES MSRC. Both tools were put to an initial task of generating stills and movies, in an effort to assess the applicability of each tool. Later, ISTV was chosen to visualize the output of a CH3D model of the lower Mississippi River. We have also used ISTV to show WAM output. Robert Jensen of CEWES reports that ISTV has been useful for looking at correlations between variables. Animating through the time series with ISTV was especially revealing: a throbbing effect is seen, apparently due to the assimilation of wind data every 3 hours. The significance of this is currently under study. Finally, ISTV is being used to look at the coupled WAM-CH3D model being tested against data for Lake Michigan. We anticipate that continued use of these tools in varied applications will further understanding for the warfighter of the forces encountered during littoral operations.

The visualization tool volDG was designed to explore the use of wavelet representations for very large datasets. Wavelet representation provides a compression scheme useful for large data. Wavelets can also support detection of features, such as singularities, which can guide meaningful access and transfer of very large data sets.
For example, structure-significant encoding of a data set is possible. In subsequent data exploration, regions of high significance can be examined first. We are working with Raju Namburu of CEWES on the application of these ideas to his CSM problem area.

The PET SV team also conducted an in-depth review of the various software packages available to support computational monitoring and interactive steering. In our initial report, we summarized the characteristics of several of these tools. We also applied the most promising of these tools to a CEWES MSRC application - the parallel version of CE-QUAL-ICM - and reported on this hands-on activity. In conjunction with this effort, we connected the CE-QUAL-ICM code to NCSA's Collaborative Data Analysis Toolsuite (CDAT). CDAT is a multi-platform collection of collaborative visualization tools. In this scenario, participants on an ImmersaDesk, a desktop workstation, and a laptop were able to simultaneously explore the simulation output as it was generated from a 12-processor run of CE-QUAL-ICM. This was a particularly rewarding effort, since it involved staff from across the PET team. The parallel version of CE-QUAL-ICM is the work of Mary Wheeler's (Texas) EQM PET team. NCSA was responsible for the visualization tools. Those tools derive their collaboration capabilities from Tango Interactive, from Geoffrey Fox (Syracuse). CEWES MSRC user Carl Cerco is eager to try out these capabilities in support of his work.

As part of our technology watch efforts, we monitored the industry for new developments, paying particular attention to the rapidly increasing graphics capabilities of desktop PCs. We summarized these efforts in our annual "Report from SIGGRAPH" publication. We also canvassed users to assess their current data management strategies, and provided information discussing the applicability of NCSA's HDF5 data management package to CEWES MSRC users.

Finally, we conducted both informal and formal training sessions. We developed and presented "An Introduction to Scientific Visualization" at the Jackson State Summer Institute. A half-day "Introduction to HDF" was presented at both the CEWES MSRC and Jackson State. A day-long class introduced the various packages that exist for supporting computational monitoring, such as CUMULVS, DICE, and pV3. In Year 4, we intend to convert the material developed for this class to an on-line format, so that the information will be available at any time for continuing impact. Another training day discussed the use of the visualization tool ISTV. These courses have led to follow-up and continuing contact with CEWES MSRC users.

University of Southern California

Although many High Performance Computing (HPC) platforms have been deployed at various MSRCs, there has been a "gap" in understanding the underlying architecture from an end-user's perspective. To further complicate matters, operating system characteristics, compiler features, and various standards affect the performance of user applications. Most MSRC end-users have been developing their applications ad hoc, without much consideration given to the performance of their algorithms. The goal of the USC team's focused effort was to develop a set of benchmarks and a model of the underlying architecture in order to help end-users develop efficient parallel algorithms for their applications. To evaluate the performance of HPC platforms deployed at the MSRCs, researchers have proposed various benchmarks.
Some benchmarks attempt to measure the peak performance of these platforms. They employ various optimizations and performance tuning to deliver close-to-peak performance. These benchmarks showcase the full capability of the products; however, for most users these performance measures are of little practical use, since the actual performance depends on a number of factors including the architecture and the compiler used. Other benchmarks attempt to measure the performance of these platforms with a set of representative algorithms for a particular scientific domain. Although useful, these benchmarks do not give end-users a simple method for evaluating their own algorithms and implementations.

The USC team has taken a novel approach to benchmarking that addresses the actual performance available to end-users. The benchmarks allow end-users to understand the machine characteristics, the communication environment, and the compiler features of the underlying HPC platform at a user level. Using the results of these benchmarks, the USC team is able to provide end-users with a simple and accurate model of HPC platforms, including the software environment. Using such a model, end-users can analyze and predict the performance of a given algorithm. This allows algorithm designers to understand tradeoffs and make critical decisions to optimize their code on a given HPC platform.

Message passing systems such as the IBM SP and shared memory systems such as the SGI/Cray Origin2000 are widely used HPC platforms. Some hybrid systems such as the SGI/Cray T3E attempt to provide features of both types of architectures by including message passing capabilities and a globally shared address space. In using these HPC platforms, there are several layers of interface to the actual hardware. These include the operating system, compilers, library codes for computation and communication such as ScaLAPACK, PBLAS, MPI, etc., and other support utilities. Initially the USC team investigated the various features that affect performance for end-users and formulated a set of parameters to model these factors. In order to measure these parameters, a large set of experiments was conducted on all available platforms. The results were then carefully evaluated to produce a coherent picture of the factors affecting the performance of the HPC platforms.

From these analyses, the USC team determined that in predicting the performance of algorithms on HPC platforms, the key factor is an accurate cost analysis of data access. The cost of communicating data is heavily affected by the data location. The data may be physically located in the local memory, in a remote processor, or on secondary storage such as a disk. The various possible data locations can be thought of as a data hierarchy. Thus, data may be communicated between processor and memory, between processors, or between secondary storage and the processor. The cost to access data increases dramatically as the data moves down the hierarchy. The USC team's benchmarks measure the cost of accessing data along this hierarchy. From these benchmark results, the USC team has generated a set of parameters and formulated the Integrated Memory Hierarchy (IMH) model. The IMH model provides a uniform set of parameters across multiple platforms that measures the cost of communication from the various levels of the data hierarchy. This allows end-users to evaluate and predict the performance of their algorithms on a particular HPC platform.
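The sketch below indicates, at a user level, how benchmarks of this kind can be written: it times streaming access to local memory for increasing working-set sizes (exposing cache effects) and a simple MPI round trip between two processors as a proxy for remote-access cost. The sizes, repetition counts and output format are illustrative assumptions; the actual USC benchmark suite and IMH parameterization are considerably more extensive.

/* Hedged sketch of user-level measurement of data-access cost at two levels
 * of the hierarchy: local memory and a remote processor.                    */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static volatile double sink;   /* keeps the summation from being optimized away */

static double time_local(long words, int reps)
{
    double *a = malloc(words * sizeof(double));
    double t, sum = 0.0;
    long i;
    int r;

    for (i = 0; i < words; i++) a[i] = (double)i;
    t = MPI_Wtime();
    for (r = 0; r < reps; r++)
        for (i = 0; i < words; i++) sum += a[i];       /* streaming read */
    t = MPI_Wtime() - t;
    sink = sum;
    free(a);
    return t / ((double)reps * (double)words);         /* seconds per word */
}

int main(int argc, char **argv)
{
    int rank, size, i;
    long words;
    double buf[4096], t;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    for (i = 0; i < 4096; i++) buf[i] = 0.0;

    if (rank == 0)                                     /* local-memory level */
        for (words = 1024; words <= (1L << 22); words *= 8)
            printf("local access, %8ld words: %.2e s/word\n",
                   words, time_local(words, 10));

    if (size >= 2) {                                   /* remote-access level */
        t = MPI_Wtime();
        if (rank == 0) {
            MPI_Send(buf, 4096, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, 4096, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("round trip, 4096 words: %.2e s\n", MPI_Wtime() - t);
        } else if (rank == 1) {
            MPI_Recv(buf, 4096, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, 4096, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}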
In order to evaluate the benchmark results and the IMH model, the USC team has optimized a CFD application supplied by an end-user. A detailed explanation of this optimization is included in Section VIII on outreach to users. This optimization effort resulted in a Win-Win situation for the end-user as well as for the USC team. The optimization efforts produced an algorithm with approximately a 5-fold increase in performance using 30 processors compared with the original algorithm. The resulting optimized algorithm remained scalable well beyond 30 processors and was only limited by the size of the data to be processed. The benchmark results and the IMH model allow end-users to evaluate and predict the performance of their algorithms on a particular HPC platform. A uniform set of parameters across multiple platforms allows end-users to make intelligent decisions in selecting the best platform for a given application. Once a platform has been selected, the end-user can use the IMH model to evaluate the performance of their algorithm. Using this evaluation in conjunction with the developed optimization techniques, they can modify and optimize their code to achieve superior performance. The end-user will also be able to estimate the performance of various algorithm alternatives without actual coding, thus greatly reducing the time for developing optimized applications.
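As a rough illustration of how measured per-level access costs can be combined into such an estimate (a generic hierarchical cost model in the spirit of the discussion above, not necessarily the exact IMH formulation):

\[
T_{\mathrm{est}} \;=\; \sum_{\ell \in \{\mathrm{cache},\,\mathrm{memory},\,\mathrm{remote},\,\mathrm{disk}\}} n_\ell \, c_\ell \;+\; W \, t_{\mathrm{op}},
\]

where $n_\ell$ is the number of data items the algorithm moves at hierarchy level $\ell$, $c_\ell$ is the measured per-item access cost at that level, $W$ is the operation count, and $t_{\mathrm{op}}$ is the measured per-operation compute cost. An end-user can estimate $n_\ell$ and $W$ for each candidate algorithm from its data-access pattern and compare the resulting estimates before writing any code.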