V. YEAR 3 ACCOMPLISHMENTS

As has been noted above, the PET effort at the CEWES MSRC operates by providing core support to CEWES MSRC users, performing specific focused efforts designed to introduce new tools and computational technology into the CEWES MSRC, conducting training courses and workshops for CEWES MSRC users, and operating an HBCU/MI enhancement program. The major accomplishments of the CEWES MSRC PET effort in enhancing the programming environment at the CEWES MSRC are described in this section. The presentation here is organized by CTAs and technical infrastructure support areas, but there is much overlap in effort. Tools introduced into the CEWES MSRC in the course of Year 3 of the PET effort are listed in Table 4 and are described in Section VI. Specific CEWES MSRC codes impacted by the PET effort during Year 3 are listed in Table 5, and items of technology transfer into the CEWES MSRC are listed in Table 6. More detail on the Year 3 effort is given in the specific Focused Efforts described in the Appendix and, especially, in the CEWES MSRC PET Technical Reports and other publications from Year 3 that are abstracted in Section X. Training conducted during Year 3 is described in Section VII, and outreach to CEWES MSRC users is covered in Section VIII. The accomplishments in the HBCU/MI component of the CEWES MSRC PET effort are discussed in Section IX.

------------------------------------------------------------------------------

CFD: Computational Fluid Dynamics CTA (ERC-Mississippi State)

Year 3 efforts in support of the CFD CTA at CEWES MSRC were divided between core support and focused effort. Primary core support contributions were made by the On-Site Lead (Bova), with overall coordination assistance from the at-university team. One Focused Effort - Scalable and Parallel Integration of Hydraulic, Wave, and Sediment Transport Models - was funded during Year 3.

Through core support, the team provided training and user outreach via participation in Parallel Programming Workshops for Fortran Programmers (Bring Your Own Code: BYOC), MPI tutorials, the JSU Summer Institute, and direct CEWES MSRC user contacts. User support was provided to two CHSSI development teams (COBALT and OVERFLOW), as well as HPC service for individual CEWES MSRC users. In collaboration with the NRC Computational Migration Group at CEWES MSRC, support was provided to J. C. T. Wang of the Aerospace Corporation by analyzing the message flow in a section of his PVM code, resulting in a successful code port to the IBM SP system. Following PET consultation, David Medina of AF Phillips Lab implemented a graph-based reordering strategy within the MAGI solver. The objective is to improve cache performance and interprocessor data locality. Results show about a 30% reduction in execution time on two, four and sixteen processors compared to execution without reordering. Continued collaboration with Fernando Grinstein of NRL resulted in development of a parallel version of NSTURB3D capable of efficient execution on all CEWES MSRC computing platforms.

A dual-level parallel algorithm using both MPI and OpenMP was designed and implemented in the CGWAVE solver in support of Zeki Demirbilek of CEWES CHL. This resulted in a dramatic reduction in turnaround time: for the demonstration case, turnaround time was reduced from 2.1 days to 12 minutes using 256 SGI O2000 processors.
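To make the dual-level structure concrete, the sketch below (in C, using MPI for the outer level and OpenMP for the inner level) shows one common way such a scheme is organized: independent work units are distributed across MPI ranks, and the loop inside each unit is threaded with OpenMP. The decomposition, function names and problem sizes are illustrative assumptions only, not the actual CGWAVE implementation.

/* Hedged sketch of a dual-level (MPI + OpenMP) parallel structure.
 * Outer level: MPI ranks take independent work units (e.g. wave components).
 * Inner level: OpenMP threads split the loop within each unit.             */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define NCOMPONENTS 64      /* assumed number of independent work units */
#define NPOINTS     100000  /* assumed number of grid points per unit   */

static double solve_component(int comp)
{
    double *phi = malloc(NPOINTS * sizeof(double));
    double sum = 0.0;
    int i;
    if (!phi) return 0.0;

    /* Inner, shared-memory level: threads divide the grid loop. */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < NPOINTS; i++) {
        phi[i] = (double)(comp + 1) / (double)(i + 1);  /* stand-in for solver work */
        sum += phi[i];
    }
    free(phi);
    return sum;
}

int main(int argc, char **argv)
{
    int rank, size, comp;
    double local = 0.0, total = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Outer, distributed-memory level: cyclic distribution of work units. */
    for (comp = rank; comp < NCOMPONENTS; comp += size)
        local += solve_component(comp);

    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("aggregate result over %d components: %g\n", NCOMPONENTS, total);

    MPI_Finalize();
    return 0;
}

Because the two levels are independent, the same code can run MPI-only, OpenMP-only, or both, which is what makes this pattern attractive on the mix of CEWES MSRC platforms.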
This project, which also served as a testbed for the MPI_Connect Tool implemented by the Tennessee SPPT team in PET, was initiated through user interaction at one of the BYOC workshops and was selected as the "Most Effective Engineering Methodology" in the HPC Challenge at SC'98.

In Year 3, the Focused Effort - Scalable and Parallel Integration of Hydraulic, Wave and Sediment Transport - was initiated in collaboration with the PET CWO team at Ohio State. The purpose of this focused effort was to enhance national defense and security by developing a scalable, parallel implementation of a coupled wave (WAM), hydraulic (CH3D) and sediment transport (SED/COSED) simulation code. Once developed, this code will enable more accurate and efficient evaluation of DoD applications such as naval harbor access, wind/wave hazard forecasting, and coastal forecasts for amphibious operations, in addition to various civil works applications. Stronger coupling of these technologies through this project will yield a simulation capability with a more realistic representation of the interaction between surface and bottom shear stresses, improving the representation of the physical phenomena. Parallelization and HPC support will substantially reduce computational requirements for long-period simulations and allow more timely simulation of configurations of military interest.

Task responsibilities within this focused effort were distributed between the CFD and CWO teams. In Year 3, the primary contributions of the CFD team were the development of a parallel implementation of the CH3D solver, including the non-cohesive sediment model (SED), and computer science support on model integration. CEWES MSRC user constraints require that results from the parallel solver replicate results from the unmodified sequential solver to within machine accuracy, and that input and output files be compatible with the sequential versions. Consequently, the implemented parallelization strategy uses a three-stage approach: pre-processing and partitioning, parallel computation, and post-processing. During the pre-processing and partitioning phase, the grid is partitioned using a simple load-balancing technique to obtain roughly equal numbers of computational cells per processor, with separate grid files generated for each processor along with the other necessary input files. In the parallel computation phase, all processors perform the computation in parallel and exchange data at the processor boundaries. At the end of a simulation, each processor writes its part of the solution to disk. These individual files are later merged during the post-processing stage to generate output files suitable for visualization.

To verify the accuracy of the parallelization effort, simulations were performed on two different test cases: Lake Michigan and Old River. Verification of parallel CH3D/SED was done in close collaboration with the CWO team and CEWES personnel. The Lake Michigan case was tested on the SGI Origin 2000 and the Cray T3E for different numbers of processors, and the performance was analyzed. Significant reduction in execution times was observed up to 32 processors. The Old River case was tested for various numbers of processors, and a longer-duration simulation was performed. Details of the code scalability performance and verification are included in the description of this focused effort in the Appendix of this report.
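The sketch below illustrates the parallel-computation and output stages of a three-stage strategy of the kind described above: each processor advances its own slice of the grid, exchanges one layer of boundary (halo) values with its neighbors every step, and finally writes its own piece of the solution for merging in the post-processing stage. The one-dimensional slab decomposition, array names and sizes are illustrative assumptions, not the actual CH3D/SED data structures.

/* Hedged sketch: halo exchange in a 1-D slab decomposition. */
#include <mpi.h>
#include <stdio.h>

#define NLOCAL 1000   /* assumed number of interior cells per processor */
#define NSTEPS 10

int main(int argc, char **argv)
{
    double u[NLOCAL + 2];      /* interior cells plus one ghost cell per side */
    int rank, size, left, right, i, step;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    for (i = 0; i < NLOCAL + 2; i++)
        u[i] = rank;           /* stand-in for the partitioned input data */

    for (step = 0; step < NSTEPS; step++) {
        /* Exchange ghost-cell values with neighboring processors. */
        MPI_Sendrecv(&u[NLOCAL], 1, MPI_DOUBLE, right, 0,
                     &u[0],      1, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[1],          1, MPI_DOUBLE, left,  1,
                     &u[NLOCAL + 1], 1, MPI_DOUBLE, right, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Stand-in for the local hydrodynamic/sediment update. */
        for (i = 1; i <= NLOCAL; i++)
            u[i] = 0.5 * (u[i - 1] + u[i + 1]);
    }

    /* Each processor writes its own part of the solution; the pieces are
     * merged in the post-processing stage.                                 */
    printf("rank %d: u[1] = %g\n", rank, u[1]);
    MPI_Finalize();
    return 0;
}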
The impact of this work on CEWES MSRC is to significantly reduce the time required for high-resolution sediment transport simulations by making use of the scalable nature of the high performance computing facilities. This work was also essential for the development of a scalable coupled model of hydraulic, sediment, and wind-wave processes, which greatly affect military operations in the marine environment.

CSM: Computational Structural Mechanics CTA (TICAM-Texas & ERC-Mississippi State)

The principal goal of the TICAM CSM support in the CEWES MSRC PET effort is to introduce, implement and test adaptive grid techniques for analysis of CSM problems for DoD simulation of "violent events". Representative DoD codes include CTH and EPIC. This work requires consideration of the following: (1) appropriate a posteriori error estimates and corresponding indicators for these problem classes; (2) strategies for h-, p- and hp-adaption of the grid; (3) complications in implementing the adaptive schemes related to the form of "legacy code"; (4) development and incorporation of appropriate data structures for adaptive refinement and coarsening; (5) efficiency in implementation; and (6) parallel, scalable HPC needs. CSM core support and focused effort activities on the project have been targeted to these goals, and we have made significant progress on several fronts:

(1) We have developed a block adaptive strategy in collaboration with Sandia and have carried out some preliminary simulation studies. This work has been implemented as an extension of the CTH code. Results of a parallel adaptive simulation carried out on the ASCI Red supercomputer system at Sandia are shown in Figure 1.

Figure 1. Sample calculation of a 3D impact of a copper ball onto a steel plate, showing (a) active blocks and (b) materials. There are 61 million active cells in this calculation, compared to 200 million for a uniform grid without adaptive refinement.

Here the test problem corresponds to hypervelocity impact of a spherical copper projectile with a target. The number of cells in the adaptive simulation is approximately one-third that of a comparable uniform grid simulation that yields the desired fine mesh resolution. As part of the work we are also developing new error and feature indicators to guide refinement and assess computational reliability.

(2) We have developed a local simplex refinement strategy and implemented it in the test code EPIC (Elasto Plastic Impact Computation). As the simulation proceeds, elements are locally refined based on element feature and error indicators. Preliminary numerical tests have been carried out for the "Taylor Anvil" problem to assess the utility of the approach. Results are presented in Figure 2, showing the mesh after several local adaptive refinement steps. Related work on space-filling curves for mesh partitioning and on grid quality is being carried out.

Figure 2: Adapted mesh after several refinement steps

Both of the above studies are singular accomplishments since they are, we believe, the first adaptive refinement calculations for this class of applications codes. They therefore constitute a strong statement concerning the main program goal - to introduce and transfer advanced adaptive grid strategies from the university and research labs to the DoD application groups. This work has been closely coordinated with our core support activities.
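As background for the adaptive strategies described above, the conceptual sketch below shows the basic indicator-driven adaption cycle: compute an error or feature indicator for each element, flag elements above a refinement threshold or below a coarsening threshold, and then rebuild the mesh and data structures accordingly. The indicator, thresholds and data layout are illustrative assumptions and do not represent the actual CTH or EPIC implementations.

/* Hedged, conceptual sketch of indicator-driven h-adaption. */
#include <math.h>
#include <stdio.h>

#define NELEM 8

enum action { KEEP = 0, REFINE = 1, COARSEN = 2 };

/* Stand-in error indicator: the jump in a field value across an element. */
static double error_indicator(const double *field, int e)
{
    double jump = (e + 1 < NELEM) ? field[e + 1] - field[e] : 0.0;
    return fabs(jump);
}

int main(void)
{
    double field[NELEM] = { 0.0, 0.1, 0.1, 0.2, 1.5, 1.6, 1.6, 1.6 };
    enum action flag[NELEM];
    double refine_tol = 0.5, coarsen_tol = 0.05;
    int e;

    for (e = 0; e < NELEM; e++) {
        double eta = error_indicator(field, e);
        if (eta > refine_tol)
            flag[e] = REFINE;          /* large local error: subdivide element */
        else if (eta < coarsen_tol)
            flag[e] = COARSEN;         /* candidate for de-refinement */
        else
            flag[e] = KEEP;
        printf("element %d: indicator %.3f -> action %d\n", e, eta, (int)flag[e]);
    }
    /* A production code would now rebuild the mesh and data structures for
     * the flagged elements and rebalance the work across processors.        */
    return 0;
}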
We gave a related short course on Finite Element Analysis of Nonlinear Problems at the DoD HPCMP Users Group Meeting at Rice University in June 1998, interacted closely with the CSM On-Site Lead, Rick Weed, lectured on adaptive techniques and multi-scale modeling at the CSM workshop held at CEWES in November 1998, discussed results at the CEWES MSRC PET Annual Review in February 1999, and organized a workshop on Adaptive Techniques, Error Indicators and Applications held at Texas in March 1999. In addition, we have worked closely with Sandia researchers and made three working trips to Sandia to foster the collaboration and expedite the work. This core support and the focused efforts constitute a cohesive approach that includes the Texas CSM team, the Army Institute for Advanced Technology (at Texas) activities, Sandia applications and software development, and CEWES MSRC applications using CTH and EPIC. The work has a strong impact on DoD applications and CEWES MSRC since it promises to change the entire approach to solving these problems in a timely and more reliable way.

CWO: Climate/Weather/Ocean Modeling CTA (Ohio State)

The rationale for the CWO team in the CEWES MSRC PET effort, and the evolution of its effort over the last three years, continues to be based on the following three points.

First, and most importantly, DoD has an extremely urgent need for timely (forecasted) information about hazardous and mitigating conditions in its theaters of interest. The primary classes of problems encountered include: the prediction of extreme wave conditions for fleet operations both offshore and in nearshore regions; the maintenance of port and harbor navigability and associated inlet conditions; the prediction of ocean sediment storms in the North Atlantic for finding enemy submarines or providing camouflage for our submarines; improved prediction of acoustic waveguides; and integrated systems for predicting subsurface mine burial mechanics and weapons retrieval. One land-based extension of these activities is the prediction of sandstorms for desert troop operations.

Second, these problem areas have several common threads. The hazardous conditions are extreme in magnitude and infrequent in occurrence, are forced by associated atmospheric events, result from three-dimensional turbulent fluid dynamics with a very broadband spectrum of nonlinearly interacting processes, give rise to sediment resuspension and transport, and most often occur in regions of shallow water or irregular coastline geometry.

Third, the present modeling strategies which underlie the forecasting procedures for these problems consist of isolated models of the various processes which are not well linked and, therefore, not able to resolve the most extreme and, consequently, the most important information that DoD needs from the forecasting programs.

Therefore, the basis of the CWO program in PET at CEWES MSRC is to implement the new model structures and couplings necessary to allow the prediction of these extreme, rapid but infrequent events, especially in the nearshore, embayment, and continental shelf environments. The increased scope and size of these new Integrated Coastal Modeling Systems (ICOMS) cannot be achieved without the use of highly parallelized software executed on highly parallel machines.
During the first two years, we have been able to proceed on two fronts: the first being the parallelization and improvement of the individual codes that are to be integrated into ICOMS, and the second being the creation of the model physics, structural, and software improvements, as well as the message passing structures, necessary to begin full coupling of wave, circulation and sediment transport codes for these integrated predictions. The individual code improvements in Year 3 centered on the completion of the parallelization and performance benchmarking of the WAM code for shallow water wave prediction, and the SED code (in collaboration with the CFD team at MSU) for noncohesive sediment resuspension and transport calculations. These activities are complete. Coupling improvements completed in Year 3 included implementation of the unsteady current refraction and radiation stress physics necessary to couple the WAM wave and CH3D circulation models. The SED model was also fully coupled except for the bottom boundary conditions.

In Year 3, we have completed the coupling of the wave, circulation and noncohesive sediment transport codes and performed preliminary verification on Lake Michigan, for which an extensive data set exists for assessment. The coupling yielded results that are significantly improved in the nearshore zone, and the tests for an extremely challenging storm, wave and sediment plume event in Lake Michigan were quite satisfactory. Tasks remaining for Year 4 include full atmospheric model coupling, the completion of the sediment bottom boundary layer conditions, and the application of ICOMS to an area of DoD interest such as the Gulf of Kuwait.

EQM: Environmental Quality Modeling CTA (TICAM - Texas)

In Year 3, the EQM team's core and focused efforts involved continuing parallel EQM code migration, developing new algorithms and software for coupling hydrodynamic flow and transport simulators, developing new software for launching parallel codes from the Web, and providing user training through frequent personal contact, workshops, and conferences.

The EQM team continued its efforts in the parallel migration of CE-QUAL-ICM. In Year 2, parallel version 1.0 of CE-QUAL-ICM was developed. In Year 3, a number of improvements were added to the code, resulting in version 2.0. In particular, the code is now able to run on any number of processors (previously it was limited to 32 processors due to the way I/O was being handled) and the overall file I/O was improved. The mesh partitioning strategy was also improved through the use of PARMETIS, a parallel mesh partitioning code developed at the Army High Performance Computing Research Center (AHPCRC) at Minnesota. Overall, version 2.0 is roughly 25% faster than version 1.0. During Year 3, CEWES personnel Carl Cerco, Mark Noel and Barry Bunch have run 106 10-year Chesapeake Bay scenarios. In the same amount of wall clock time, they would have been able to run only 11 scenarios with the previous serial version. Thus, the parallel code has resulted in an order-of-magnitude increase in computing capability and is now being used for production simulation.

During Year 3, the EQM team migrated the finite element hydrodynamics code ADCIRC (Advanced Circulation Model) to the CEWES MSRC parallel platforms. The same general parallelization approach that was used to parallelize CE-QUAL-ICM was employed. In particular, a preprocessing code was created which splits the global finite element mesh and global data into local data sets for each processor.
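A minimal sketch of this splitting step is shown below: given a partition array assigning each global node to a processor (produced, for example, by a mesh partitioner or a space-filling-curve ordering), the preprocessor writes one local data set per processor, retaining global node numbers so that the pieces can be merged afterwards. The file names, format and sizes are illustrative assumptions, not the actual ADCIRC or CE-QUAL-ICM file layout.

/* Hedged sketch of splitting a global mesh into per-processor data sets. */
#include <stdio.h>

#define NNODES 12
#define NPROCS 3

int main(void)
{
    double x[NNODES], y[NNODES];
    int part[NNODES], n, p;

    for (n = 0; n < NNODES; n++) {        /* stand-in global mesh and partition */
        x[n] = n;
        y[n] = 2.0 * n;
        part[n] = n % NPROCS;
    }

    for (p = 0; p < NPROCS; p++) {        /* one local input file per processor */
        char name[64];
        FILE *fp;
        sprintf(name, "mesh.%03d", p);
        fp = fopen(name, "w");
        if (!fp) return 1;
        for (n = 0; n < NNODES; n++)
            if (part[n] == p)             /* keep global node number for merging */
                fprintf(fp, "%d %f %f\n", n, x[n], y[n]);
        fclose(fp);
    }
    return 0;
}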
To generate the partition itself, mesh partitioning algorithms based on space-filling curves and the PARMETIS code were investigated. The latter code greatly improved the parallel performance for large numbers of processors. The parallel code was tested on several large data sets. For one data set with 271,240 nodes, a speedup of 169 on 256 processors of the Cray T3E was obtained. Speedups on smaller numbers of processors were almost linear in the number of processors. Improved versions of ADCIRC including wetting/drying and three-dimensional effects are under development, and the EQM team continues to consult with CEWES MSRC users on the parallelization of the code.

The parallelization strategy used to migrate CE-QUAL-ICM and ADCIRC has also been applied by CEWES EQM personnel to the migration of CE-QUAL-ICM/TOXI (Mark Dortch and Terry Gerald) and FEMWATER (Fred Tracy). The EQM team provided some user support for the migration of the TOXI water quality model. Tracy was able to parallelize FEMWATER on his own after attending our Workshop on Parallel Technologies held in January of 1998, patterning his approach after ours. Tracy reported on his results at the DoD HPCMP Users Group Meeting at Rice in June 1998.

One of the EQM team's major efforts in Year 3 was the development of a projection algorithm and software for coupling hydrodynamics and water quality models. In this coupling, the major issue is providing a mass-conservative velocity field suitable for the finite volume method used in CE-QUAL-ICM. We developed and implemented such an algorithm in the code UTPROJ. Theoretical error estimates for the methodology were derived which show that the projected velocities, besides being mass conservative, are at least as accurate as the original velocities produced by the hydrodynamics code. A two-dimensional code was first developed to show proof of concept. Simultaneously, a similar two-dimensional code, PROFEM, was developed at CEWES. Comparisons between UTPROJ and PROFEM led to changes in the PROFEM solver, speeding it up by an order of magnitude. We extended UTPROJ to three dimensions during Year 3 and have tested it on some synthetic data sets. The three-dimensional code is very general, allowing for combinations of tetrahedral, prismatic and hexahedral elements. To make it a truly usable code, further testing, parallelization, and improvements to the linear solver are planned for Year 4.

A prototypical demonstration of the coupling capability between ADCIRC and CE-QUAL-ICM using UTPROJ was performed during Year 3. Circulation in Galveston Bay was first simulated using the parallel ADCIRC code. The output from this code (elevation, velocities) for each transport time step was projected onto the same grid. Movement of a contaminant was then simulated using a transport code similar to CE-QUAL-ICM that we developed. A scenario whereby a point source of contaminant is released into the bay was simulated. At present, the links between flow, projection and transport are primitive. During Year 4, we hope to vastly improve the coupling process, so that it is seamless to an EQM CEWES MSRC user.

In conjunction with our activities related to water quality modeling, the EQM team investigated and implemented second-order accurate transport schemes suitable for unstructured grids. This is motivated by the need to extend the second-order QUICKEST scheme currently used in CE-QUAL-ICM to unstructured, three-dimensional grids for planned future CEWES applications.
A two-dimensional code has been developed on triangular grids, and a number of different algorithms are being tested in this framework.

The EQM team also had a focused effort on Web launching in Year 3. The launching capability was demonstrated using the PARSSIM subsurface transport simulator, a parallel code developed at Texas. A client Java applet with a GUI (graphical user interface) and three-dimensional visualization capabilities was created that allows remote users to access the PARSSIM code and data domains on CEWES MSRC servers. The results of the computation are saved on the CEWES MSRC local disks and can be selectively retrieved from Java visualization applets. The Java applet can be instantiated from any Internet Web browser. As a first prototype of launching, we created tools appropriate for launching PARSSIM and/or a visualization server selectively on CEWES MSRC server machines. An outstanding task is to incorporate GLOBUS support for metacomputing on multiple workstations.

Other activities in user support, training and outreach occurred during the year. The EQM team gave a joint presentation with Mark Noel from CEWES at the DoD HPCMP Users Group Meeting at Rice in June 1998. A workshop on Parallel Computation and EQM Modeling was held at Jackson State in June 1998. A workshop on Parallel Computing Technologies was held at CEWES MSRC in January 1999. Moreover, we have had frequent user contact with CEWES EQM personnel and CEWES MSRC users, including Carl Cerco, Mark Noel, Barry Bunch, Charlie Berger, Mark Dortch and Terry Gerald. Finally, CEWES MSRC PET technical reports detailing our accomplishments have been written, and several conference presentations were given during the year on our work.

FMS: Forces Modeling and Simulation/C4I CTA (NPAC - Syracuse)

The Year 3 FMS effort in PET at CEWES MSRC can be divided into two distinct but related thrusts. The major effort continued the core development of our WebHLA framework, and we also developed and delivered demonstrations of its use with specific FMS applications, such as parallel CMS. Another effort used some of the tools and technologies that underlie WebHLA (i.e. WebFlow, CORBA) to provide an environment which facilitates the integration of multiple simulation applications under Web-based user interfaces.

WebHLA Activities

WebHLA follows the 3-tier architecture of our Pragmatic Object Web, with a mesh of JWORB (Java Web Object Request Broker)-based middleware servers managing back-end simulation modules and offering Web portal style interactive multi-user front-ends. JWORB is a multi-protocol server capable of managing objects conforming to various distributed object models, including CORBA, Java, COM and XML. HLA is supported via Object Web RTI (OWRTI), i.e. a Java/CORBA-based implementation of DMSO RTI 1.3, packaged as a JWORB service. Distributed objects in any of the popular commodity models can be naturally grouped within the WebHLA framework as HLA federates, and they can naturally communicate by exchanging (via JWORB-based RTI) XML-ized events or messages packaged as suitable FOM interactions. HLA-compliant M&S systems can now be integrated in WebHLA by porting legacy codes (typically written in C/C++) to suitable HPC platforms, wrapping such codes as WebHLA federates using the cross-language (Java/C++) RTICup API, and using them as plug-and-play components on the JWORB/OWRTI software bus.
For previous-generation simulations following the DIS (or ALSP) model, suitable bridges to the HLA/RTI communication domain are now also available in WebHLA, packaged as utility federates. To facilitate experiments with CPU-intensive HPC simulation modules, we developed database tools such as an event logger, an event database manager and an event playback federate that allow us to save entire simulation segments and replay them later for analysis, demonstration or review purposes. Finally, we also constructed SimVis, a commodity (DirectX on NT) graphics-based battlefield visualizer federate that offers a real-time interactive 3D front-end for typical DIS=>HLA entity level (e.g. ModSAF style) simulations.

In parallel with prototyping core WebHLA technologies such as JWORB or OWRTI, we are also analyzing selected advanced M&S modules such as the CMS (Comprehensive Mine Simulator) system developed at Ft. Belvoir VA, which simulates mines, mine fields, minefield components, standalone detection systems and countermine systems including ASTAMIDS, SMB and MMCM. The system can be viewed as a virtual T&E tool to facilitate R&D in the area of new countermine systems and detection technologies of relevance both to the Army and the Navy. We recently constructed a parallel port of the system to the Origin2000, where it was packaged so that it can be used either as a DIS node or as an HLA federate.

Based on the analysis of the sequential CMS code, we found the semi-automatic, compiler-directive-based approach to be the most practical parallelization technique to start with in our case. The most CPU-intensive inner loop of the CMS simulation runs over all mines in the system, and it is activated in response to each new entity state PDU to check whether there is a match between the current vehicle and mine coordinates that could lead to a mine detonation. Using directives such as "pragma parallel" and "pragma pfor" we managed to partition the mine-vehicle tracking workload over the available processors, and we achieved a linear speedup up to four processors. For large multiprocessor configurations, the efficiency of our pragma-based parallelization scheme deteriorates due to the NUMA memory model and increased contention on the internal network. This makes efficient use of cache critically important to obtaining scalable performance. We were unable to enforce a cache-efficient data decomposition using pragmas, probably due to the rather complex object-oriented and irregular code in the CMS inner loop. We are currently rebuilding and simplifying the inner loop code so that the associated memory layout of objects is more regular and hence predictable.

Playing the real scenario over and over again for testing and analysis is a time-consuming and tedious effort. A database of the equivalent PDU stream is a good solution for selectively playing back segments of a once-recorded scenario. For a prototype version of such a PDU database (PDUDB) we used Microsoft's Access database and Java servlets for loading as well as retrieving the data from the database using JDBC. The PDU logger servlet receives its input via an HTTP POST message in the form of XML-encoded PDU sequences. The input stream is decoded, converted to SQL and stored in the database using JDBC. The playback is done using another servlet that sends the PDUs generated from the database as a result of a query. The servlet is activated by accessing it from a Web browser.
Currently the queries are made on timestamps, which are used to determine the frequency with which PDUs are to be sent, but arbitrary queries can be made on the database to retrieve other information. The servlet can send the PDUs either in DIS mode or in HLA mode.

In our Pragmatic Object Web approach, we integrate CORBA-, Java-, COM- and WOM-based distributed object technologies. We view CORBA and Java as most suitable for the middleware and back-end, whereas COM is the leading candidate for interactive front-ends due to Microsoft's dominance of the desktop market. Of particular interest to the M&S community is the COM package DirectX, which offers a multimedia API for developing powerful graphics, sound and network play applications, based on a consistent interface to devices across different hardware platforms. Using DirectX/Direct3D technology, we constructed a real-time battlefield visualizer, SimVis, that can operate in both the DIS and HLA modes. SimVis is a Windows NT application written in Visual C++ using the DirectX/Direct3D API. The ModSAF terrain is the input for a sampler program, which provides vertices, colors, and texture information for each of the faces. After the terrain is constructed, it is added as a visual object to the scenario scene. Geometry objects and animation sets for typical battlefield entities such as armored vehicles (tanks) and visual events such as explosions were developed using the 3D Studio MAX authoring system. SimVis visual interactive controls include base navigation support, choice of rendering modes, and a variety of scene views.

The components described above have been combined into a demonstration that features an HPC parallel application (CMS) interacting with other simulation components running on geographically distributed computing resources. A typical configuration involves a parallel CMS federate running on Origin2000s at the CEWES and ARL MSRCs and other modules (ModSAF; JDIS, the Java-based DIS/HLA bridge; PDUDB; SimVis) running on Syracuse SGI and NT workstations. Our initial results are quite encouraging, and we therefore believe that WebHLA will evolve toward a powerful modeling and simulation framework, capable of addressing new challenges of DoD and commodity computing in many areas that require federation of multiple resources and collaborative Web-based access, such as Simulation Based Design and Acquisition.

WebFlow Activities

The value of high performance commodity computing (HPcc) is not limited to those applications which require HLA compatibility. It is a new type of infrastructure which gives the user access to a full range of commercial capabilities (i.e. databases, compute servers), pervasive access from all platforms, and a natural incremental path for enhancement as the computer industry juggernaut continues to deliver software systems of rapidly increasing power. The WebFlow system, which also underlies the WebHLA system described above, is an HPcc system designed to provide seamless access to remote resources through the use of CORBA-based middleware. During Year 3, Syracuse researchers have worked on using this tool to help couple together existing application codes and provide them with Web-based user interfaces in such a way that the user interface can run locally to the user, perhaps in the field, while the applications run on one or more "back-end" computational resources elsewhere on the network.
The principal example developed during the year is the Land Management System (LMS), the development of which is led by CEWES. The primary computational codes in LMS are EDYS, which simulates vegetation growth, including disturbances (such as military exercises), and CASC2D, which simulates watersheds and the effects of rainfall. They interact with a third module, the Watershed Management System (WMS), which provides input processing and output visualization services to both codes. WMS requires a variety of input information, including digital elevation models, land use data, and details of the soils and vegetation. Once the user defines the simulation boundaries, the WebFlow system is used to extract the relevant information from databases, both local and Internet-accessible databases run by, for example, the U.S. Geological Survey. WMS processes the database extracts into input files appropriate for EDYS and CASC2D, and WebFlow then launches the simulation. Because of the construction of the simulation codes, CASC2D is invoked just once for the duration of the simulation and "paused" while the vegetation simulation is being run, while EDYS is invoked anew for each segment of the simulation. This, along with the interconnection of the data streams for the two codes, is handled by WebFlow. It is also worth noting that the two codes happen to be run on separate platforms, in accord with their separate resource requirements. When the simulation is complete, WMS can be used through the Web-based interface to visualize the results. The culmination of this project was a class, conducted by Tom Haupt at the CEWES MSRC TEF, which discussed the tools and technologies used here and how they can be applied to other problems. This is yet another highly successful demonstration of the value and generality of the HPcc approach, and we look forward to helping users make further use of this software development model.

C/C: Collaboration and Communications (NPAC - Syracuse)

NPAC's efforts in the C/C area as a part of the CEWES MSRC PET effort encompass not just Collaboration and Communication, per se, but also the closely related area of tools and technologies for training and education. The NPAC team is also involved in analogous support activities in the ARL MSRC and ASC MSRC PET programs, as well as having a modest involvement in distance training and education activities in conjunction with the NAVO MSRC PET program. These additional connections provide a great deal of synergy with C/C activities sponsored by the CEWES MSRC. The focus of our work has been to understand the needs and requirements for "collaboration and training" tools in the context of PET activities, and to work towards their deployment. In this sense, we are strongly involved with both the rest of the PET team and the MSRC user community. At CEWES MSRC, our work has focused primarily on support for electronic training, and especially on synchronous tools for collaboration and training - support for "live" two-way interactions over the network. When combined with our activities for the other PET programs, this work becomes part of a unified whole, covering both synchronous and asynchronous tools for information dissemination, collaboration, and training. During Years 2 and 3, NPAC has worked extensively with PET partner Jackson State University (JSU) on experiments in network-based distance education.
Using NPAC's Tango Interactive collaboratory framework, a series of semester-length academic credit courses has been taught by Syracuse-based instructors to students at JSU. The experience gained from these efforts has been critical in understanding both the technical and sociological factors surrounding distance education. Tango Interactive is a framework that allows sharing of applications across the Internet. It includes a suite of tools useful for basic collaboration and distance learning activities: shared Web browser, chat, whiteboard, audio/video conferencing, shared text editor, etc. It also provides an application program interface (API) that allows other applications to be hooked into the Tango framework. Tools of this sort are relatively new, and even the most computer-savvy have little or no experience with them as yet. Consequently, it became clear rather early on that for these tools to gain acceptance in either collaborative or educational applications, they must be deployed in a staged fashion, starting from well-structured environments (i.e. classroom-style educational use) and working towards less structured environments (i.e. general research collaboration).

During Year 3, NPAC collaborated with the Ohio Supercomputer Center (OSC), another CEWES MSRC PET partner, to transition the tools and experience from the JSU education efforts into the PET training arena. OSC provided instructors and course content, while NPAC worked with OSC, CEWES MSRC, and other recipient centers to provide support for the delivery tools (Tango Interactive). This effort expanded our experience to include instructors previously unfamiliar with the delivery tools; while it retained the structured instructor/student relationship, the compressed delivery (over a few days rather than many weeks) introduced additional requirements on the robustness of the system. This highly successful effort has led the way to what will become widespread and fairly routine use of distance training across the PET program in the coming year. It has also led to increased interest in the use of the Tango collaboratory in less structured environments, such as some kinds of meetings and research collaborations. This has given us a number of exploratory groups to work with as we begin the staged deployment of network collaboration tools into these progressively less structured situations. This and other work at CEWES MSRC, combined with related activities at the other MSRCs in the collaboration and training area, has led us to develop an integrated picture of the current state of the art in both synchronous and asynchronous collaboration technology and of how we believe these technologies can be used effectively within the context of the PET program.

SPPT: Scalable Parallel Programming Tools (CRPC - Rice/Tennessee)

The primary goal of SPP Tools in the CEWES MSRC PET effort is to promote a uniform, high-level, easy-to-use programming environment available to all CEWES MSRC users and, ultimately, to all of DoD. We view "programming" very broadly, encompassing any means of describing a sequence of executable actions. This includes writing traditional CFD simulations as well as developing quick-and-dirty filters to scan output files. Similarly, we view "tools" as including any software that eases the task of programming.
This includes tools, such as compilers, that make programming possible at all; other tools, such as libraries, that provide easier-to-use facilities for programming; and some tools, such as performance profilers, that help users understand their programs. SPP Tools thus cuts across disciplines: if we are successful, all CTAs will benefit from the new tools. This is a long-term goal that will not be met by PET activities alone. In order to provide a complete set of tools, new resources will be needed. We therefore put great emphasis on collaboration with CEWES MSRC users (to understand the requirements for new tools), with other PET team members (to disseminate and test the tools), and with groups outside of DoD (to identify and develop the tools).

Rice

The major parts of the Rice effort for Year 3 concern work accomplished by our SPPT On-Site Lead, Clay Breshears, in organizing the DoD HPCMP Users Group Meeting held at Rice in June 1998, completing various focused projects, and giving five tutorials. Major impact came from the On-Site Lead working with users at CEWES MSRC and other DoD sites. Many users were provided with critical help on upgrading their application codes or on other programming-related issues.

The DoD HPCMP Users Group Meeting was held at Rice in June 1998, with over 350 attendees. This required a large effort from the SPP Tools Lead at Rice University and the Rice support staff. Ken Kennedy gave the keynote address at the meeting.

Year 3 project work included: Scalar Optimization for the CFD code HELIX; Use of OpenMP in HELIX; and an evaluation of CAPTools, a tool suite used for the upgrade or development of parallel application codes. These projects are summarized in the Appendix, and more detail appears in CEWES MSRC PET technical reports.

The Scalar Optimization project had significant merit. Its findings resulted in the HELIX code being at least twice as fast, with no other changes required. The code was transformed from its original form to an equivalent form for which compiler optimization is effective. The key idea is to use intermediate scalar stores. This allowed the back-end code to use internal or working registers for intermediate storage. The technique reduced the latency costs associated with storing to ordinary memory, allowing temporary stores to occur in the high-speed registers. We therefore recommend the technique to CEWES MSRC users.

The OpenMP Evaluation project had merit, but it appears as a negative result. The HELIX code was profiled to find where the time is spent; more than 94% occurs in a single routine. This routine was a likely candidate to benefit from installing OpenMP directives and using related compiler options and environment variables at run time. This is an error-prone and tedious process. Further, the reductions in running time were marginal, and similar reductions can be routinely obtained in a variety of other ways. It is possible that the library of support software for the processed OpenMP or back-end code has an error or bug. Given the current state of OpenMP, we recommend that users rely on proven technologies for parallel programming, including message passing, threads and tuple-space methods.
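The fragment below illustrates the "intermediate scalar stores" idea behind the Scalar Optimization project (shown in C for brevity; HELIX itself is a Fortran code, and this loop body is a made-up stand-in). The point is only that values reused within an iteration are copied once into local scalars, which the back end can keep in registers instead of repeatedly reloading from, and storing to, memory.

/* Original form: q[i], dq[i] and r[i] are referenced repeatedly.  Because the
 * compiler generally cannot prove the arrays do not overlap, each store to
 * q[i] forces the other references to be reloaded from memory.              */
void update_orig(int n, double *q, const double *dq, const double *r)
{
    int i;
    for (i = 0; i < n; i++) {
        q[i] = q[i] + 0.5 * dq[i];
        q[i] = q[i] * r[i] + dq[i] * dq[i];
        q[i] = q[i] - 0.25 * dq[i] * r[i];
    }
}

/* Transformed form: intermediate scalar stores let the compiler carry the
 * working values in registers and write q[i] only once per iteration.       */
void update_opt(int n, double *q, const double *dq, const double *r)
{
    int i;
    for (i = 0; i < n; i++) {
        double dqi = dq[i];
        double ri  = r[i];
        double qi  = q[i] + 0.5 * dqi;
        qi = qi * ri + dqi * dqi;
        qi = qi - 0.25 * dqi * ri;
        q[i] = qi;
    }
}

Both forms compute the same result; only the second gives the compiler the freedom to keep the intermediate values in high-speed registers, which is the effect described above.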
Programming with CAPTools involved feeding the sequential implementation of a model nonlinear 2D elliptic PDE code to the CAPTools interactive parallelization system, and guiding the source-to-source code transformation by responding to various queries about quantities available only at runtime. An important issue with this software is its significant licensing fee, which limits its availability to a few machines at CEWES MSRC. Finally, Rice contributed two sets of PowerPoint slides, given as tutorials, to the production of the CD-ROM prepared for CEWES MSRC PET by Syracuse.

Tennessee

The Tennessee Year 3 SPPT effort focused on meeting the programming tool needs of a variety of DoD application developers and CEWES MSRC users, including those associated with code migration projects, CHSSI and DoD Challenge projects, and an SC'98 HPC Challenge entry. Our approach has been to ensure that appropriate tools are installed and working properly on CEWES MSRC systems, to make information about installed tools available to as wide an audience as possible, and to provide one-on-one assistance to users as needed.

All computational technology areas require robust, easy-to-use debugging and performance analysis tools. Since many DoD users develop and/or run applications on multiple MSRC platforms, it is highly desirable to have the same tools available across platforms so that users receive the maximum benefit from the time spent learning to use a tool. Large-scale DoD applications often push tool capabilities to their limits, revealing scalability limitations and sometimes even breaking the tools. Thus, a major focus of our Year 3 effort was to provide users with reliable and effective cross-platform debugging and performance analysis tools and to work with tool developers and vendors to address scalability limitations of current tool technology.

To enable CEWES MSRC users to find out about and learn to use appropriate tools, we have put together a Web-based Scalable Parallel Programming (SPP) Tools repository at http://www.nhse.org/rib/repositories/cewes_spp_tools/. This repository contains a listing of tools being made available and/or supported as part of our PET efforts. In addition to debugging and performance analysis tools, information is available about high performance math libraries, parallel languages and compilers, and parallel I/O systems. The repository includes a software deployment matrix that provides a concise view of which tools are installed on which platforms. By clicking on an entry for a particular tool and a particular platform, the user can access site-specific usage information and tips as well as Web-based tutorials and quick-start guides. The Repository in a Box (RIB) toolkit that was used to create the SPP Tools repository is available for CTAs and other PET support areas to use to make their software and tools more accessible and useful to users.

Debuggers listed in the SPP Tools repository include the cross-platform TotalView multiprocess and multithread debugger, as well as platform-specific debuggers. Although we highly recommend TotalView, since it has excellent capabilities and is available on all CEWES MSRC platforms, information is included about platform-specific debuggers in case their special features are needed. Performance analysis tools listed include the Vampir cross-platform performance analysis tool, as well as platform-specific tools.
We have worked closely with CEWES MSRC systems staff to ensure that the tools are properly installed and tested on CEWES MSRC platforms and to report any bugs to the tool developers. When requested, we have worked one-on-one with CHSSI and DoD Challenge application developers to make effective use of performance analysis tools to improve the performance of their codes. These contacts are detailed in Table 3. For the SC'98 HPC Challenge entry from CEWES involving the CGWAVE harbor response simulation code, we worked closely with the CEWES MSRC team to develop and debug a tool called MPI_Connect for MPI intercommunication between application components running on different machines at different MSRCs, so as to achieve multiple levels of parallelism and help reduce the runtime for CGWAVE from months to days. The MPI_Connect system is being made available for general use by DoD users who have similar intercommunication and metacomputing needs.

Developers of several large-scale DoD applications, for example the Mach 3 code used in the Radio Frequency Weapons DoD Challenge project at AF Phillips Lab, have run into scalability problems when trying to use current performance analysis tool technology. Trace-based techniques can produce extremely large and unwieldy trace files which are difficult or impossible to store, transfer, and analyze. For long-running simulations it is often desirable to turn tracing on and off during execution, or to decide what further performance data to collect based on an analysis of data already collected during the same run. To address these issues, we are experimenting with dynamic instrumentation techniques and with virtual reality immersion to enable scalable performance analysis of large DoD applications. Using dynamic instrumentation, the user can monitor application execution during runtime, decide what data to collect, and turn data collection on and off as desired. We are initially using dynamic instrumentation to turn Vampir tracing on and off under user control, and to dynamically select, start, and read the values of hardware performance counters.

To address the problem of scalable visualization of large trace files, we have begun experimenting with 3-D virtual reality immersion using the Virtue system developed at the University of Illinois. By using Virtue on an ImmersaDesk or a graphics workstation, the user can view state changes and communication behavior for the tasks making up a parallel execution in the form of a 3-D time tunnel display that can be interactively manipulated and explored. The dynamic call tree for a process can be viewed as a 3-D tree, with the sizes and colors of nodes representing attributes such as duration and number of calls. The amount of data that can be visualized at one time using 3-D virtual immersion is one to two orders of magnitude more than with a 2-D display. Virtue also has multimedia tools that allow remote collaborators to view and manipulate the same Virtue display (although scaled down and with fewer navigation capabilities) from their desktop workstations. We have successfully installed Virtue on the CEWES MSRC Scientific Visualization Center systems. We have written a converter program that translates Vampirtrace files into the data format understood by Virtue, and we have begun experimenting with Virtue to display large Vampirtrace files produced by CEWES MSRC applications. Our next step will be to enable scalable real-time performance visualization by linking dynamic instrumentation with Virtue.
Performance data captured in real time will be sent to the Virtue system, where the user will be able to visualize the data. Our long-range goal is to allow the user to interact with and modify or steer a running application through the Virtue/dynamic instrumentation interface.

SV: Scientific Visualization (NCSA-Illinois & ERC-Mississippi State)

During Year 3, we continued an active dialog with CEWES MSRC users and visualization staff, provided information about emerging developments in graphics and visualization technology, assisted with visualization production, and conducted a number of training sessions. We also transferred a number of specific tools to CEWES MSRC users and provided training in their use. These tools are having direct impact on these users' ability to do their work. These activities are consistent with the 5-Year Strategic Plan developed for PET visualization, which calls for initiatives in (1) evaluating and extending coprocessing environments, (2) identifying strategies for management of very large data sets, and (3) defining user requirements for collaborative visualization.

Several application-specific tools were transferred to CEWES MSRC users. These include CbayVisGen, CbayTransport, ISTV, DIVA, and volDG. CbayVisGen is a visualization tool specially designed to support the visualization needs of Carl Cerco and his EQM team at CEWES. This group is investigating long-term phenomena in Chesapeake Bay, with results being provided to the EPA. CbayVisGen used existing visualization libraries to build a customized tool for visualizing the hydrodynamics and nutrient transport activity over 10-year and 20-year time periods. Cerco's work has moved to full production runs, and he has frequent need to share his results with his EPA science monitor. CbayVisGen was therefore customized to include easy image and movie capture; these items can be easily transferred to a Web page for sharing with his colleagues. A follow-on tool, CbayTransport, experimented with alternatives for visualizing the transport flux data. Cerco had no mechanism for viewing this part of his data, so this tool has added new and needed capability.

The DIVA and ISTV visualization tools, developed at MSU's ERC, were used to support visualization for the CWO user community of CEWES MSRC. Both tools were put to an initial task of generating stills and movies, in an effort to assess the applicability of each tool. Later, ISTV was chosen to visualize the output of a CH3D model of the lower Mississippi River. We have also used ISTV to show WAM output. Robert Jensen of CEWES reports that ISTV has been useful for looking at correlations between variables. Animating through the time series with ISTV was especially revealing: a throbbing effect is seen, apparently due to the assimilation of wind data every 3 hours. The significance of this is currently under study. Finally, ISTV is being used to look at the coupled WAM-CH3D model being tested against data for Lake Michigan. We anticipate that continued use of these tools in varied applications will further understanding for the warfighter of the forces encountered during littoral operations.

The visualization tool volDG was designed to explore the use of wavelet representations for very large datasets. Wavelet representation provides a compression scheme useful for large data. Wavelets can also support detection of features, such as singularities, which can guide meaningful access and transfer of very large data sets.
For example, structure-significant encoding of a data set is possible. In subsequent data exploration, regions of high significance can be examined first. We are working with Raju Namburu of CEWES on the application of these ideas to his CSM problem area.

The PET SV team also conducted an in-depth review of the various software packages available to support computational monitoring and interactive steering. In our initial report, we summarized the characteristics of several of these tools. We also applied the most promising of these tools to a CEWES MSRC application - the parallel version of CE-QUAL-ICM - and reported on this hands-on activity. In conjunction with this effort, we connected the CE-QUAL-ICM code to NCSA's Collaborative Data Analysis Toolsuite (CDAT). CDAT is a multi-platform collection of collaborative visualization tools. In this scenario, participants on an ImmersaDesk, a desktop workstation, and a laptop were able to simultaneously explore the simulation output as it was generated from a 12-processor run of CE-QUAL-ICM. This was a particularly rewarding effort, since it involved staff from across the PET team. The parallel version of CE-QUAL-ICM is the work of Mary Wheeler's (Texas) EQM PET team. NCSA was responsible for the visualization tools. Those tools derive their collaboration capabilities from Tango Interactive, from Geoffrey Fox (Syracuse). CEWES MSRC user Carl Cerco is eager to try out these capabilities in support of his work.

As part of our technology watch efforts, we monitored the industry for new developments, paying particular attention to the rapidly increasing graphics capabilities of desktop PCs. We summarized these efforts in our annual "Report from SIGGRAPH" publication. We also canvassed users to assess their current data management strategies, and provided information discussing the applicability of NCSA's HDF5 data management package to CEWES MSRC users.

Finally, we conducted both informal and formal training sessions. We developed and presented "An Introduction to Scientific Visualization" at the Jackson State Summer Institute. A half-day "Introduction to HDF" was presented at both the CEWES MSRC and Jackson State. A day-long class introduced the various packages that exist for supporting computational monitoring, such as CUMULVS, DICE, and pV3. In Year 4, we intend to convert the material developed for this class to an on-line format, so that the information will be available at any time for continuing impact. Another training day discussed the use of the visualization tool ISTV. These courses have led to follow-up and continuing contact with CEWES MSRC users.

University of Southern California

Although many High Performance Computing (HPC) platforms have been deployed at various MSRCs, there has been a "gap" in understanding the underlying architecture from an end-user's perspective. To further complicate matters, operating system characteristics, compiler features, and various standards affect the performance of user applications. Most MSRC end-users have been developing their applications ad hoc, without much consideration given to the performance of their algorithms. The goal of the USC team's focused effort was to develop a set of benchmarks and a model of the underlying architecture in order to help end-users develop efficient parallel algorithms for their applications. To evaluate the performance of HPC platforms deployed at the MSRCs, researchers have proposed various benchmarks.
Some benchmarks attempt to measure the peak performance of these platforms. They employ various optimizations and performance tuning to deliver close-to-peak performance. These benchmarks showcase the full capability of the products; however, for most users these performance measures are of little practical use, since the actual performance depends on a number of factors including the architecture and the compiler used. Other benchmarks attempt to measure the performance of these platforms with a set of representative algorithms for a particular scientific domain. Although useful, these benchmarks do not give end-users a simple method for evaluating their own algorithms and implementations.

The USC team has taken a novel approach to benchmarking that addresses the actual performance available to end-users. The benchmarks allow end-users to understand the machine characteristics, the communication environment, and the compiler features of the underlying HPC platform at a user level. Using the results of these benchmarks, the USC team is able to provide end-users with a simple and accurate model of HPC platforms, including the software environment. Using such a model, end-users can analyze and predict the performance of a given algorithm. This allows algorithm designers to understand tradeoffs and make critical decisions to optimize their code on a given HPC platform.

Message passing systems such as the IBM SP and shared memory systems such as the SGI/Cray Origin2000 are widely used HPC platforms. Some hybrid systems such as the SGI/Cray T3E attempt to provide features of both types of architectures by including message passing capabilities and a globally shared address space. In using these HPC platforms, there are several layers of interface to the actual hardware. These include the operating system, compilers, library codes for computation and communication such as ScaLAPACK, PBLAS, MPI, etc., and other support utilities. Initially the USC team investigated the various features that affect performance for end-users and formulated a set of parameters to model these factors. In order to measure these parameters, a large set of experiments was conducted on all available platforms. The results were then carefully evaluated to produce a coherent picture of the factors affecting the performance of the HPC platforms.

From these analyses, the USC team determined that in predicting the performance of algorithms on HPC platforms, the key factor is an accurate cost analysis of data access. The cost of communicating data is heavily affected by the data location. The data may be physically located in the local memory, in a remote processor, or on secondary storage such as a disk. The various possible data locations can be thought of as a data hierarchy. Thus, data may be communicated between processor and memory, between processors, or between secondary storage and the processor. The cost to access data increases dramatically as the data moves down the hierarchy. The USC team's benchmarks measure the cost of accessing data along this hierarchy. From these benchmark results, the USC team has generated a set of parameters and formulated the Integrated Memory Hierarchy (IMH) model. The IMH model provides a uniform set of parameters across multiple platforms that measures the cost of communication from the various levels of the data hierarchy. This allows end-users to evaluate and predict the performance of their algorithms on a particular HPC platform.
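The sketch below indicates, at a user level, how benchmarks of this kind can be written: it times streaming access to local memory for increasing working-set sizes (exposing cache effects) and a simple MPI round trip between two processors as a proxy for remote-access cost. The sizes, repetition counts and output format are illustrative assumptions; the actual USC benchmark suite and IMH parameterization are considerably more extensive.

/* Hedged sketch of user-level measurement of data-access cost at two levels
 * of the hierarchy: local memory and a remote processor.                    */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static volatile double sink;   /* keeps the summation from being optimized away */

static double time_local(long words, int reps)
{
    double *a = malloc(words * sizeof(double));
    double t, sum = 0.0;
    long i;
    int r;

    for (i = 0; i < words; i++) a[i] = (double)i;
    t = MPI_Wtime();
    for (r = 0; r < reps; r++)
        for (i = 0; i < words; i++) sum += a[i];       /* streaming read */
    t = MPI_Wtime() - t;
    sink = sum;
    free(a);
    return t / ((double)reps * (double)words);         /* seconds per word */
}

int main(int argc, char **argv)
{
    int rank, size, i;
    long words;
    double buf[4096], t;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    for (i = 0; i < 4096; i++) buf[i] = 0.0;

    if (rank == 0)                                     /* local-memory level */
        for (words = 1024; words <= (1L << 22); words *= 8)
            printf("local access, %8ld words: %.2e s/word\n",
                   words, time_local(words, 10));

    if (size >= 2) {                                   /* remote-access level */
        t = MPI_Wtime();
        if (rank == 0) {
            MPI_Send(buf, 4096, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, 4096, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("round trip, 4096 words: %.2e s\n", MPI_Wtime() - t);
        } else if (rank == 1) {
            MPI_Recv(buf, 4096, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, 4096, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}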
In order to evaluate the benchmark results and the IMH model, the USC team has optimized a CFD application supplied by an end-user. A detailed explanation of this optimization is included in Section VIII on outreach to users. This optimization effort resulted in a Win-Win situation for the end-user as well as for the USC team. The optimization efforts produced an algorithm with approximately a 5-fold increase in performance using 30 processors compared with the original algorithm. The resulting optimized algorithm remained scalable well beyond 30 processors and was only limited by the size of the data to be processed. The benchmark results and the IMH model allow end-users to evaluate and predict the performance of their algorithms on a particular HPC platform. A uniform set of parameters across multiple platforms allows end-users to make intelligent decisions in selecting the best platform for a given application. Once a platform has been selected, the end-user can use the IMH model to evaluate the performance of their algorithm. Using this evaluation in conjunction with the developed optimization techniques, they can modify and optimize their code to achieve superior performance. The end-user will also be able to estimate the performance of various algorithm alternatives without actual coding, thus greatly reducing the time for developing optimized applications.
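As a rough illustration of how measured per-level access costs can be combined into such an estimate (a generic hierarchical cost model in the spirit of the discussion above, not necessarily the exact IMH formulation):

\[
T_{\mathrm{est}} \;=\; \sum_{\ell \in \{\mathrm{cache},\,\mathrm{memory},\,\mathrm{remote},\,\mathrm{disk}\}} n_\ell \, c_\ell \;+\; W \, t_{\mathrm{op}},
\]

where $n_\ell$ is the number of data items the algorithm moves at hierarchy level $\ell$, $c_\ell$ is the measured per-item access cost at that level, $W$ is the operation count, and $t_{\mathrm{op}}$ is the measured per-operation compute cost. An end-user can estimate $n_\ell$ and $W$ for each candidate algorithm from its data-access pattern and compare the resulting estimates before writing any code.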