Fig. 1: The Performance Prediction Process


Fig. 2: HLAM -- the Hierarchical Application Model

Fig. 3: The Machine Model and Data Movement Primitives

1: Overview

Our goal is to develop methodologies that make it possible to provide approximate predictions of the performance that could be achieved by sophisticated new applications on new high performance architectures. While many of the techniques we develop will apply to any type of application, we will focus on two broad classes of applications. Loosely synchronous adaptive applications include adaptive structured or unstructured multigrid codes, particle methods and multipole codes. Data exploration and data fusion applications are codes that carry out processing, analysis and exploration of one or several very large datasets. This class includes codes that carry out analysis, exploration and fusion of sensor data from sensors located on different platforms (e.g. satellites, aircraft and ships), and codes that carry out analysis and fusion of data from conventional high power microscopy, confocal microscopy, electron microscopy, computerized tomography, and magnetic resonance imaging.

The groups involved in this proposal will leverage their extensive experience with high end applications through the process depicted in Figure 1. We list the following salient features of this process:

  1. Construction of application emulators motivated by loosely synchronous adaptive applications and by data exploration and data fusion applications. An application-emulator is a suite of programs that, when run, exhibits computational and data access patterns that resemble the patterns observed in a particular type of application.
  2. Development of a hierarchical high level application modeling framework (HLAM), shown in Figure 2. The HLAM will include mechanisms that allow users to provide: a) a hierarchical description of the application, b) a high level specification of a target machine, c) procedures that define how the hierarchical application description is mapped to the machine model, and d) a cost model to be used in performance estimation.
  3. Development of a simulation framework (PETASIM) that is able to generate performance predictions using information provided by the high level application modeling framework.
  4. Performance optimization of HLAM and PETASIM. PETASIM will itself be a highly irregular parallel application and it will be necessary to develop an efficient parallel implementation of PETASIM.
  5. Validation of the HLAM/PETASIM modeling process. We will use the application emulators to produce application and machine specifications at varying levels of granularity and then use PETASIM to estimate performance obtained on selected current and future architectures. We will use detailed simulation tools developed at Maryland and at other sites to characterize the performance of the application-emulators on selected current architectures (e.g. IBM SP-2) and on a limited range of future architectures.
  6. Production of PETASIM cost models using various techniques, including detailed simulation tools, instrumented static and runtime compilation, and analytic models.

In the next section we describe this performance prediction process in more detail, while Section 3 explains the application emulation approach and our approach to choosing applications that will test and motivate our project. In the final section, we describe how we will integrate the various components of our activity.

2: The Performance Prediction Process

This project is built around the performance prediction process sketched in Figure 1. The distinctive feature of our approach is the use of machine and problem abstractions which, although less accurate than detailed complete representations, can be expected to be more robust and are well suited to the rapid prototyping needed in the design of new machines, software and algorithms. At the heart of this performance prediction process are two technologies, HLAM and PETASIM. These originate in the work of the Rutgers/UCSB and Syracuse groups respectively, but will be generalized and fully integrated in this project. Further, the HLAM/PETASIM performance prediction system will interface to the application emulators developed at Maryland using a combination of static and runtime compilation techniques from all three groups. We use HLAM to refer generically to the four key inputs to PETASIM, which describe respectively the target machine (Sec. 2.1), the application (Sec. 2.2), the script specifying execution of the application on the machine (Sec. 2.3) and finally the cost model for basic communication, I/O and computation primitives (Sec. 2.4). We discuss these in the following four sections, with the major Section 2.3 describing the basic concept and implementation of PETASIM.

2.1 Target Machine Specification

There is a close relationship between performance modeling and the description of applications needed to allow either the user or a compiler to properly map an application onto parallel systems. In general, reliable performance estimates need the same level of machine description as is needed to specify parallel programs in a way that allows good performance to be obtained when executing on the target machine. This machine description can be either explicit (as in MPI) or implicit, as in an automatic parallelizing compiler, which must essentially use such a machine description to define its internal optimizations of data placement and movement. Thus, to be effective in estimating performance on a target machine, PETASIM must input an architectural description at the same level needed by parallel programming environments. The PetaSoft meetings identified such architectural descriptions as essential in defining future extensions to parallel languages, whether message passing or data parallel. MPI and HPF implicitly treat parallel systems as a three level memory hierarchy (local processor memory, remote memory and disk). As shown in Figure 3, this is inadequate for some current and nearly all expected future high performance systems. Thus an important product of our project will be such a machine description, which we will target at both today's (distributed shared memory) machines and the future designs typified by those examined in the PetaFlop process. The latter looked at "extrapolated conventional", superconducting and processor-in-memory designs, and our proposed specification is appropriate for all three cases [FoxFurm 97]. As explained above, this machine description will be very helpful in developing future parallel programming environments. We expect experience from our project to drive new developments in this field, as we will determine which features of problem and machine are performance critical using (reliable) models of expected complex memory hierarchies rather than waiting for new hardware to become available.

Our proposed machine description in HLAM will allow specification of the number of memory hierarchy levels, their sizes and their data movement costs (latency and bandwidth). A typical example, taken from the PetaSoft discussion, is shown in Figure 3. The primitive machine operations will include collective as well as point-to-point operations and cover both data movement and data replication (as in messaging and cache operation).
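To make this concrete, the fragment below shows one minimal way such a machine description might be written down, sketched in Java; the class names, fields and numeric values are our own illustrative choices, not a fixed HLAM syntax.

    // Sketch of an HLAM-style machine description: a hierarchy of memory
    // levels, each with a capacity and the cost of moving data to the level
    // above it. All names and numbers are illustrative placeholders.
    import java.util.ArrayList;
    import java.util.List;

    class MemoryLevel {
        String name;          // e.g. "cache", "local DRAM", "remote node", "disk"
        long capacityBytes;   // size of this level
        double latencySec;    // startup cost of a transfer to the parent level
        double bandwidthBps;  // sustained bandwidth to the parent level (bytes/s)
        boolean hasCpu;       // true for levels with attached processing units
        List<MemoryLevel> children = new ArrayList<>();

        MemoryLevel(String name, long cap, double lat, double bw, boolean hasCpu) {
            this.name = name; this.capacityBytes = cap;
            this.latencySec = lat; this.bandwidthBps = bw; this.hasCpu = hasCpu;
        }

        // Cost of moving 'bytes' between this level and its parent:
        // the usual latency + size/bandwidth model.
        double transferCost(long bytes) {
            return latencySec + bytes / bandwidthBps;
        }
    }

    public class MachineSpec {
        public static void main(String[] args) {
            // A node with cache and local DRAM, sitting inside a distributed machine.
            MemoryLevel cache  = new MemoryLevel("cache", 1L << 20, 1e-8, 10e9, true);
            MemoryLevel dram   = new MemoryLevel("local DRAM", 1L << 30, 1e-7, 2e9, false);
            MemoryLevel remote = new MemoryLevel("remote node", 1L << 34, 2e-5, 0.3e9, false);
            dram.children.add(cache);
            remote.children.add(dram);
            System.out.printf("1 MB cache<->DRAM transfer: %.2e s%n",
                              cache.transferCost(1 << 20));
        }
    }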

2.2: HLAM -- Hierarchical Application Modeling Framework

Critical to our project will be a high level application modeling framework, HLAM, which is based on existing Rutgers/UCSB research but will be substantially generalized in this project as well as integrated into a powerful simulation system. We believe there are several advantages in using simplified application abstractions rather than the full application. Firstly, it is expensive to simulate full applications - especially large problems on future high performance (PetaFlop) systems. Secondly, use of well chosen abstractions can lead to a better understanding of the key features that determine performance. Finally, abstractions can be generic and easily represent a broader class of applications than any one full application. This is not to say that detailed simulations of full applications lack value - rather, we argue that our high-level abstractions have intrinsic complementary value.

HLAM is illustrated in Figure 2, which shows that we first hierarchically divide an application (sometimes called, in this context, a metaproblem) into modules. A sophisticated parallel application may be composed of several coupled parallel programs. Parallel programs can themselves be viewed as being composed of a collection of parallel modules. These modules may be explicitly defined by a user, or they may be generated by a semi-automatic process such as an HPF compiler. Modules that represent distinct programs may execute on separate nodes of a networked metacomputer. An individual module may be sequential or data parallel. We might use a data parallel module to represent a multi-threaded task that runs on an SMP node. HLAM will cover a wide range of applications, including data-intensive applications (with I/O from remote sites) and migratory Web programs. Some key features of our proposed high level abstraction HLAM are:

  1. A hierarchical graph representation.
  2. Symbolic specification of problems, so that the high level abstraction can be used to test arbitrarily large problem instances.
  3. Use of aggregates as building blocks for specifying application modules in a hierarchical/multi-level manner that includes both task and data parallelism.
  4. Aggregates chosen as the largest units of data parallelism that can specify the problem at the level needed to model performance with the required precision and grain size.
  5. Modeling of various types of data interaction (loose synchronization and other dependencies such as pipelining) between program components, with explicit support of the loosely synchronous structure present in essentially all large scale SPMD data parallel systems.
  6. Modeling of dynamic relationships for runtime adaptive prediction/optimization.

We will provide a convenient Java user interface which will allow the user to specify all four HLAM inputs (machine, problem, execution, cost). Further, as explained later, various techniques such as static and runtime compilation and detailed simulation will "automate" parts of this specification process.
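As an illustration of the shape such a problem specification could take, the sketch below encodes the metaproblem / module / phase / aggregate hierarchy of Figure 2 in Java; the class names and the coupled CFD/combustion example are our own illustrative choices, not HLAM syntax.

    // Illustrative encoding of the HLAM hierarchy of Figure 2: a metaproblem
    // is a collection of coupled modules; a data parallel module is a sequence
    // of phases, each computing a set of aggregates and ending in a
    // synchronization/communication step. Class and field names are ours.
    import java.util.ArrayList;
    import java.util.List;

    class Aggregate {
        String name;   // e.g. a block of grid points or particles
        long bytes;    // data volume (could be a symbolic function of problem size)
        double flops;  // work needed to compute this aggregate once per phase
        Aggregate(String name, long bytes, double flops) {
            this.name = name; this.bytes = bytes; this.flops = flops;
        }
    }

    class Phase {
        List<Aggregate> aggregates = new ArrayList<>();  // all are computed during the phase
        String synchronization;    // e.g. "ghost exchange", "all-reduce", "redistribute"
    }

    class Module {
        String name;
        boolean dataParallel;                        // sequential or data parallel
        List<Phase> phases = new ArrayList<>();
        List<Module> coupledTo = new ArrayList<>();  // task level coupling between modules
    }

    public class MetaproblemExample {
        public static void main(String[] args) {
            // A two-module metaproblem: a CFD code coupled to a combustion code.
            Module cfd = new Module();
            cfd.name = "CFD"; cfd.dataParallel = true;
            Module combustion = new Module();
            combustion.name = "combustion"; combustion.dataParallel = true;
            cfd.coupledTo.add(combustion);

            Phase sweep = new Phase();
            sweep.synchronization = "ghost exchange";
            for (int i = 0; i < 64; i++)
                sweep.aggregates.add(new Aggregate("block" + i, 1 << 22, 5e8));
            cfd.phases.add(sweep);
            System.out.println("CFD module: " + cfd.phases.size() + " phase, "
                               + sweep.aggregates.size() + " aggregates");
        }
    }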

One initial focus will be to develop a hierarchical behavior representation for MPI-based SPMD parallel code. Previous research on representing the parallelism in sequential code will benefit our project. For example, control and data dependence can be modeled through program dependence graphs {PDG}. Hierarchical task graphs {Polychronopoulos} were developed for modeling functional parallelism and loop parallelism. Such a graphical structure has also been studied in the SISAL project for functional parallelism {SISAL}. For modeling MPI-based parallel programs, we will abstract not only hierarchical control structures, but also important multiprocessing events such as message sending, receiving, reduction, global communication and barriers. Thus the graphical structure of an MPI program will consist of basic computation components, communication and I/O primitives, and multi-level control over these components. A basic computation is a code segment that involves no I/O or multiprocessor communication. Basic computation blocks are modeled at a coarse-grain level where possible so that the performance impact of the multi-level memory hierarchy can be studied at the block level. Computation primitives from software building blocks such as the BLAS and LAPACK math subroutines can be used to abstract basic computations.
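The sketch below shows one possible form of such a hierarchical graph, assuming a representation along these lines; the node kinds, fields and the Jacobi-style example are illustrative only, not a fixed design.

    // Sketch of a hierarchical graph for an MPI-based SPMD program: leaves are
    // basic computation blocks or communication/I/O primitives, interior nodes
    // carry the multi-level control structure. All names are illustrative.
    import java.util.ArrayList;
    import java.util.List;

    enum NodeKind { COMPUTE, SEND, RECV, REDUCE, COLLECTIVE, BARRIER, IO, LOOP, SEQUENCE }

    class ProgramNode {
        NodeKind kind;
        String label;          // e.g. "stencil update" or "MPI_Allreduce"
        double flops;          // estimated work for COMPUTE leaves
        long messageBytes;     // data volume for communication/I/O leaves
        int tripCount = 1;     // iteration count for LOOP nodes (possibly symbolic)
        List<ProgramNode> children = new ArrayList<>();  // control hierarchy

        ProgramNode(NodeKind kind, String label) { this.kind = kind; this.label = label; }
    }

    public class MpiGraphExample {
        public static void main(String[] args) {
            // One Jacobi iteration: stencil update, ghost-cell exchange,
            // then a global reduction to test convergence.
            ProgramNode loop = new ProgramNode(NodeKind.LOOP, "Jacobi iterations");
            loop.tripCount = 1000;

            ProgramNode body = new ProgramNode(NodeKind.SEQUENCE, "iteration body");
            ProgramNode update = new ProgramNode(NodeKind.COMPUTE, "stencil update");
            update.flops = 5e8;
            ProgramNode exchange = new ProgramNode(NodeKind.SEND, "ghost exchange");
            exchange.messageBytes = 1 << 16;
            ProgramNode residual = new ProgramNode(NodeKind.REDUCE, "MPI_Allreduce residual");
            residual.messageBytes = 8;

            body.children.add(update); body.children.add(exchange); body.children.add(residual);
            loop.children.add(body);
            System.out.println("root: " + loop.label + " x" + loop.tripCount);
        }
    }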

2.3: PETASIM -- A Performance Simulator for Parallel Hierarchical Memory Computers

2.3.1: Introduction and Motivation

Central to this proposal is a performance simulator, PETASIM, which is aimed at supporting the (conceptual and detailed) design phases of parallel algorithms, systems software and hardware architectures. It was originally designed as a result of the two week-long workshops - PAWS and PetaSoft - aimed at understanding hardware and software architectures ten years from now, when Petaflop ($10^{15}$ operations per second) scale computing can be expected. It was clear from these meetings that the community needed better aids for performance estimation, as the eight groups (and eight different machine designs) present found it difficult to compare designs and in particular differed by a factor of a million [PAWS report] in estimating the performance of even a set of extremely simple algorithms - the PetaKernels - on their new designs. The most sophisticated PetaKernel was a regular finite difference problem solved by simple Jacobi iteration, which is of course well understood. These workshops emphasized the need for tools that allow one to describe complex memory hierarchies (present in all future and most current machine designs) and the mapping of problems onto them in a way that allows reliable (if high level) performance estimates in the initial design and rapid prototyping stages.

PETASIM is aimed at a middle ground, half way between detailed instruction level machine simulation and simple "back of the envelope" performance estimates. It takes care of the complexity - memory hierarchy, latencies, adaptivity and multiple program components - which makes even high level performance estimates hard. It uses a crucial simplification - dealing with data in the natural blocks (called aggregates in HLAM) suggested by memory systems - which both speeds up the performance simulation and in many cases will lead to greater insight into the essential issues governing performance. We motivate and illustrate the design of PETASIM by the well known formulae for the parallel performance of simple regular applications on nodes without significant memory hierarchy. Then (Chapter 3 of [PCW]) one finds,

Speedup = Number of Nodes / (1 + Overhead)

where Overhead $\propto$ (Grain Size)$^{-g}$ ($t_{comm} / t_{calc}$)

Here the natural data block size is the Grain Size, i.e. the number of data points in each node. The power g measures edge over area effects and is 1/d for a system of geometric dimension d. The ratio $t_{comm} / t_{calc}$ represents the ratio of communication to compute performance of the hardware. Such a formula shows the importance of identifying the natural data block, and how such high level analysis allows one to understand the relation of performance to the memory sizing, I/O and compute capabilities of the hardware. PETASIM generalizes this "back of the envelope" estimate to more complex problems and machines. It also includes not just primitive messaging performance (node to node, as used in the above estimate) but also collective (such as multicast) mechanisms, which are present in most applications but ignored in many simple analyses. Note that the simple performance estimate above is valid (with straightforward generalizations) on machines with a simple two level distributed memory hierarchy - namely, memory is either on or off processor - which is essentially the model built into the current generation of parallel programming systems as typified by HPF or MPI. As we described in Section 2.1, it is essential to generalize this machine model whether we want to provide input to parallel programming tools or to performance estimation systems. Thus we believe our experience with HLAM and PETASIM will be very valuable in helping to design the new generation of parallel programming environments needed for the increasingly complex high performance systems coming online.
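As a concrete check of this estimate, the fragment below evaluates the formula for a hypothetical 2D stencil (d = 2, so g = 1/2); the grain size, node count, constant of proportionality and hardware ratio are illustrative values only.

    // Worked instance of the overhead/speedup estimate above for a 2D stencil:
    // Overhead = c * grainSize^(-1/d) * (tcomm/tcalc).
    // The constant c and the hardware ratio are illustrative values only.
    public class SpeedupEstimate {
        public static void main(String[] args) {
            int d = 2;                      // geometric dimension
            double grainSize = 250_000;     // grid points per node
            double tcommOverTcalc = 10.0;   // communication/computation cost ratio
            double c = 1.0;                 // geometry-dependent constant of proportionality
            int nodes = 1024;

            double overhead = c * Math.pow(grainSize, -1.0 / d) * tcommOverTcalc;
            double speedup = nodes / (1.0 + overhead);
            System.out.printf("overhead = %.3f, speedup on %d nodes = %.1f%n",
                              overhead, nodes, speedup);
        }
    }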

Fortunately, we have good evidence that we can generalize this naïve analysis to more complex problems and machines, for the Rutgers/UCSB group has studied granularity issues {cluster,dsc} to identify natural data block sizes and computation clusters based on computation/communication ratios in more general hierarchical memories. They have developed preliminary performance prediction and optimization tools (PYRROS {PYRROS}, D-PYRROS {jjiao}, RAPID {RAPID}) based on task graphs, in which the impact of the single-processor memory hierarchy is addressed at the intra-task level and the impact of inter-processor communication delay is identified at the inter-task level. These techniques have been shown to be effective for a number of adaptive and static applications {Iter-2,ship,Fernandez,SparseLU}, and they will be naturally integrated into PETASIM, as this Rutgers/UCSB technology is the basis of the HLAM described in Section 2.2.

As well as a machine description, PETASIM requires a problem description, for which we will use the HLAM described in Section 2.2 above. Note that the application emulators use the same "aggregate" description of applications that PETASIM needs, and these will be used as input. It is not yet clear whether we will convert these directly to HLAM or allow a separate API for this style of program description.

2.3.2: Operation of PETASIM

As described above, PETASIM defines a general framework in which the user specifies the computer and problem architectures and the primitive costs of I/O, communication and computation. The computer and problem can in principle be expressed at any level of granularity - typically the problem is divided into aggregates which fit into the lowest interesting level of the memory hierarchy exposed in the user specified computer model. Note that the user is responsible for deciding on the "lowest interesting level", and the same problem/machine mapping can be studied at different levels depending on which part of the memory is exposed for user (PETASIM) control and which (lower) parts are assumed to be under automatic (cache) machine control. The computer and problem can both be hierarchical, and PETASIM will support both numeric and data intensive applications. Further, both distributed and shared memory architectures, and mixtures of the two, can be modeled.

We assume the HLAM model shown in Figure 2, where we view an application as a metaproblem which is first divided into task parallel modules. Each module is either sequential or data parallel and is further divided into aggregates defined by the memory structure of the target machine. The multi-level module structure in HLAM will be treated conventionally as an event driven simulation in PETASIM. However, the large scale data parallel parts (collections of aggregates) will be treated differently, by a novel mechanism that exploits the loosely synchronous SPMD structure of this part of the problem.

Synchronous, loosely synchronous (and the typically easier embarrassingly parallel) data parallel modules will be supported. As shown in Figure 2, these are divided into phases - in each phase all aggregates are computed - with synchronization (typically also communication or redistribution) at the end of each phase. Modules must have a fixed distribution into aggregates during each phase but can be redistributed at phase boundaries. Aggregates are the smallest units into which we divide the problem and would typically be blocks of grid points or particles, depending on the problem.

The basic idea in PETASIM is to use a sophisticated spreadsheet. Each cell of the spreadsheet represents a memory unit of the computer, and the cells are generated from the hierarchical machine description of Section 2.1. For each phase of the module simulation, the user provides a strategy - called the execution script in Figure 1 - which moves the module aggregates through the cells. The simulator accumulates the corresponding communication and computation costs. Some special cells (e.g. those corresponding to the lowest exposed level of the memory hierarchy) have CPUs attached to them, and aggregates are computed when they land on a CPU cell. The phase terminates when all aggregates have been computed. Note that a "real computer" would determine this cell stepping strategy from the software supplied by the user and the action of the hardware. The simulator will supply a helpful framework, but the user is responsible for understanding both the problem and the machine well enough to supply this cell stepping execution script. One could expect to gradually automate parts of this in future activities. Note that in general there are many more memory cells than aggregate objects. Usually the number of module aggregates is greater than or equal to the number of cells with CPUs. Note the special case of a distributed memory machine with no memory hierarchy at each node: then the number of cells, CPUs and module aggregates are all equal, and the cell stepping strategy involves one step in which all objects are computed. In this case, PETASIM will reproduce results like the simple "back of the envelope" results quoted above.
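A highly simplified sketch of one such phase simulation is given below, assuming cell and aggregate structures along the lines just described; all class names and cost formulas are our own, and overlap of communication with computation is ignored.

    // Minimal sketch of one PETASIM phase: an execution script moves aggregates
    // from cell to cell; each move accumulates communication cost, and an
    // aggregate is charged compute cost when it lands on a CPU cell.
    // The phase ends when every aggregate has been computed.
    import java.util.*;

    class Cell {
        String name;
        boolean hasCpu;
        double flopsPerSec;                              // meaningful only when hasCpu is true
        Map<Cell, Double> latency = new HashMap<>();     // per-neighbor startup cost (s)
        Map<Cell, Double> bandwidth = new HashMap<>();   // per-neighbor bandwidth (bytes/s)
        Cell(String name, boolean hasCpu, double flopsPerSec) {
            this.name = name; this.hasCpu = hasCpu; this.flopsPerSec = flopsPerSec;
        }
    }

    class Agg {
        long bytes; double flops; boolean computed;
        Agg(long bytes, double flops) { this.bytes = bytes; this.flops = flops; }
    }

    class Move {                     // one step of the user-supplied execution script
        Agg agg; Cell from; Cell to;
        Move(Agg a, Cell f, Cell t) { agg = a; from = f; to = t; }
    }

    public class PhaseSimulator {
        // Accumulate the cost of one phase given the script of moves.
        static double simulatePhase(List<Move> script) {
            double commCost = 0, compCost = 0;
            for (Move m : script) {
                commCost += m.from.latency.get(m.to) + m.agg.bytes / m.from.bandwidth.get(m.to);
                if (m.to.hasCpu && !m.agg.computed) {    // compute on arrival at a CPU cell
                    compCost += m.agg.flops / m.to.flopsPerSec;
                    m.agg.computed = true;
                }
            }
            return commCost + compCost;   // serial accumulation; no communication overlap
        }

        public static void main(String[] args) {
            Cell dram = new Cell("DRAM", false, 0);
            Cell cpu  = new Cell("CPU+cache", true, 1e9);
            dram.latency.put(cpu, 1e-7); dram.bandwidth.put(cpu, 2e9);
            Agg block = new Agg(1 << 20, 2e7);           // one block of grid points
            double t = simulatePhase(List.of(new Move(block, dram, cpu)));
            System.out.printf("estimated phase time: %.4e s%n", t);
        }
    }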

We will provide both C and Java versions of the simulator, while the user interface will be developed as a Java applet. The visualization of the results will use a set of Java applets based on extensions of NPAC's current Java interface to Illinois's Pablo performance monitoring system [Dincer 97].

Note that any level of fidelity can be used, but the user is responsible for estimating the communication costs and the compute cost incurred when an aggregate arrives at a CPU cell. These estimates would include the effect of lower memory levels in the hierarchy that are not modeled directly. As described above and illustrated in Figures 2 and 3, modules and computers are both specified as a set of units labeled by a hierarchy level and a position (a one or multidimensional index) within a given level. The user must specify the linkage between these units with an appropriate associated function (or simply a weight) which calculates the communication performance between units in the computer model and the needed message traffic for the problem model. Section 2.4 gives more details on the estimation of the performance cost functions of the different components needed by PETASIM. The initial system will support "flat" problems, but general hierarchies for problems and computers are essential and will be implemented in the second year of the project.

2.3.3: Parallel Execution of PETASIM

Typically PETASIM can be executed on a sequential machine, as it is aimed at relatively crude estimation. However, if for instance accurate modeling of a cache requires a small aggregate size (and a correspondingly large number of aggregates), the performance estimate may become computationally very expensive. Very adaptive problems could lead to a similar situation. In such cases we may need to use parallel simulation, and so the operation of PETASIM needs to be parallelized. PETASIM involves a combination of both data parallel and event driven simulations. The latter, we know, is quite hard to parallelize {Maisie, GIT}, and we expect that it will be sufficient to parallelize just the simulation of the data parallel modules. Here we can draw on the extensive experience of all the collaborating institutions, including their experience with dynamic irregular parallel computation techniques, which will be needed when estimating the performance of applications with these characteristics.

2.4: Estimation of Performance Cost Functions

2.4.1: General Approach

The cost abstraction for performance prediction will be carried out for each primitive operation. For computation components, an average cost function is estimated using processor and cache/memory parameters. For example, BLAS and LAPACK primitive performance cost functions will be studied in detail; previous work at Rutgers/UCSB includes some preliminary studies on this topic. For standard communication and I/O primitives, cost functions are determined from device/network performance parameters and the size of the data communicated.
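For instance, under the usual linear cost model assumptions, the primitive cost functions might take the form sketched below; the functional forms are standard, but the constants are placeholders to be fitted from profiling or simulation rather than measured values.

    // Illustrative primitive cost functions of the kind PETASIM consumes.
    // Communication follows the standard latency + size/bandwidth model;
    // computation uses a per-operation rate degraded by a cache-miss term.
    // All constants are placeholders, not measurements.
    public class CostFunctions {
        // Point-to-point message of 'bytes' over a link (latency in s, bandwidth in bytes/s).
        static double commCost(long bytes, double latency, double bandwidth) {
            return latency + bytes / bandwidth;
        }

        // A BLAS-like kernel of 'flops' operations at peak rate 'flopsPerSec',
        // slowed by an estimated miss ratio and per-miss penalty over 'memRefs' accesses.
        static double computeCost(double flops, double flopsPerSec,
                                  double missRatio, double missPenaltySec, double memRefs) {
            return flops / flopsPerSec + missRatio * memRefs * missPenaltySec;
        }

        public static void main(String[] args) {
            System.out.printf("64 KB message: %.2e s%n", commCost(64 * 1024, 2e-5, 0.3e9));
            System.out.printf("DGEMM-like block: %.2e s%n",
                              computeCost(2e8, 1e9, 0.02, 1e-7, 3e7));
        }
    }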

We will further model the data communication structure between the modules (of figure 2) seen in multidisciplinary and distributed applications. This coarse grain structure is typically not too sensitive to architectural details and we do not expect major difficulties here.

2.4.2: Analytic and Compiler Generation

The cost modeling for individual components requires user insight, the simple measurements described above, or symbolic inference through compilation of appropriate application emulators. For example, average program cost estimation is studied in {Sarkar}. Computation and communication cost estimation for loop programs is studied in {Consnard}, based on symbolic integer point counting techniques. Neither of these has addressed the impact of multi-level caching, and further research will be needed. For some parts of a code, it may be difficult for a compiler or a user to provide an accurate prediction. As described in the following section, runtime profiling and full simulation of some critical parts of the code will lead to a relatively accurate performance abstraction for these parts, improving the accuracy of the overall performance prediction. Sometimes the worst case performance estimate from a static compiler analysis can be overly pessimistic in predicting performance in practice. On the other hand, average performance can be very difficult to obtain using symbolic analysis. Runtime profiling based on a set of carefully selected instances can be used to study the average-case performance of components where the symbolic analysis is unduly pessimistic.

As shown in Figure 1, we will also use runtime compilation to generate input to PETASIM. This will help both in defining appropriate aggregates and in estimating the associated cost functions. Here we use results from the ARPA-sponsored Parallel Compiler Runtime Consortium (PCRC), in which both Maryland and Syracuse participate. This has generated a set of libraries implementing essential data movement and computation primitives used by parallel C++, HPF and preliminary versions of parallel Java compilers. These libraries cover both regular (HPF1) and adaptive irregular problems (HPF2 extensions). We intend to use the PCRC libraries in the application emulation route of Figure 1 both to motivate the data movement primitives in HLAM and to provide cost functions by runtime profiling of their execution.

Alternatively, one can use a deductive analysis process which involves test case generation and performance summarization using statistical methods. The key research issues are the generation of proper test instances based on the symbolic structure of an application, and the cost function abstraction based on the statistical analysis. The deductive analysis techniques can be applied to individual components at runtime, and can also be applied to the entire application for selecting proper test instances and providing a meaningful performance summary.

2.4.3: Use of Hardware Modeling and Simulation

For a large-scale tera/petaflop-level application, we do not intend to conduct complete simulations for entire applications on current or future parallel machines. Instead, we plan to use detailed simulation and/or runtime profiling only for the performance-critical segments of the application. We plan to use the results of detailed simulation for two purposes: (1) to develop the cost functions used by PETASIM and (2) to verify the predictions generated by PETASIM.

This is similar to the approach we have used in Howsim {Howsim}, our coarse-grain simulator for I/O-intensive tasks on workstation clusters. We developed Howsim for the evaluation of architectural and OS policy alternatives for I/O-intensive tasks. Accordingly, Howsim simulates I/O devices (storage and network) and the corresponding OS software at a fairly low level and the processor at a fairly high level. To obtain the hardware and operating-system cost functions needed for Howsim, we profiled a small set of micro-applications which exercised specific hardware and OS functionality on the IBM SP-2 and a cluster of DEC Sable workstations. This approach has worked well for Howsim. For example, for the SP-2, Howsim was able to model the application-level network bandwidth across a seven order of magnitude range in message size; the error for most message sizes was 2-6%. For the Sable cluster, Howsim was able to model the application-level network bandwidth within an error of 10% for message sizes up to 1 MB.

We do not intend to develop or significantly extend detailed hardware simulators. Instead, we plan to use (and possibly integrate) existing hardware and complete system simulators such as SimOS {SimOS}, Proteus {Proteus}, Mint {Mint}, Howsim {Howsim} and Trojan {Trojan}. We expect to start with SimOS and integrate other simulators as needed. SimOS provides detailed models for shared-memory multiprocessors and can simulate highly realistic application workloads with acceptable slowdown. It provides a fairly good interface for adjusting the hardware configuration, such as the number of processors, clock speed, cache parameters, and memory and disk system parameters. SimOS is geared towards shared-memory machines and does not emphasize network simulation. Other simulators, however, do provide good network models; for example, Proteus provides detailed simulation of k-ary n-cube networks and Howsim simulates point-to-point networks. We plan to integrate these simulators (and possibly others) on an as-needed basis.

3. Emulating Families of High End Applications

An application emulator is a suite of programs that, when run, exhibits computational and data access patterns that resemble the patterns observed in a particular type of application. We will construct two application emulators motivated by loosely synchronous adaptive applications and by data exploration and data fusion applications. As described above, application emulators will be used to validate the HLAM/PETASIM modeling process. We will use the application emulators to produce application and machine specifications at varying levels of granularity and then use PETASIM to estimate performance obtained on selected current and future architectures. We believe that our application emulators address some of the key applications targeted at future very high-end architectures. The application emulators will be shared with the performance modeling community. We believe that general availability of application emulators will help the community focus attention on crucial application classes.

3.1 Irregular Adaptive Scientific Applications

We will develop an application emulator to model the performance characteristics of three classes of irregular adaptive scientific computations, along with coupled versions of multiple instances of any of these classes. The targeted computation classes are

  1. adaptive unstructured codes (e.g. unstructured multigrid solvers, integro-differential equation solvers and molecular dynamics codes),
  2. structured adaptive codes, (e.g. adaptive multigrid algorithms), and
  3. particle codes (e.g. Direct Simulation Monte Carlo, Rokhlin-Greengard or Barnes-Hut Fast Multipole codes, particle-in-cell codes).

The Maryland, Syracuse and Rutgers groups have extensive experience with each of these application classes. The three sites have long histories of developing libraries, compilers and runtime support for high performance parallel implementations of application codes from these areas~\cite{CHOUDHARY92,DAS94,FOX94,GERASOULIS95A,HWANG95B,MOON95,NANCE95}. Maryland has also taken the lead in developing systems software (Maryland's Meta-Chaos library~\cite{EDJLALI97}) to couple separately developed applications in order to carry out complex physical simulations - for example, a combustion code coupled to a CFD code, so that complex chemical reactions can be modeled at fine scales using one application code and fluid flows at perhaps much coarser scales using another. The Syracuse group is playing a major role in both the physics and computer science aspects of an NSF Grand Challenge - the simulation of the collision of two black holes. This is built around adaptive mesh finite difference solutions of Einstein's equations. Our application emulators will leverage this extensive experience in a wide range of regular and irregular, static and dynamic applications.

The use of Maryland's Meta-Chaos software will greatly facilitate the construction of the application-emulator. Our application emulator will make it possible to model the performance of multiple coupled parallel applications running on distributed sets of multiprocessors and/or networked workstations.

Another example of complex physical simulation is ship hull design. New advanced physics-based methods for the simulation of ship responses in a seaway and for the simulation of the viscous flow field around the hull have made it possible to explore completely new hull forms. DARPA has been supporting the development of a new Arsenal Ship design with an entirely new hull offering superior resistance and seakeeping characteristics. The LAMP codes (Large Amplitude Motions Program) are a class of multi-level physics codes for the simulation of wave-body interactions in ship motion. The codes have been developed by the ship design division of SAIC. The LAMP codes are based on 3D integro-differential equation approximations and iterate over time steps. The computations for the kernel approximations are dynamic and irregular.

The LAMP codes provide an excellent testbed application for future parallel systems and software. They have large memory requirements because the history of the ship movement needs to be stored to predict future movements and forces on the ship. Furthermore, the more physically accurate high level codes (LAMP-4 and above) require teraflop and petaflop performance to be used in an integrated design environment (LAMP-4 takes one CPU month on a single CRAY C90 processor using 500 ship panels and 30 minutes of simulated ship movement; a typical 100-candidate design simulation requires about $10^6$ Teraflops). Under the HPCD project, Rutgers is working with SAIC on a scalable version written in MPI and on more robust LAMP codes. The LAMP codes are representative of a larger set of physical codes for multi-level ship design methods. For these classes of applications, the application emulators must model the varying quantities and patterns of computation from the application codes, including the data dependent nature of the computations. This means that the emulators must parameterize the behavior of the data access patterns of the applications, to model both the interprocessor communication within one application and the interprocessor communication generated through the coupling of multiple applications to form a complete distributed physical simulation.

3.2 I/O intensive applications

We will develop an application emulator that will reproduce application characteristics found in many defense and high-end civilian applications that involve sensor data analysis, sensor data fusion and real time sensor data processing. We are focusing on emulating application scenarios that will be of practical relevance in a 5 to 15 year time frame.

One of the motivating applications (the Virtual Microscope), under development at Johns Hopkins, involves the development of software that makes it possible to carry out a realistic digital emulation of a high power light microscope. Raw data is captured by scanning collections of full microscope slides under high power. For each microscope slide, high power images are captured at multiple focal planes. Once the images are captured, this application will make it possible to emulate the behavior of a microscope by allowing users to continuously move the stage and to simulate changing magnification and focus. The Virtual Microscope is being designed so that it will be possible to achieve an interactive level of response as the same dataset is simultaneously explored by multiple users. In the ongoing Johns Hopkins application effort, the Virtual Microscope will be coupled to computation modules that carry out:

  1. three dimensional image reconstruction from data found in multiple focal planes and on multiple microscope slides,
  2. image registration and compositing that take into account data obtained using special stains that reveal the presence or absence of biochemical markers,
  3. image segmentation and pattern recognition to better characterize known malignancies, and
  4. applications that aid the pathologist in screening for possible malignancy.

In the proposed effort, we will develop an application emulator that prototypes the performance characteristics associated with the current application suite along with the performance characteristics that will be associated with more ambitious future projects. The future projects will involve integration of additional sensor modalities along with associative retrieval of sensor data obtained from related cases. The sensor modalities that would be involved in a sensor data fusion effort include

  1. different radiological imaging modalities such as CT, MRI and PET,
  2. electron microscopy,
  3. confocal microscopy, and
  4. conventional high power light microscopy.

The application emulator will be coded as a stripped-down application suite that runs on distributed collections of multiprocessors and networked workstations.

In the development of our application emulator, we will make the realistic assumption that sensor data is stored in multiple data repositories. Each data repository will consist of a multiprocessor architecture along with secondary and tertiary storage. The application emulator will carry out varying quantities and patterns of computation (computational patterns will be abstracted from our application codes). The processing carried out at a particular repository will result in a variable degree of data size reduction. Sensor data fusion will be emulated by prototyping processing algorithms that take as inputs processed data sets produced by several repositories.

The application emulator will emulate this ambitious sensor data fusion application suite in a parameterized fashion. Parameter adjustment will make it possible to use the emulator to model various application scenarios. The behavior we emulate will include computation, secondary storage accesses, tertiary storage accesses, remote object invocations, and program migration between processing nodes.
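One possible shape for this parameterization is sketched below; the parameter names and values are our own placeholders and would be refined against the real application suite.

    // Illustrative parameter set for the sensor data fusion emulator: each
    // repository couples a multiprocessor with secondary/tertiary storage, and
    // the workload is described by data volumes and reduction factors.
    class Repository {
        String name;
        int processors;
        long secondaryStorageBytes;
        long tertiaryStorageBytes;
        double dataReductionFactor;   // output size / input size after local processing
    }

    public class FusionEmulatorParams {
        public static void main(String[] args) {
            Repository mri = new Repository();
            mri.name = "MRI archive";
            mri.processors = 64;
            mri.secondaryStorageBytes = 1L << 42;   // ~4 TB of disk
            mri.tertiaryStorageBytes = 1L << 47;    // ~128 TB of tape
            mri.dataReductionFactor = 0.01;         // local analysis keeps ~1% of raw data

            long rawBytesPerQuery = 1L << 33;       // ~8 GB pulled per fusion request
            long shipped = (long) (rawBytesPerQuery * mri.dataReductionFactor);
            System.out.println("bytes shipped to the fusion step per query: " + shipped);
        }
    }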

4. Project Integration

In this section, we bring our activities together into the integrated approach outlined in Figure 1. We must obviously make choices from the many different applications and target machines in order to give the project focus. We will work with DARPA and the HPCC community in choosing exemplars that are synergistic with other activities. For the near future, the machines chosen by ASCI are a natural baseline choice; for a ten year horizon, we have the target architectures from the PetaFlop process, which were reported at the Frontiers '96 conference (whose proceedings will be available soon). We have already described the broad classes of applications we expect to study in Section 3, but here again we can refine this choice. After these initial global activities, we will ramp up the three major tasks in Figure 1 - development of the application emulators, HLAM and PETASIM. These will naturally lead to the detailed simulation and runtime compilation activities.

We expect that it will be straightforward to collaborate as a team, for we have extensive experience doing this in other projects. As well as the standard (and increasingly sophisticated) electronic mechanisms, we expect that the team will need to hold meetings on a semi-annual basis. These will be especially important at the start, when more frequent get-togethers may be necessary -- some of these will be at DARPA PI meetings.

The success of the project will be measured by the reliability of its performance estimates and the general utility of its technology. We will measure and enhance this in years 2 and 3 through proactive outreach. This will involve looking at new applications from the community which were not in our original suite. Secondly, we will link PETASIM and the HLAM interface to the Web so that remote users can access and test our technology. All our user interfaces will be implemented in Java. We also expect that our work will have great value in education, as our application, machine and execution abstractions are designed to capture the essence of a problem. Thus we intend to use our technology in the parallel computing courses that we teach on an ongoing basis at the participating universities. We also expect that the NSF, DoD and DoE HPCC training activities will be able to follow this lead and use our work to show beginning users why parallel computing applications get the performance they do.

References:

[PAWS report] "The Petaflops Systems Workshops", Proceedings of the 1996 Petaflops Architecture Workshop (PAWS), April 21-25,1996 and Proceedings of the 1996 Petaflops System Software Summer Study (PetaSoft), June 17-21, 1996, edited by Michael J. MacDonald (Performance Issues are described in Chapter 7).

[PCW] "Parallel Computing Works!", Geoffrey C. Fox, Roy D. Williams, and Paul C. Messina, Morgan Kaufmann, 1994.

[Dincer 97] "Using Java in the Virtual Programming Laboratory: A web-Based Parallel Programming Environment", Kivanc Dincer and Geoffrey Fox, to be published in special issue of Concurrency: Practice and Experience on Java for Science and Engineering Computation.

[FoxFurm 97] "Computing on the Web -- New Approaches to Parallel Processing -- Petaop and Exaop Performance in the Year 2007", Geoffrey Fox and Wojtek Furmanski, submitted to IEEE Internet Computing, http://www.npac.syr.edu/users/gcf/petastuff/petaweb/

[SimOS] M. Rosenblum, S. Herrod, E. Witchel, A. Gupta, "Complete Computer System Simulation: The SimOS Approach", IEEE Parallel and Distributed Technology: Systems and Applications, 3(4), Winter 1995, pp 34-43.

[Proteus] E. Brewer, A. Colbrook, C. Dellarocas, W. Weihl, "Proteus: A High-Performance Parallel Architecture Simulator", Performance Evaluation Review, 20(1) Jun 1992, pp 247-8.

[Trojan] D. Park, R. Saavedra, "Trojan: A High-Performance Simulator for Shared Memory Architectures", Proceedings of the 29th Annual Simulation Symposium, April 1996, pp 44-53.

[Mint] J. Veenstra, R. Fowler, "MINT: A Front-end for Efficient Simulation of Shared-Memory Multiprocessors", Proceedings of International Workshop on Modelling, Analysis and Simulation of Computer and Telecommunication Systems. Feb 1994, pp 201-7.

[Howsim] M. Uysal, A. Acharya, R. Bennett, J. Saltz, "A Customizable Simulator for Workstation Networks", To appear in the International Parallel Processing Symposium, April 1997.