From: Tomasz Haupt <haupt@npac.syr.edu>
Date: Wed, 09 Jul 1997 16:06:07 -0400
To: j_nieplocha@pnl.gov
CC: dbc, bernhold
Subject: ACTS proposal

Hi,

this is the latest version of the proposal. I am still working on the references. I think Bryan and David can add to it. In particular, I would like Bryan to put a little more "meat" into the very last section, which deals with the HPF runtime; I think what I wrote is a little vague.

Tom

----------------------------------------------------------------------

Extending the Scientific Template Library with a Shared-Memory NUMA Toolkit

1. Introduction

2. Overview of GA

(What is the purpose of this tool? What aspects of the code design/development/porting process will it impact the most? How much training/experience is needed to use it effectively? Who has successfully used it already (specific pointers)?)

2.1 Global Array NUMA Programming Model

All scalable parallel computers feature a memory hierarchy in which some locations are "closer" to a particular processor than others. The hardware in a particular system may support a shared-memory or message-passing programming model, but these factors affect only the relative costs of local and remote accesses, not the system's fundamental Non-Uniform Memory Access (NUMA) characteristics. Yet while the efficient management of memory hierarchies is fundamental to high performance in scientific computing, existing parallel languages and tools provide only limited support for this management task [Velde,NATO]. They have focused primarily on parallelism, and more specifically on the issues of control flow, communication structures, and load balancing. Memory-related issues such as data locality and the variable cost of accessing data have been largely ignored. Recognizing this deficiency, we developed abstractions and programming tools that facilitate the explicit management of memory hierarchies by the programmer, and hence the efficient programming of scalable parallel computers. The abstractions comprise local arrays, global (distributed) arrays, and disk resident arrays located on secondary storage. The tools comprise the Global Arrays (GA) library [1, 2, 3], which supports the transfer of data between local and global arrays, and the Disk Resident Arrays library [DRA], for transferring data between global and disk resident arrays.

The two predominant programming models for MIMD concurrent computing are message passing and shared memory. Message passing assumes a distributed-memory model in which distinct processes each have their own "local" data and share data only through cooperative communication. A process can access its own local data directly, but access to remote data requires the cooperation of the process that owns the data. The remote process must send the required data in an explicit message, and hence must know which piece of data is needed by which process and when. This requirement makes the message-passing model hard to use for irregular problems and for applications that use dynamic load balancing.
This is because the coordination of a large number of processes that operate on uneven chunks of data, or that require access to remote data at irregular time intervals, increases algorithmic complexity and magnifies the associated programming effort. The recently proposed MPI-2 one-sided communication model [MPI2] addresses this issue to some degree, but its progress rules can be too restrictive for some applications. Despite these programming difficulties, the message-passing paradigm's memory model maps well to the distributed-memory architectures used in scalable MPP systems. Because the programmer must explicitly control data distribution and is required to address data locality issues, message-passing applications tend to execute efficiently on such systems.

In the shared-memory programming model, data is located either in "private" memory (accessible only to a specific process) or in "global" memory (accessible to all processes). In some shared-memory systems, global memory is accessed in the same manner as local memory. Systems based on this approach may rely on hardware or operating-system support to recognize load and store operations that reference non-local memory (e.g., SGI Origin-2000, Convex/HP SPP), or they may use purely software-based approaches, as in the various distributed shared memory libraries, for example TreadMarks [Treadmarks] or Midway [midway]. In other shared-memory systems, global memory is accessed through distinguished mechanisms, such as language constructs [Linda, SplitC, CCpp], special user-defined operations [Orca], or library functions [GA, DOLIB]. A disadvantage of many shared-memory models is that they do not expose the NUMA memory hierarchy of the underlying distributed-memory hardware. Instead, they present a flat view of memory, making it hard for programmers to understand how data access patterns affect performance or to exploit data locality. Hence, while the programming effort involved in application development tends to be much lower than in the message-passing approach, the achieved performance is usually less competitive.

The Global Arrays NUMA model combines the better features of message passing and shared memory: the distributed-memory view of the message-passing model; one-sided access to remote data in the spirit of the shared-memory paradigm; explicit control over data distribution and data locality, together with access to distribution and mapping information; recognition of the memory hierarchy and of the performance differences between accesses to its distinct layers; and message passing as a subset (to support algorithms that require synchronization on data transfer). This combination leads to both simple coding and efficient execution for a class of applications that operate on large distributed dense arrays, require dynamic load balancing, or exhibit unpredictable data reference patterns. The primary mechanisms provided by GA for accessing data are copy operations that transfer data between layers of the memory hierarchy, namely global memory (distributed array) and local memory. In addition, each process can directly access the data held in the global array sections that are assigned to that process. Locks and other atomic operations are also provided; they can be used to implement synchronization and to assure the correctness of accumulate operations (floating-point sum reductions that combine local and remote data) executed concurrently by multiple processes on overlapping array sections.
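To make these access mechanisms concrete, the sketch below shows one process fetching a (possibly remote) patch of a global array into local memory, updating it, and merging the result back with an atomic accumulate. It is a minimal sketch only, assuming the 2-D Fortran interface described in [1,3]; include-file names, initialization calls, and argument details vary between GA releases and installations.

      program ga_sketch
c     Minimal sketch of the GA access mechanisms described above,
c     assuming the 2-D Fortran interface of [1,3]; include-file names,
c     initialization calls, and argument lists may differ by release.
      implicit none
      include 'mafdecls.fh'
      include 'global.fh'
      integer n, ld
      parameter (n = 1000, ld = 10)
      integer g_a, ilo, ihi, jlo, jhi, i, j
      double precision buf(ld,ld), alpha
      logical ok

c     start the message-passing environment (TCGMSG shown here;
c     MPI-based builds initialize MPI instead), then GA itself
      call pbeginf()
      call ga_initialize()
c     (some configurations also require initializing the MA allocator)

c     create an n x n double-precision global array, default distribution
      ok = ga_create(MT_DBL, n, n, 'A', -1, -1, g_a)
      call ga_zero(g_a)

c     one-sided copy of a (possibly remote) 10 x 10 patch into local memory
      ilo = 1
      ihi = ld
      jlo = 1
      jhi = ld
      call ga_get(g_a, ilo, ihi, jlo, jhi, buf, ld)

c     local computation on the patch
      do j = 1, ld
         do i = 1, ld
            buf(i,j) = buf(i,j) + 1d0
         end do
      end do

c     atomic accumulate merges the update correctly even if several
c     processes target overlapping sections
      alpha = 1d0
      call ga_acc(g_a, ilo, ihi, jlo, jhi, buf, ld, alpha)

      call ga_sync()
      ok = ga_destroy(g_a)
      call ga_terminate()
      call pend()
      end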
GA was developed in light of emerging standards: the API follows Fortran 90 array notation, the library has been designed on object-oriented principles, GA is compatible with MPI, and interfaces to third-party parallel linear algebra packages exist. The programmer is free to use both the shared-memory and message-passing paradigms in the same program and to take advantage of existing message-passing software libraries. Both C and Fortran interfaces are available. The GA model has also been extended and employed successfully in metacomputing environments (I-WAY) and with high-performance secondary storage systems.

2.2 Applications and Users

Development of Global Arrays has been, and continues to be, motivated by applications. The applications of GA include graphics rendering, financial calculations (security value forecasting), molecular dynamics [Tjerk], and, most of all, computational chemistry algorithms that calculate the electronic structure of molecules and other small or crystalline chemical systems. These calculations are used to predict many chemical properties that are not directly accessible by experiment, and they account for a dominant share of the supercomputer cycles currently used for computational chemistry. Although the GA programming model extends the message-passing programming model, no point-to-point message passing is used directly in any of the many chemistry application codes (message passing is used only in the parallel linear algebra libraries that these applications access through GA interfaces), which illustrates the good match between the GA model and computational chemistry. Examples of applications that attain high parallel scalability and demonstrate the ease of use of GA are given in references [4,5,6,7,8].

GA was developed as part of a U.S. Federal High Performance Computing and Communications Initiative (HPCCI) Grand Challenge Applications project in computational chemistry, and it has proven useful for research and production code development at PNNL's Environmental Molecular Sciences Laboratory (EMSL) and elsewhere in the U.S. (including Argonne National Laboratory, NCSA, Pittsburgh Supercomputing Center, Cornell Theory Center, San Diego Supercomputer Center, and the University of Illinois at Urbana-Champaign), as well as in other countries, including the United Kingdom (Daresbury Laboratory), Austria (University of Vienna), Germany (Stuttgart University), Australia (Australian National University), and Italy (University of Perugia). In the last three years many applications have adopted GA, totaling nearly one and a half million lines of code, including some 400,000 lines of new code. (We need specific user names and applications that use GA.)

2.3 Supported Platforms

GA is available on 1) distributed-memory, message-passing parallel computers with interrupt-driven communication or Active Messages (IBM SP and Intel Paragon); 2) networked clusters of single- and multi-processor UNIX (Sun, IBM, SGI, DEC, HP) and Windows NT [Chien] workstations; 3) shared-memory computers (Convex SPP, SGI Origin-2000 and PowerChallenge); and 4) globally addressable distributed-memory computers (Cray T3D/E). (JN: Do we want to mention our collaboration with IBM on GA/LAPI optimizations? It is applicable to ASCI Blue.)

2.4 Disk Resident Arrays: Extension of Global Arrays to Secondary Storage

Disk Resident Arrays (DRA) extend the GA model to another level in the storage hierarchy, namely secondary storage.
DRA introduces the concept of a disk resident array -- a disk-based representation of an array -- and provides functions for transferring blocks of data between global arrays and disk resident arrays. Hence, it allows programmers to access data located on disk via a simple interface expressed in terms of arrays rather than files. This extends the benefits of global arrays (in particular, the absence of complex index calculations and the high-level array interface) to programs that operate on arrays too large to fit into memory. By providing distinct interfaces for accessing objects located in main memory and on disk, GA and DRA make visible the different levels of the memory hierarchy in which objects are stored. Hence, programs can take advantage of the performance characteristics associated with access to these levels. Recall that memory hierarchies consist of multiple levels but are managed between two adjacent levels at a time [Hennessy]. For example, a page fault causes the transfer of a data block (page) to main memory, while a cache miss transfers a cache line. Similarly, GA and DRA allow data transfer between adjacent levels of memory. DRA read and write operations can be applied both to entire arrays and to sections of arrays (disk and/or global arrays); in either case, they are collective and asynchronous. This collective I/O strategy has been adopted in many other projects for efficiency reasons [corbett:mpi-overview, PASSION, Panda]. The asynchronous interface to DRA I/O operations permits applications to overlap time-consuming I/O with computation. DRA has been used successfully in out-of-core matrix multiplication and in two chemistry applications.

2.5 Mirrored Arrays: metacomputing extensions of GA

In recent years there has been increasing interest in metacomputing. Metacomputing environments comprise multiple supercomputers and other devices (mass storage systems, display devices) connected via local area networks (LANs) or wide area networks (WANs). They can potentially provide significant increases in the computational power accessible to a single application. From the memory hierarchy perspective, a metacomputing environment introduces one or more additional layers, with typically much higher latencies and lower bandwidths than within a single supercomputer. Mirrored Arrays [GA-IWAY,TJS] have been developed to address the performance characteristics of a metacomputer in a manner similar to replicated shared memory [RSM]. Arrays are fully distributed within each machine, so the amount of data held by each processor increases by a factor roughly proportional to the number of networked supercomputers. Each supercomputer operates on its own mirrored array independently, and a special operation is provided for enforcing consistency of the different mirrored arrays. This primitive is a collective operation across all supercomputers and merges entire mirrored arrays or user-specified sections of them. Upon completion of this operation, all machines have identical copies of the specified array or array section. The result is that most GA communication occurs locally within the supercomputers; total network traffic is lower and the average message sent across the network is larger than in a fully distributed approach. This approach was very successful for a large application (Self-Consistent Field) that was able to exploit effectively multiple supercomputers (Intel Paragons and IBM SPs) connected by WANs.
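Returning briefly to the DRA interface of section 2.4, the fragment below sketches how a global array might be saved to, and later retrieved from, a disk resident array. It is a sketch only, assuming the Fortran API described in [10] (dra_create, dra_write, dra_wait, dra_close); the constant names, argument order, and the bigA.dat file name are illustrative assumptions and may differ between releases.

      subroutine dra_sketch(g_a, n)
c     Sketch of the DRA interface of section 2.4, assuming the Fortran
c     API of [10]; constant names and argument lists are approximate.
c     Assumes DRA was initialized with dra_init after ga_initialize.
      implicit none
      include 'mafdecls.fh'
      include 'dra.fh'
      integer g_a, n
      integer d_a, request, rc

c     create an n x n double-precision disk resident array backed by
c     a file (file name is illustrative)
      rc = dra_create(MT_DBL, n, n, 'A on disk', 'bigA.dat',
     &                DRA_RW, n, n, d_a)

c     collective, asynchronous write of the global array to disk
      rc = dra_write(g_a, d_a, request)

c     ... computation overlapped with the disk transfer ...

      rc = dra_wait(request)
      rc = dra_close(d_a)
      return
      end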
3. Value of GA to the DOE scientific community and ACTS

(Why should a DOE project use GA? What are the evaluation criteria that favor this tool -- portability, execution speed, development time, interoperability, etc.? What would GA add to the tools in the current SciTL?)

All current and proposed MPPs (ASCI) have, or will have, nonuniform memory architectures. In future systems, both the number of memory levels and the cost (in processor cycles) of accessing deeper levels can be expected to increase [Hennessy,Petaflops]. NUMA-aware programming tools are essential to achieving high efficiency on such systems. The GA NUMA model has proven to be an effective and easy-to-use parallel programming tool for developing real large-scale applications. These applications were able to obtain impressive speedups and scalability on massively parallel computers thanks to the support that GA and DRA provide for performance programming in NUMA environments: data locality control and information, recognition of the memory hierarchy and of the variable cost of transferring data between memory layers, and explicit mechanisms for such transfers. GA is already optimized for the SMP nodes of MPP computers and should match the architecture of the ASCI systems very well.

From the user's perspective, GA does not require expert skills in parallel programming. The high-level, array-oriented API is close to the matrix notation used in the mathematical formulation of physical problems. Scientists who are not parallel programming experts find GA operations on arrays much more familiar, and easier to learn and use, than the low-level message-passing model. These attributes of GA are the key to the very high productivity levels achieved in codes developed with GA, such as NWChem [12]. GA is already interoperable with standard programming tools and libraries used by the DOE scientific community, such as MPI and ScaLAPACK. Extending the interoperability of GA with other tools and facilities is a subject of this proposal.

GA differs from the other tools in SciTL. Unlike P++, which targets the data-parallel execution model, GA supports both task parallelism and data parallelism. Unlike Tulip [Tulip], which uses polling for remote memory access on message-passing systems (such as the IBM SP-2), the GA implementation uses platform-specific low-level communication mechanisms. These mechanisms do not require modifications (inserting polling calls) to _all_ of the application code, which would be impractical for applications that use large standard libraries [Bal]. (Many SciTL toolkits target C++ applications and provide minimal support for Fortran, which scientists use the most; GA is oriented towards Fortran and also offers a C interface.) In summary, GA would complement and extend the capabilities of the tools in the current SciTL by providing a NUMA-aware, portable, efficient, and easy-to-use shared-memory programming model fully compatible with MPI.

4. Application Requirements

(Steve, Dave, Robert)

5. Proposed Work

Although the GA NUMA programming model and toolkit has been very successful for many applications, the value of GA to the DOE scientific community can be improved further.
The improvements target the following areas:
- integration and interoperability with other tools and libraries, including those already in SciTL,
- extensions of the GA functionality and programming interfaces,
- interoperability with High Performance Fortran and language support for the NUMA model.

5.1 Extensions and improvements to GA

5.1.1 Higher-dimensional arrays

The current implementation of GA supports only 1- and 2-dimensional dense arrays distributed (uniformly or nonuniformly) in block fashion. This is a significant limitation for computational fluid dynamics and other applications with 3-dimensional problem domains. These applications require support for three- to five-dimensional arrays to represent physical quantities such as temperature, velocity, and pressure in 3-dimensional spatial domains. We propose to extend GA to support arrays of up to seven dimensions (the Fortran limit). This is a natural extension of the current toolkit capabilities and will fit well into the current implementation framework.

5.1.2 Additional distribution types

To improve load balancing, many parallel linear algebra algorithms adopt a block-cyclic distribution. The current interfaces between GA and parallel linear algebra packages that use such distributions, such as ScaLAPACK, require data reorganization. For some applications these reorganizations could be avoided if GA supported the block-cyclic distribution directly.

5.1.3 Support for sparse data structures

Current GA applications handle sparse data structures by explicitly packing the data into one- or two-dimensional dense arrays. Support for sparse arrays in GA would greatly simplify the implementation of such algorithms: the details of the sparse storage would be encapsulated in the GA object and access to the data simplified. (JN: Do we need dynamic or static sparse storage?)

5.1.4 Split-phase operations

In order to better tolerate latency in systems with multiple layers of NUMA memory hierarchy, we propose to extend GA with split-phase operations.

5.2 Integration and interoperability of GA with standard tools and libraries

5.2.1 Interfaces to numerical libraries

Many applications that use GA require access to parallel linear algebra algorithms. We respond to this demand by developing interfaces to state-of-the-art parallel libraries such as ScaLAPACK, SUMMA, PLAPACK, and PeIGS. The interfaces are integrated with GA using a "black box" approach that hides from the user the complexity (and sometimes the variation from one release to another) of the solver interfaces. The user specifies global array descriptors for the input and output, and the appropriate GA interface operation rearranges the data into the format required by the particular third-party library, calls the appropriate solver, and returns the output data in the output global array. An example of such an interface is the ga_lu_solve operation, which takes three arguments: a character variable indicating whether a transpose operation is required, the array handle for the coefficient matrix, and the array handle for the right-hand-side vector(s), which is overwritten by the solution upon exit. The subroutine rearranges the arrays into the block-cyclic distribution format, calls the ScaLAPACK pdgetrf (factorization) and pdgetrs (forward/backward substitution) subroutines, and returns the solution vector(s) in the global array. The pdgetrf and pdgetrs subroutines take 8 and 13 arguments, respectively, compared to the three arguments of ga_lu_solve.
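As an illustration of this "black box" calling convention, the fragment below sketches how a GA application would solve a dense linear system through the ga_lu_solve interface named above. It is a sketch only: the value of the transpose flag, the create arguments, and the surrounding code are assumptions and may differ from the actual interface.

      subroutine solve_sketch(n)
c     Sketch of the solver interface described above, using the
c     three-argument ga_lu_solve named in this section; the transpose
c     flag value and other details are assumptions.
      implicit none
      include 'mafdecls.fh'
      include 'global.fh'
      integer n
      integer g_a, g_b
      logical ok

c     coefficient matrix A (n x n) and right-hand side b (n x 1)
      ok = ga_create(MT_DBL, n, n, 'A', -1, -1, g_a)
      ok = ga_create(MT_DBL, n, 1, 'b', -1, -1, g_b)

c     ... fill g_a and g_b (e.g. with ga_put) ...

c     solve A x = b: the interface redistributes the data to the
c     block-cyclic format, calls ScaLAPACK pdgetrf and pdgetrs, and
c     overwrites g_b with the solution
      call ga_lu_solve('n', g_a, g_b)

      ok = ga_destroy(g_b)
      ok = ga_destroy(g_a)
      return
      end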
GA already contains interfaces to several parallel linear algebra algorithms available in ScaLAPACK: LU and Cholesky factorizations, linear equation solvers, and matrix inversion. We propose to provide additional interfaces to an eigensolver and to the Householder reduction to tridiagonal form. Additional interfaces to PLAPACK (we need this from George).

PETSc is a set of numerical libraries and data structures developed for the solution of partial differential equations. In addition to the distributed data structures, it contains iterative solvers and preconditioners, and unconstrained minimization algorithms. We propose to develop an interface between GA and PETSc to make the numerical algorithms supported by PETSc available to GA applications. By combining PETSc with Disk Resident Arrays in the GA framework, out-of-core solvers can be provided.

5.2.2 Interface to P++

P++ is a C++ class library that provides a high-level view of distributed arrays and extensive support for operations on such arrays, hiding the low-level details of the architecture the program is running on. P++ targets data-parallel applications and internally uses MultiBlock Parti [Parti] for irregular data transfers. It is possible to offer GA one-sided communication to A++/P++ applications by developing an interface from an A++/P++ program to the GA library, based on the access mechanisms to the raw A++ data that the package provides.

5.3 Interface to HPF and language support

Most scientific applications have been, and continue to be, developed in Fortran. It is important for tools to support this language and to follow its evolution. In particular, High Performance Fortran (HPF) is a new industry standard for developing data-parallel programs. Based on Fortran 90 syntax, it introduces a portable way to express parallelism explicitly in the code. The intuitive data mapping directives and the global name space employed in the HPF model greatly simplify the development of parallel applications. The efficiency of the resulting codes depends on the runtime support, and this is the greatest strength of HPF: all details of the target machine architecture are hidden from the programmer, the actual parallel implementation is done by the compilation system, and the machine-specific runtime support takes responsibility for accessing remote (off-processor) data.

The HPF design and implementation reflect the current status of compiler technology. In order to guarantee portability of codes, in the sense that the performance of the resulting codes is similar on machines with similar computational power regardless of their architecture, HPF does not directly support some features that are necessary for certain classes of applications. In particular, there is no support for the features required in dynamic, task-parallel programs, such as random access to regions of distributed arrays from within a MIMD parallel subroutine call-tree, and reduction into overlapping regions of distributed arrays. The HPF Forum recognized such features as too specific to be included in the Fortran standard, or found it premature to guarantee their performance at this time. By no means does this limit the scope of the language: HPF provides a portable way to incorporate libraries of parallel codes that are implemented in languages other than HPF and that follow parallel programming paradigms different from the one employed by HPF.
The difference between the directly supported HPF features and the extrinsic procedures is that the latter are to be developed by the user (or a third party), and it is the user's responsibility to provide cross-platform portability. Developing parallel extrinsic procedures necessarily requires expertise in parallel computing. Therefore we propose to develop HPF extrinsic libraries that become part of SciTL. As the first step, we propose complementing HPF with the Global Arrays library. In particular, the GA non-collective, one-sided communication on array sections would allow random access to regions of distributed arrays from within a MIMD parallel subroutine call-tree, and reduction into overlapping regions of distributed arrays. GA has been designed to be as compatible as possible with HPF: the GA API follows Fortran 90 and HPF array notation, and most GA collective operations can be translated directly into single-line statements in HPF.

The concept of integrating stand-alone libraries, such as Global Arrays, with HPF can be pushed even further. One strategy is to provide support for user extensions to the language. In this approach, a dedicated, programmable language processor translates the user's directives into HPF statements. In particular, the language processor generates calling sequences for the user's runtime support, implemented as an HPF extrinsic library. We propose to develop language extensions that make Global Arrays, from the user's point of view, an integral part of the HPF runtime system, accessible via HPF-like statements and/or directives. To implement the language processor handling codes written in the extended HPF, we are going to use the infrastructure developed by the Parallel Compiler Runtime Consortium [PCRC]. It includes an HPF parser (and also C++ and Java parsers) that transforms the source code into an intermediate representation (IR), and vice versa. The IR is implemented in an object-oriented fashion and is modeled on Sage++ [Ganon]. A class library to perform operations on the IR is available as well. This infrastructure has been used by the PCRC to implement an HPF compiler [Li]. Here we propose to reimplement the ideas of the Programmable Array Compiler [Rosing] as a class library performing those operations on the IR that correspond to the language extensions. The result is then reparsed to HPF, compiled using a commercial HPF compiler, and linked against the extrinsic Global Arrays library. The key concept here is that the proposed language processor is programmable, that is, the functionality of the class library is described in a Language Definition File. Consequently, the same language processor can be used to integrate other runtime libraries with HPF just by modifying the language definition. Since the PCRC infrastructure is capable of processing C++ and Java sources, this mechanism can be used to integrate other components of SciTL as well.

The other strategy involves dynamic software integration using interpreted protocols. The fundamental difference between this approach and the one described above is that here all components of the application can be compiled independently of each other and dynamically linked on the fly at runtime. The basis of such a system is an HPF interpreter, a prototype of which has been developed at NPAC [AkarsuFoxHaupt]. The heart of the system is an HPF server implemented through the extrinsic interface. The HPF server accepts commands from a client implemented as a Java applet, which the user operates as a GUI.
The user can interrupt, suspend, and resume execution of the precompiled HPF application at any point (including breakpoints and stepping one HPF statement at a time), and request a specific action from the HPF server. Currently, the repertoire of user commands comprises:
- requests for data (for debugging or visualization),
- setting the value of any variable in the precompiled HPF code (steering),
- interpretation of valid HPF statements entered from the keyboard or read from a file (HPF interpreter),
- requests for dynamic linking and execution of precompiled shared objects (dynamic software integration).

Control over the execution of the precompiled code, that is, dynamic switching between compiled and interpreted mode, is facilitated by instrumenting the original HPF source code. This is done automatically using the PCRC infrastructure (the HPF parser/reparser and the class library for transforming the intermediate representation). All features of this integrated HPF compiler and interpreter environment are of value for the SciTL toolkit. It provides support for interactive (and collaborative) visualization, as well as runtime data analysis and steering; it serves as a powerful HPF debugger (particularly when combined with visualization of distributed data objects and the HPF interpreter); and it allows for rapid prototyping (data manipulation in interpreted mode and dynamic linking with precompiled modules). The most important feature of the system is the capability of dynamically alternating between compiled and interpreted modes: the compiled part preserves the performance of the HPF code, while the interpreter allows the user to interact with the code at runtime (debugging, data analysis, steering).

Here we propose to use this system for dynamic integration of an HPF application and the Global Arrays library. In this case we do not expect any significant performance degradation due to the interpreter, because the role of the interpreter is reduced to executing precompiled modules at a priori specified points. The advantage of such an approach is that modules can be developed, compiled, and tested independently of each other, and integrated (linked) together only at runtime. In fact, we envision this system as a powerful tool for the development of multidisciplinary applications. Once the fully operational system is demonstrated for HPF, we will extend it to support codes implemented in Java and C++, and in particular the dynamic integration of existing libraries written in Fortran with object-oriented applications.

The integration of Global Arrays with HPF can be generalized further. So far we have discussed using the GA extrinsic library as a supplement to the HPF runtime system to support specific codes that use Global Arrays. As explained above, Global Arrays offers attractive functionality, beyond the HPF standard, that can be used in other applications. Therefore we propose to extract the kernel of GA and incorporate it directly into the HPF runtime library within the PCRC runtime system. The extended runtime system can then be used by both the programmable compiler and the interpreter.

References

[1] J. Nieplocha, R. Harrison, and R. Littlefield. Global Arrays: A portable "shared-memory" programming model for distributed memory computers. In Proc. Supercomputing 1994, pages 340-349. IEEE Computer Society Press, 1994.

[2] J. Nieplocha, R. J. Harrison, and R. J. Littlefield. The Global Array programming model for high performance scientific computing.
SIAM News, 28(7):12-14, 1995.

[3] J. Nieplocha, R. J. Harrison, and R. J. Littlefield. The Global Array: A non-uniform-memory-access programming model for high-performance computers. The Journal of Supercomputing, 10:197-220, 1996.

[4] R. Harrison, M. Guest, R. Kendall, D. Bernholdt, A. Wong, M. Stave, J. Anchell, A. Hess, R. Littlefield, G. Fann, J. Nieplocha, G. Thomas, D. Elwood, J. Tilson, R. Shepard, A. Wagner, I. Foster, E. Lusk, and R. Stevens. Toward high-performance computational chemistry: II. A scalable self-consistent field program. Journal of Computational Chemistry, 17(1):124-132, 1996.

[5] D. Bernholdt and R. Harrison. Large-scale correlated electronic structure calculations: The RI-MP2 method on parallel computers. Chemical Physics Letters, 250:477-484, 1996.

[6] H. Dachsel, H. Lischka, R. Shepard, J. Nieplocha, and R. Harrison. A massively parallel multireference configuration interaction program: the parallel COLUMBUS program. Journal of Chemical Physics, 1997.

[7] A. Wong, R. Harrison, and A. Rendell. Parallel direct four-index transformations. Theoretica Chimica Acta, 1996.

[8] D. Bernholdt and R. Harrison. Orbital-invariant second-order many-body perturbation theory on parallel computers: An approach for large molecules. Journal of Chemical Physics, 102(24):9582-9589, 1995.

[9] J. Nieplocha and R. J. Harrison. Shared memory NUMA programming on I-WAY. In Proc. IEEE High Performance Distributed Computing (HPDC-5), 1996. (HPDC-5 Best Paper award)

[10] J. Nieplocha and I. Foster. Disk Resident Arrays: An array-oriented library for high performance I/O. In Proc. Frontiers of Massively Parallel Computation (Frontiers'96), 1996.

[11] H. Fruechtl, R. A. Kendall, R. J. Harrison, and K. G. Dyall. A scalable implementation of RI-SCF on parallel computers. Intl. J. Quantum Chem. Symp., 1997 (in press).

[12] D. Bernholdt et al. Parallel computational chemistry made easier: The development of NWChem. Intl. J. Quantum Chem. Symp., 29:475-483, 1995.

[Tulip] P. Beckman and D. Gannon. Tulip: A portable run-time system for object-oriented systems. In Proc. IPPS'96, pages 532-536, 1996.

[Bal] K. Langendoen, J. Romein, R. Bhoedjang, and H. Bal. Integrating polling, interrupts and thread management. In Proc. Frontiers'96, pages 13-22, 1996.

[Rosing] M. Rosing. The Programmable Array Compiler.