From: Tomasz Haupt <haupt@npac.syr.edu>
Date: Wed, 09 Jul 1997 16:06:07 -0400
To: j_nieplocha@pnl.gov
CC: dbc, bernhold
Subject: ACTS proposal

Hi,

this is the latest version of the proposal. I am still working on the references. I think Bryan and David can add to it. In particular, I would like Bryan to put a little more "meat" into the very last section, which deals with the HPF runtime; I think what I wrote is a little vague.

Tom

----------------------------------------------------------------------

Extending the Scientific Template Library with a Shared-Memory NUMA Toolkit

1. Introduction

2. Overview of GA

(What is the purpose of this tool? What aspects of the code design/development/porting process will it impact the most? How much training/experience is needed to use it effectively? Who has successfully used it already (specific pointers)?)

2.1 Global Array NUMA Programming Model

All scalable parallel computers feature a memory hierarchy in which some locations are "closer" to a particular processor than others. The hardware in a particular system may support a shared-memory or message-passing programming model, but these factors affect only the relative costs of local and remote accesses, not the system's fundamental Non-Uniform Memory Access (NUMA) characteristics. Yet while the efficient management of memory hierarchies is fundamental to high performance in scientific computing, existing parallel languages and tools provide only limited support for this management task [Velde,NATO]. They have focused primarily on parallelism, and more specifically on the issues of control flow, communication structures, and load balancing. Memory-related issues such as data locality and the variable cost of accessing data have been largely ignored. Recognizing this deficiency, we developed abstractions and programming tools that facilitate the explicit management of memory hierarchies by the programmer, and hence the efficient programming of scalable parallel computers. The abstractions comprise local arrays, global (distributed) arrays, and disk resident arrays located on secondary storage. The tools comprise the Global Arrays (GA) library [1, 2, 3], which supports the transfer of data between local and global arrays, and the Disk Resident Arrays library [DRA], for transferring data between global and disk resident arrays.

The two predominant programming models for MIMD concurrent computing are message passing and shared memory. Message passing assumes a distributed-memory model in which distinct processes each have their own "local" data and share data only through cooperative communication. A process can access its own local data directly, but access to remote data requires the cooperation of the process that owns the data. The remote process must send the required data in an explicit message, and hence must know which piece of data is needed by which process and when. This requirement makes the message-passing model hard to use for irregular problems and for applications that use dynamic load balancing.
This is because the coordination of a large number of processes that operate on uneven chunks of data, or that require access to remote data at irregular time intervals, increases algorithmic complexity and magnifies the associated programming effort. The recently proposed MPI-2 one-sided communication model [MPI2] addresses this issue to some degree, but its progress rules can be too restrictive for some applications. Despite these programming difficulties, the message-passing paradigm's memory model maps well to the distributed-memory architectures used in scalable MPP systems. Because the programmer must explicitly control data distribution and is required to address data locality issues, message-passing applications tend to execute efficiently on such systems.

In the shared-memory programming model, data is located either in "private" memory (accessible only to a specific process) or in "global" memory (accessible to all processes). In some shared-memory systems, global memory is accessed in the same manner as local memory. Systems based on this approach may rely on hardware or operating-system support to recognize load and store operations that reference non-local memory (e.g., SGI Origin-2000, Convex/HP SPP), or they may use purely software-based approaches, as in the various distributed shared memory libraries, for example TreadMarks [Treadmarks] or Midway [midway]. In other shared-memory systems, global memory is accessed through distinguished mechanisms, such as language constructs [Linda, SplitC, CCpp], special user-defined operations [Orca], or library functions [GA, DOLIB]. A disadvantage of many shared-memory models is that they do not expose the NUMA memory hierarchy of the underlying distributed-memory hardware. Instead, they present a flat view of memory, making it hard for programmers to understand how data access patterns affect performance or to exploit data locality. Hence, while the programming effort involved in application development tends to be much lower than in the message-passing approach, the achieved performance is usually less competitive.

The Global Arrays NUMA model combines the better features of message passing and shared memory: the distributed-memory view of the message-passing model; one-sided access to remote data in the spirit of the shared-memory paradigm; explicit control over data distribution and data locality, together with access to distribution and mapping information; recognition of the memory hierarchy and of the performance differences between accesses to its distinct layers; and message passing as a subset (to support algorithms that require synchronization on data transfer). This combination leads to both simple coding and efficient execution for a class of applications that operate on large distributed dense arrays, require dynamic load balancing, or exhibit unpredictable data reference patterns. The primary mechanisms provided by GA for accessing data are copy operations that transfer data between layers of the memory hierarchy, namely global memory (distributed array) and local memory. In addition, each process can directly access the data held in the global array sections that are assigned to that process. Locks and other atomic operations are also provided; they can be used to implement synchronization and to assure the correctness of accumulate operations (floating-point sum reductions that combine local and remote data) executed concurrently by multiple processes on overlapping array sections.
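To make these access mechanisms concrete, the sketch below shows one process fetching a (possibly remote) patch of a global array into local memory, updating it, and merging the result back with an atomic accumulate. It is a minimal sketch only, assuming the 2-D Fortran interface described in [1,3]; include-file names, initialization calls, and argument details vary between GA releases and installations.

      program ga_sketch
c     Minimal sketch of the GA access mechanisms described above,
c     assuming the 2-D Fortran interface of [1,3]; include-file names,
c     initialization calls, and argument lists may differ by release.
      implicit none
      include 'mafdecls.fh'
      include 'global.fh'
      integer n, ld
      parameter (n = 1000, ld = 10)
      integer g_a, ilo, ihi, jlo, jhi, i, j
      double precision buf(ld,ld), alpha
      logical ok

c     start the message-passing environment (TCGMSG shown here;
c     MPI-based builds initialize MPI instead), then GA itself
      call pbeginf()
      call ga_initialize()
c     (some configurations also require initializing the MA allocator)

c     create an n x n double-precision global array, default distribution
      ok = ga_create(MT_DBL, n, n, 'A', -1, -1, g_a)
      call ga_zero(g_a)

c     one-sided copy of a (possibly remote) 10 x 10 patch into local memory
      ilo = 1
      ihi = ld
      jlo = 1
      jhi = ld
      call ga_get(g_a, ilo, ihi, jlo, jhi, buf, ld)

c     local computation on the patch
      do j = 1, ld
         do i = 1, ld
            buf(i,j) = buf(i,j) + 1d0
         end do
      end do

c     atomic accumulate merges the update correctly even if several
c     processes target overlapping sections
      alpha = 1d0
      call ga_acc(g_a, ilo, ihi, jlo, jhi, buf, ld, alpha)

      call ga_sync()
      ok = ga_destroy(g_a)
      call ga_terminate()
      call pend()
      end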
GA was developed in light of emerging standards: the API follows Fortran 90 array notation, the library has been designed on object-oriented principles, GA is compatible with MPI, and interfaces to third-party parallel linear algebra packages exist. The programmer is free to use both the shared-memory and message-passing paradigms in the same program and to take advantage of existing message-passing software libraries. Both C and Fortran interfaces are available. The GA model has also been extended and employed successfully in metacomputing environments (I-WAY) and with high-performance secondary storage systems.

2.2 Applications and Users

Development of Global Arrays has been, and continues to be, motivated by applications. The applications of GA include graphics rendering, financial calculations (security value forecasting), molecular dynamics [Tjerk], and, most of all, computational chemistry algorithms that calculate the electronic structure of molecules and other small or crystalline chemical systems. These calculations are used to predict many chemical properties that are not directly accessible by experiment, and they account for a dominant share of the supercomputer cycles currently used for computational chemistry. Although the GA programming model extends the message-passing programming model, no point-to-point message passing is used directly in any of the many chemistry application codes (message passing is used only in the parallel linear algebra libraries that these applications access through GA interfaces), which illustrates the good match between the GA model and computational chemistry. Examples of applications that attain high parallel scalability and demonstrate the ease of use of GA are given in references [4,5,6,7,8].

GA was developed as part of a U.S. Federal High Performance Computing and Communications Initiative (HPCCI) Grand Challenge Applications project in computational chemistry, and it has proven useful for research and production code development at PNNL's Environmental Molecular Sciences Laboratory (EMSL) and elsewhere in the U.S. (including Argonne National Laboratory, NCSA, Pittsburgh Supercomputing Center, Cornell Theory Center, San Diego Supercomputer Center, and the University of Illinois at Urbana-Champaign), as well as in other countries, including the United Kingdom (Daresbury Laboratory), Austria (University of Vienna), Germany (Stuttgart University), Australia (Australian National University), and Italy (University of Perugia). In the last three years many applications have adopted GA, totaling nearly one and a half million lines of code, including some 400,000 lines of new code. (We need specific user names and applications that use GA.)

2.3 Supported Platforms

GA is available on 1) distributed-memory, message-passing parallel computers with interrupt-driven communication or Active Messages (IBM SP and Intel Paragon); 2) networked clusters of single- and multi-processor UNIX (Sun, IBM, SGI, DEC, HP) and Windows NT [Chien] workstations; 3) shared-memory computers (Convex SPP, SGI Origin-2000 and PowerChallenge); and 4) globally addressable distributed-memory computers (Cray T3D/E). (JN: Do we want to mention our collaboration with IBM on GA/LAPI optimizations? It is applicable to ASCI Blue.)

2.4 Disk Resident Arrays: Extension of Global Arrays to Secondary Storage

Disk Resident Arrays (DRA) extend the GA model to another level in the storage hierarchy, namely secondary storage.
DRA introduces the concept of a disk resident array -- a disk-based representation of an array -- and provides functions for transferring blocks of data between global arrays and disk resident arrays. Hence, it allows programmers to access data located on disk via a simple interface expressed in terms of arrays rather than files. This extends the benefits of global arrays (in particular, the absence of complex index calculations and the high-level array interface) to programs that operate on arrays too large to fit into memory. By providing distinct interfaces for accessing objects located in main memory and on disk, GA and DRA make visible the different levels of the memory hierarchy in which objects are stored. Hence, programs can take advantage of the performance characteristics associated with access to these levels. Recall that memory hierarchies consist of multiple levels but are managed between two adjacent levels at a time [Hennessy]. For example, a page fault causes the transfer of a data block (page) to main memory, while a cache miss transfers a cache line. Similarly, GA and DRA allow data transfer between adjacent levels of memory. DRA read and write operations can be applied both to entire arrays and to sections of arrays (disk and/or global arrays); in either case, they are collective and asynchronous. This collective I/O strategy has been adopted in many other projects for efficiency reasons [corbett:mpi-overview, PASSION, Panda]. The asynchronous interface to DRA I/O operations permits applications to overlap time-consuming I/O with computation. DRA has been used successfully in out-of-core matrix multiplication and in two chemistry applications.

2.5 Mirrored Arrays: metacomputing extensions of GA

In recent years there has been increasing interest in metacomputing. Metacomputing environments comprise multiple supercomputers and other devices (mass storage systems, display devices) connected via local area networks (LANs) or wide area networks (WANs). They can potentially provide significant increases in the computational power accessible to a single application. From the memory hierarchy perspective, a metacomputing environment introduces one or more additional layers, with typically much higher latencies and lower bandwidths than within a single supercomputer. Mirrored Arrays [GA-IWAY,TJS] have been developed to address the performance characteristics of a metacomputer in a manner similar to replicated shared memory [RSM]. Arrays are fully distributed within each machine, so the amount of data held by each processor increases by a factor roughly proportional to the number of networked supercomputers. Each supercomputer operates on its own mirrored array independently, and a special operation is provided for enforcing consistency of the different mirrored arrays. This primitive is a collective operation across all supercomputers and merges entire mirrored arrays or user-specified sections of them. Upon completion of this operation, all machines have identical copies of the specified array or array section. The result is that most GA communication occurs locally within the supercomputers; total network traffic is lower and the average message sent across the network is larger than in a fully distributed approach. This approach was very successful for a large application (Self-Consistent Field) that was able to exploit effectively multiple supercomputers (Intel Paragons and IBM SPs) connected by WANs.
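Returning briefly to the DRA interface of section 2.4, the fragment below sketches how a global array might be saved to, and later retrieved from, a disk resident array. It is a sketch only, assuming the Fortran API described in [10] (dra_create, dra_write, dra_wait, dra_close); the constant names, argument order, and the bigA.dat file name are illustrative assumptions and may differ between releases.

      subroutine dra_sketch(g_a, n)
c     Sketch of the DRA interface of section 2.4, assuming the Fortran
c     API of [10]; constant names and argument lists are approximate.
c     Assumes DRA was initialized with dra_init after ga_initialize.
      implicit none
      include 'mafdecls.fh'
      include 'dra.fh'
      integer g_a, n
      integer d_a, request, rc

c     create an n x n double-precision disk resident array backed by
c     a file (file name is illustrative)
      rc = dra_create(MT_DBL, n, n, 'A on disk', 'bigA.dat',
     &                DRA_RW, n, n, d_a)

c     collective, asynchronous write of the global array to disk
      rc = dra_write(g_a, d_a, request)

c     ... computation overlapped with the disk transfer ...

      rc = dra_wait(request)
      rc = dra_close(d_a)
      return
      end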
3. Value of GA to the DOE scientific community and ACTS

(Why should a DOE project use GA? What are the evaluation criteria that favor this tool -- portability, execution speed, development time, interoperability, etc.? What would GA add to the tools in the current SciTL?)

All current and proposed MPPs (ASCI) have, or will have, nonuniform memory architectures. In future systems, both the number of memory levels and the cost (in processor cycles) of accessing deeper levels can be expected to increase [Hennessy,Petaflops]. NUMA-aware programming tools are essential to achieving high efficiency on such systems. The GA NUMA model has proven to be an effective and easy-to-use parallel programming tool for developing real large-scale applications. These applications were able to obtain impressive speedups and scalability on massively parallel computers thanks to the support that GA and DRA provide for performance programming in NUMA environments: data locality control and information, recognition of the memory hierarchy and of the variable cost of transferring data between memory layers, and explicit mechanisms for such transfers. GA is already optimized for the SMP nodes of MPP computers and should match the architecture of the ASCI systems very well.

From the user's perspective, GA does not require expert skills in parallel programming. The high-level, array-oriented API is close to the matrix notation used in the mathematical formulation of physical problems. Scientists who are not parallel programming experts find GA operations on arrays much more familiar, and easier to learn and use, than the low-level message-passing model. These attributes of GA are the key to the very high productivity levels achieved in codes developed with GA, such as NWChem [12]. GA is already interoperable with standard programming tools and libraries used by the DOE scientific community, such as MPI and ScaLAPACK. Extending the interoperability of GA with other tools and facilities is a subject of this proposal.

GA differs from the other tools in SciTL. Unlike P++, which targets the data-parallel execution model, GA supports both task parallelism and data parallelism. Unlike Tulip [Tulip], which uses polling for remote memory access on message-passing systems (such as the IBM SP-2), the GA implementation uses platform-specific low-level communication mechanisms. These mechanisms do not require modifications (inserting polling calls) to _all_ of the application code, which would be impractical for applications that use large standard libraries [Bal]. (Many SciTL toolkits target C++ applications and provide minimal support for Fortran, which scientists use the most; GA is oriented towards Fortran and also offers a C interface.) In summary, GA would complement and extend the capabilities of the tools in the current SciTL by providing a NUMA-aware, portable, efficient, and easy-to-use shared-memory programming model fully compatible with MPI.

4. Application Requirements

(Steve, Dave, Robert)

5. Proposed Work

Although the GA NUMA programming model and toolkit has been very successful for many applications, the value of GA to the DOE scientific community can be improved further.
The improvements target the following areas:
- integration and interoperability with other tools and libraries, including those already in SciTL,
- extensions of the GA functionality and programming interfaces,
- interoperability with High Performance Fortran and language support for the NUMA model.

5.1 Extensions and improvements to GA

5.1.1 Higher-dimensional arrays

The current implementation of GA supports only 1- and 2-dimensional dense arrays distributed (uniformly or nonuniformly) in block fashion. This is a significant limitation for computational fluid dynamics and other applications with 3-dimensional problem domains. These applications require support for three- to five-dimensional arrays to represent physical quantities such as temperature, velocity, and pressure in 3-dimensional spatial domains. We propose to extend GA to support arrays of up to seven dimensions (the Fortran limit). This is a natural extension of the current toolkit capabilities and will fit well into the current implementation framework.

5.1.2 Additional distribution types

To improve load balancing, many parallel linear algebra algorithms adopt a block-cyclic distribution. The current interfaces between GA and parallel linear algebra packages that use such distributions, such as ScaLAPACK, require data reorganization. For some applications these reorganizations could be avoided if GA supported the block-cyclic distribution directly.

5.1.3 Support for sparse data structures

Current GA applications handle sparse data structures by explicitly packing the data into one- or two-dimensional dense arrays. Support for sparse arrays in GA would greatly simplify the implementation of such algorithms: the details of the sparse storage would be encapsulated in the GA object and access to the data simplified. (JN: Do we need dynamic or static sparse storage?)

5.1.4 Split-phase operations

In order to better tolerate latency in systems with multiple layers of NUMA memory hierarchy, we propose to extend GA with split-phase operations.

5.2 Integration and interoperability of GA with standard tools and libraries

5.2.1 Interfaces to numerical libraries

Many applications that use GA require access to parallel linear algebra algorithms. We respond to this demand by developing interfaces to state-of-the-art parallel libraries such as ScaLAPACK, SUMMA, PLAPACK, and PeIGS. The interfaces are integrated with GA using a "black box" approach that hides from the user the complexity (and sometimes the variation from one release to another) of the solver interfaces. The user specifies global array descriptors for the input and output, and the appropriate GA interface operation rearranges the data into the format required by the particular third-party library, calls the appropriate solver, and returns the output data in the output global array. An example of such an interface is the ga_lu_solve operation, which takes three arguments: a character variable indicating whether a transpose operation is required, the array handle for the coefficient matrix, and the array handle for the right-hand-side vector(s), which is overwritten by the solution upon exit. The subroutine rearranges the arrays into the block-cyclic distribution format, calls the ScaLAPACK pdgetrf (factorization) and pdgetrs (forward/backward substitution) subroutines, and returns the solution vector(s) in the global array. The pdgetrf and pdgetrs subroutines take 8 and 13 arguments, respectively, compared to the three arguments of ga_lu_solve.
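As an illustration of this "black box" calling convention, the fragment below sketches how a GA application would solve a dense linear system through the ga_lu_solve interface named above. It is a sketch only: the value of the transpose flag, the create arguments, and the surrounding code are assumptions and may differ from the actual interface.

      subroutine solve_sketch(n)
c     Sketch of the solver interface described above, using the
c     three-argument ga_lu_solve named in this section; the transpose
c     flag value and other details are assumptions.
      implicit none
      include 'mafdecls.fh'
      include 'global.fh'
      integer n
      integer g_a, g_b
      logical ok

c     coefficient matrix A (n x n) and right-hand side b (n x 1)
      ok = ga_create(MT_DBL, n, n, 'A', -1, -1, g_a)
      ok = ga_create(MT_DBL, n, 1, 'b', -1, -1, g_b)

c     ... fill g_a and g_b (e.g. with ga_put) ...

c     solve A x = b: the interface redistributes the data to the
c     block-cyclic format, calls ScaLAPACK pdgetrf and pdgetrs, and
c     overwrites g_b with the solution
      call ga_lu_solve('n', g_a, g_b)

      ok = ga_destroy(g_b)
      ok = ga_destroy(g_a)
      return
      end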
GA already contains interfaces to several parallel linear algebra algorithms available in ScaLAPACK: LU and Cholesky factorizations, linear equation solvers, and matrix inversion. We propose to provide additional interfaces to an eigensolver and to the Householder reduction to tridiagonal form. Additional interfaces to PLAPACK (we need this from George).

PETSc is a set of numerical libraries and data structures developed for the solution of partial differential equations. In addition to the distributed data structures, it contains iterative solvers and preconditioners, and unconstrained minimization algorithms. We propose to develop an interface between GA and PETSc to make the numerical algorithms supported by PETSc available to GA applications. By combining PETSc with Disk Resident Arrays in the GA framework, out-of-core solvers can be provided.

5.2.2 Interface to P++

P++ is a C++ class library that provides a high-level view of distributed arrays and extensive support for operations on such arrays, hiding the low-level details of the architecture the program is running on. P++ targets data-parallel applications and internally uses MultiBlock Parti [Parti] for irregular data transfers. It is possible to offer GA one-sided communication to A++/P++ applications by developing an interface from an A++/P++ program to the GA library, based on the access mechanisms to the raw A++ data that the package provides.

5.3 Interface to HPF and language support

Most scientific applications have been, and continue to be, developed in Fortran. It is important for tools to support this language and to follow its evolution. In particular, High Performance Fortran (HPF) is a new industry standard for developing data-parallel programs. Based on Fortran 90 syntax, it introduces a portable way to express parallelism explicitly in the code. The intuitive data mapping directives and the global name space employed in the HPF model greatly simplify the development of parallel applications. The efficiency of the resulting codes depends on the runtime support, and this is the greatest strength of HPF: all details of the target machine architecture are hidden from the programmer, the actual parallel implementation is done by the compilation system, and the machine-specific runtime support takes responsibility for accessing remote (off-processor) data.

The HPF design and implementation reflect the current status of compiler technology. In order to guarantee portability of codes, in the sense that the performance of the resulting codes is similar on machines with similar computational power regardless of their architecture, HPF does not directly support some features that are necessary for certain classes of applications. In particular, there is no support for the features required in dynamic, task-parallel programs, such as random access to regions of distributed arrays from within a MIMD parallel subroutine call-tree, and reduction into overlapping regions of distributed arrays. The HPF Forum recognized such features as too specific to be included in the Fortran standard, or found it premature to guarantee their performance at this time. By no means does this limit the scope of the language: HPF provides a portable way to incorporate libraries of parallel codes that are implemented in languages other than HPF and that follow parallel programming paradigms different from the one employed by HPF.
The difference between the directly supported HPF features and the extrinsic procedures is that the latter are to be developed by the user (or a third party), and it is the user's responsibility to provide cross-platform portability. Developing parallel extrinsic procedures necessarily requires expertise in parallel computing. Therefore we propose to develop HPF extrinsic libraries that become part of SciTL. As the first step, we propose complementing HPF with the Global Arrays library. In particular, the GA non-collective, one-sided communication on array sections would allow random access to regions of distributed arrays from within a MIMD parallel subroutine call-tree, and reduction into overlapping regions of distributed arrays. GA has been designed to be as compatible as possible with HPF: the GA API follows Fortran 90 and HPF array notation, and most GA collective operations can be translated directly into single-line statements in HPF.

The concept of integrating stand-alone libraries, such as Global Arrays, with HPF can be pushed even further. One strategy is to provide support for user extensions to the language. In this approach, a dedicated, programmable language processor translates the user's directives into HPF statements. In particular, the language processor generates calling sequences for the user's runtime support, implemented as an HPF extrinsic library. We propose to develop language extensions that make Global Arrays, from the user's point of view, an integral part of the HPF runtime system, accessible via HPF-like statements and/or directives. To implement the language processor handling codes written in the extended HPF, we are going to use the infrastructure developed by the Parallel Compiler Runtime Consortium [PCRC]. It includes an HPF parser (and also C++ and Java parsers) that transforms the source code into an intermediate representation (IR), and vice versa. The IR is implemented in an object-oriented fashion and is modeled on Sage++ [Ganon]. A class library to perform operations on the IR is available as well. This infrastructure has been used by the PCRC to implement an HPF compiler [Li]. Here we propose to reimplement the ideas of the Programmable Array Compiler [Rosing] as a class library performing those operations on the IR that correspond to the language extensions. The result is then reparsed to HPF, compiled using a commercial HPF compiler, and linked against the extrinsic Global Arrays library. The key concept here is that the proposed language processor is programmable, that is, the functionality of the class library is described in a Language Definition File. Consequently, the same language processor can be used to integrate other runtime libraries with HPF just by modifying the language definition. Since the PCRC infrastructure is capable of processing C++ and Java sources, this mechanism can be used to integrate other components of SciTL as well.

The other strategy involves dynamic software integration using interpreted protocols. The fundamental difference between this approach and the one described above is that here all components of the application can be compiled independently of each other and dynamically linked on the fly at runtime. The basis of such a system is an HPF interpreter, a prototype of which has been developed at NPAC [AkarsuFoxHaupt]. The heart of the system is an HPF server implemented through the extrinsic interface. The HPF server accepts commands from a client implemented as a Java applet, which the user operates as a GUI.
The user can interrupt, suspend, and resume execution of the precompiled HPF application at any point (including breakpoints and stepping one HPF statement at a time), and request a specific action from the HPF server. Currently, the repertoire of user commands comprises:
- requests for data (for debugging or visualization),
- setting the value of any variable in the precompiled HPF code (steering),
- interpretation of valid HPF statements entered from the keyboard or read from a file (HPF interpreter),
- requests for dynamic linking and execution of precompiled shared objects (dynamic software integration).

Control over the execution of the precompiled code, that is, dynamic switching between compiled and interpreted mode, is facilitated by instrumenting the original HPF source code. This is done automatically using the PCRC infrastructure (the HPF parser/reparser and the class library for transforming the intermediate representation). All features of this integrated HPF compiler and interpreter environment are of value for the SciTL toolkit. It provides support for interactive (and collaborative) visualization, as well as runtime data analysis and steering; it serves as a powerful HPF debugger (particularly when combined with visualization of distributed data objects and the HPF interpreter); and it allows for rapid prototyping (data manipulation in interpreted mode and dynamic linking with precompiled modules). The most important feature of the system is the capability of dynamically alternating between compiled and interpreted modes: the compiled part preserves the performance of the HPF code, while the interpreter allows the user to interact with the code at runtime (debugging, data analysis, steering).

Here we propose to use this system for dynamic integration of an HPF application and the Global Arrays library. In this case we do not expect any significant performance degradation due to the interpreter, because the role of the interpreter is reduced to executing precompiled modules at a priori specified points. The advantage of such an approach is that modules can be developed, compiled, and tested independently of each other, and integrated (linked) together only at runtime. In fact, we envision this system as a powerful tool for the development of multidisciplinary applications. Once the fully operational system is demonstrated for HPF, we will extend it to support codes implemented in Java and C++, and in particular the dynamic integration of existing libraries written in Fortran with object-oriented applications.

The integration of Global Arrays with HPF can be generalized further. So far we have discussed using the GA extrinsic library as a supplement to the HPF runtime system to support specific codes that use Global Arrays. As explained above, Global Arrays offers attractive functionality, beyond the HPF standard, that can be used in other applications. Therefore we propose to extract the kernel of GA and incorporate it directly into the HPF runtime library within the PCRC runtime system. The extended runtime system can then be used by both the programmable compiler and the interpreter.

References

[1] J. Nieplocha, R. Harrison, and R. Littlefield. Global Arrays: A portable "shared-memory" programming model for distributed memory computers. In Proc. Supercomputing 1994, pages 340-349. IEEE Computer Society Press, 1994.

[2] J. Nieplocha, R. J. Harrison, and R. J. Littlefield. The Global Array programming model for high performance scientific computing.
SIAM News, 28(7):12-14, 1995.

[3] J. Nieplocha, R. J. Harrison, and R. J. Littlefield. The Global Array: A non-uniform-memory-access programming model for high-performance computers. The Journal of Supercomputing, 10:197-220, 1996.

[4] R. Harrison, M. Guest, R. Kendall, D. Bernholdt, A. Wong, M. Stave, J. Anchell, A. Hess, R. Littlefield, G. Fann, J. Nieplocha, G. Thomas, D. Elwood, J. Tilson, R. Shepard, A. Wagner, I. Foster, E. Lusk, and R. Stevens. Toward high-performance computational chemistry: II. A scalable self-consistent field program. Journal of Computational Chemistry, 17(1):124-132, 1996.

[5] D. Bernholdt and R. Harrison. Large-scale correlated electronic structure calculations: The RI-MP2 method on parallel computers. Chemical Physics Letters, 250:477-484, 1996.

[6] H. Dachsel, H. Lischka, R. Shepard, J. Nieplocha, and R. Harrison. A massively parallel multireference configuration interaction program: the parallel COLUMBUS program. Journal of Chemical Physics, 1997.

[7] A. Wong, R. Harrison, and A. Rendell. Parallel direct four-index transformations. Theoretica Chimica Acta, 1996.

[8] D. Bernholdt and R. Harrison. Orbital-invariant second-order many-body perturbation theory on parallel computers: An approach for large molecules. Journal of Chemical Physics, 102(24):9582-9589, 1995.

[9] J. Nieplocha and R. J. Harrison. Shared memory NUMA programming on I-WAY. In Proc. IEEE High Performance Distributed Computing (HPDC-5), 1996. (HPDC-5 Best Paper award)

[10] J. Nieplocha and I. Foster. Disk Resident Arrays: An array-oriented library for high performance I/O. In Proc. Frontiers of Massively Parallel Computation (Frontiers'96), 1996.

[11] H. Fruechtl, R. A. Kendall, R. J. Harrison, and K. G. Dyall. A scalable implementation of RI-SCF on parallel computers. Intl. J. Quantum Chem. Symp., 1997 (in press).

[12] D. Bernholdt et al. Parallel computational chemistry made easier: The development of NWChem. Intl. J. Quantum Chem. Symp., 29:475-483, 1995.

[Tulip] P. Beckman and D. Gannon. Tulip: A portable run-time system for object-oriented systems. In Proc. IPPS'96, pages 532-536, 1996.

[Bal] K. Langendoen, J. Romein, R. Bhoedjang, and H. Bal. Integrating polling, interrupts and thread management. In Proc. Frontiers'96, pages 13-22, 1996.

[Rosing] M. Rosing. The Programmable Array Compiler.