Andrew Grimshaw
University of Virginia
grimshaw@cs.virginia.edu
Abstract:
The following white paper is in draft form. It offers an introduction to metasystems, as well as a sense of their practical use in both current and future computing. The reader who already has a good handle on metasystems should skip to the section on "the near term future".
I am often asked by applications programmers "what
exactly is a metasystem, anyway?" Physically, a metasystem
is a collection of geographically separated resources (people,
hosts, instruments, databases) connected by a high-speed interconnect.
We might think we already have a metasystem of sorts in the Internet
or in clusters of computers; a metasystem, however, is far more than
computers connected by fiber. What distinguishes
a metasystem from a collection of computers is the software layer,
often called middleware, that transforms a collection of independent
hosts into a single, coherent virtual machine. To the user, a
virtual machine is a single entity that executes applications,
schedules application components, detects and recovers from faults,
provides necessary protection and security to both users and resource
owners, and brings users together through enhanced collaboration
and sharing.
The astute reader will note that I did not use the
word high-performance. A system does not need to be high-performance
to be considered a metasystem (although to be useful to scientific
users it must be). A metasystem's import lies in its single interface,
its transparency to faults, and the shared aspects of the machine.
Having said that, in the remainder of this white paper we will assume
that we are discussing high-performance metasystems.
So why don't we have metasystems today? As usual,
the fundamental difficulty in constructing a metasystem is software
- specifically an inadequate conceptual model for metasystem software
design. In the face of the onrush of hardware, the computing community
has tried to stretch an existing paradigm, interacting autonomous
hosts, into a regime that it was not designed to handle. The result
is a collection of partial solutions - some good in isolation,
but lacking coherence and scalability - that make the development
of even a single wide-area application demanding at best and
nearly impossible at worst.
Thus, the challenge to the computer science community
is to provide a solid, integrated foundation on which to build
applications that can unleash the potential of so many diverse
resources. The foundation must hide the underlying physical infrastructure
from both users and the vast majority of programmers; support
access, location, and fault transparency; enable interoperability
of components; support construction of larger integrated components
using existing components; provide a secure environment for both
resource owners and users; and it must scale to millions of autonomous
hosts.
The technology to meet this challenge largely exists.
Our vision is of a metasystem consisting
of millions of hosts and trillions of objects co-existing in a
loose confederation tied together with high-speed links. The user
will have the illusion of a very powerful computer on her desk.
She will sit at her terminal and manipulate objects. The objects
she manipulates will represent data resources such as digital
libraries or video streams, applications such as teleconferencing
or physical simulations, and physical devices such as cameras,
telescopes, or linear accelerators. Naturally the objects being
manipulated may be shared with other users, allowing the construction
of shared virtual workspaces.
It is the metasystem's responsibility to support
the abstraction presented to the user; to transparently schedule
application components on processors; to manage data migration, caching,
transfer, and coercion; to detect and manage faults; and to ensure that
the user's data and physical resources are adequately protected.
The potential benefits of a metasystem to the scientific community are enormous.
There are many ways a metasystem may be used, ranging from services that are relatively simple to imagine and implement to uses of metasystem capabilities that address currently intractable problems. Below, I have briefly sketched four potential uses of a metasystem that span this spectrum and illustrate the possibilities.
The simplest service a metasystem provides is location
transparency for file and object access. This is a well-understood
capability that is known by most as a distributed file system.
For example, a user program running on site A can access a file
at site B using the same mechanism (open, close, read, write)
used to access files at site A. Further, the same file may be
shared by more than one user and by more than one task. Issues
that must be addressed in such a distributed file system are naming,
caching, replication, coherence of caches and replicas, and so
on.
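To make the idea concrete, here is a minimal sketch in C++ using ordinary POSIX calls. The mount point /dfs/siteB is hypothetical, standing in for wherever the distributed file system happens to expose site B's files; the point is only that the remote file is opened and read with exactly the same calls as a local one.

    // Sketch only: assumes the distributed file system exposes site B's files
    // under a path such as /dfs/siteB (a hypothetical mount point).
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        // A remote file at site B, accessed with the same calls as a local file.
        const char* remote_path = "/dfs/siteB/shared/results.dat";  // hypothetical
        char buffer[4096];

        int fd = open(remote_path, O_RDONLY);          // same open() as for a local file
        if (fd < 0) { perror("open"); return 1; }

        ssize_t n = read(fd, buffer, sizeof(buffer));  // same read(); location is transparent
        if (n < 0) { perror("read"); close(fd); return 1; }

        std::printf("read %zd bytes from a file that may live at another site\n", n);
        close(fd);
        return 0;
    }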
A more powerful model is shared object spaces. Instead
of just files, all entities, files as well as executing tasks,
can be named and shared between users (subject to access control
restrictions). This merging of "files" and "objects"
is driven by the observation that the traditional distinction
between files and other objects is somewhat of an anachronism.
Files really are objects - they happen to live on a disk, so as
a consequence they are slower to access, and they persist when
the computer is turned off. In this model, a file-object is a
typed object with an interface. The interface can define object
properties such as its persistence, fault, synchronization, and
performance characteristics. Thus, not all files need be the same,
eliminating, for example, the need to provide Unix synchronization
semantics for all files even when many applications simply do
not require those semantics. Instead, the right semantics along
many dimensions can be selected on a file-by-file basis, and even
changed at run-time.
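A minimal sketch of what such a typed file-object interface might look like appears below; the class and policy names are purely illustrative and are not the actual Legion or ELFS interfaces.

    // Illustrative sketch only; names are hypothetical.
    #include <cstddef>

    enum class CoherencePolicy   { UnixSemantics, SessionSemantics, NoCoherence };
    enum class PersistencePolicy { Persistent, Transient };

    // A "file-object": a typed object whose interface exposes its semantics.
    class FileObject {
    public:
        virtual ~FileObject() = default;

        // Data access, as with a traditional file.
        virtual std::size_t read(void* buf, std::size_t bytes, std::size_t offset) = 0;
        virtual std::size_t write(const void* buf, std::size_t bytes, std::size_t offset) = 0;

        // Per-object semantics, selectable on a file-by-file basis and
        // changeable at run time rather than fixed system-wide.
        virtual void setCoherence(CoherencePolicy p) = 0;
        virtual void setPersistence(PersistencePolicy p) = 0;
        virtual CoherencePolicy coherence() const = 0;
    };

    // A scratch file used by a single task might select NoCoherence,
    // while a shared checkpoint file might keep UnixSemantics.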
The notion of file objects can be extended beyond traditional Unix-like files. ELFS, an ExtensibLe File System, provides language and system support for defining new sets of file abstractions of this kind.
Queuing systems are a part of everyday life at production
supercomputer centers. The basic idea is simple: rather than interactively
starting a job, the user submits a description of the job to a
program (the queuing system), and that program schedules the job
to execute at some point in the future. The purpose of the queuing
system is to optimize some objective function such as system utilization,
minimum average delay, or a priority scheme. There are over twenty-five
different queuing systems in use today. Examples include NQS,
PBS, Condor, Codine, and many others.
Just as contemporary queuing systems can be used
to manage homogeneous resources at a particular site, future queuing
systems will be able to manage heterogeneous resources at multiple
sites. This will enable the user to submit a job from any site
and, subject to access control, have the job automatically scheduled
somewhere in the metasystem. Further, if binaries for the application
are available for multiple architectures the queuing system will
have a choice of platforms on which to schedule the job. The advantage
of this over the current state of practice is significant. Often
a user with multiple accounts at different sites must "shop
around" among the sites, looking for the shortest queue in
which to run a job. This consumes the most precious resource,
user time, which could be better spent doing science.
A multi-site queuing system has the opportunity to provide better user satisfaction. Queuing theory points out that a single M/M/k queue provides better average delay than k independent M/M/1 queues. This is analogous to the difference between a single line at the bank versus one line for each teller and no opportunity to switch lines. Thus, multi-site queuing systems offer the promise of improved average delay.
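A worked example, using the standard M/M/1 and M/M/k (Erlang C) formulas and hypothetical numbers, suggests how large the effect can be. Take $k = 4$ sites, each with service rate $\mu = 1$ job/hour, and a total arrival rate of $\lambda = 3.6$ jobs/hour (utilization $\rho = 0.9$). With four independent M/M/1 queues, each receiving $\lambda/4 = 0.9$ jobs/hour, the mean time in system is

$$ W_{M/M/1} = \frac{1}{\mu - \lambda/k} = \frac{1}{1 - 0.9} = 10 \ \mathrm{hours}. $$

With a single M/M/4 queue over the same four servers, the Erlang C probability of queueing is $C(4, 3.6) \approx 0.79$, so

$$ W_{M/M/k} = \frac{C(k, \lambda/\mu)}{k\mu - \lambda} + \frac{1}{\mu} \approx \frac{0.79}{0.4} + 1 \approx 3.0 \ \mathrm{hours}, $$

more than a three-fold reduction in average time in system at the same load.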
Another opportunity presented by metasystems is connecting
multiple resources together for the purpose of executing a single
application, thereby providing the opportunity to execute mission-critical
problems of much larger scale than would otherwise be possible.
Not all problems will be able to exploit this capability. In order
to exploit the metasystem in this manner the application must
be latency tolerant - after all, it is at least 30 milliseconds from
California to Virginia. Applications that fall into this category
include parameter space problems, Monte Carlo methods, and many
regular finite difference methods.
Consider a 2D finite difference application, such
as PPM or an ocean model, that is to use two distributed-memory
MPPs at different sites connected by a thick communication pipe
running at OC-3 or OC-12. Further suppose that the first host
has twice as many processors as the second. Balancing the load
requires that the problem be decomposed so that
the first host holds twice as much of the data as the second. (In
general the scheduling problem can become quite complex. A simple
example is used here to illustrate the point.)
Given the decomposition shown below and information
on the size of the mesh points we can easily compute the bandwidth
requirements of the communications channel. Suppose the problem
is 10,000x10,000 and each mesh point has twelve double precision
floating point numbers. Then there are 96 bytes per mesh point,
and just under one megabyte of data must be transferred over the
wire on each iteration (represented by the thick line in the
figure below). Assuming coast-to-coast communication, an OC-3
channel (155 Mb/s), and 66% channel utilization, the time to transmit
the boundary layer is approximately 130 ms. For some applications,
such as those with an embedded implicit solver, that is likely
to be too long. However for others 130 ms is acceptable - particularly
if asynchronous send-ahead is used and the communication can overlap
with computation.
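The arithmetic behind a figure of this magnitude can be reconstructed roughly as follows (a sketch; the exact number depends on protocol overheads and on how much of the exchange is overlapped):

$$ 10{,}000 \ \mathrm{points} \times 96 \ \mathrm{B/point} = 960{,}000 \ \mathrm{B} \approx 7.7 \ \mathrm{Mb\ per\ boundary\ row}, $$
$$ B_{\mathrm{eff}} = 0.66 \times 155 \ \mathrm{Mb/s} \approx 102 \ \mathrm{Mb/s}, \qquad t_{\mathrm{xfer}} = \frac{7.7 \ \mathrm{Mb}}{102 \ \mathrm{Mb/s}} \approx 75 \ \mathrm{ms\ per\ direction}. $$

Adding coast-to-coast propagation delay (tens of milliseconds) and the transfer in the opposite direction brings the per-iteration communication time into the neighborhood of the 130 ms cited above.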
Meta-Applications
The most challenging class of applications for the metasystem is meta-applications. A meta-application is a multi-component application in which many of the components were previously executed as stand-alone applications.
A generic example is shown above. Three previously
stand-alone applications have been connected together to form
a larger, single application. Of course each of the components
may have hardware affinities, such as a vectorizable code that
"wants" to run on a vector machine such as Cray T90.
Components also may have parallel implementations, such as a data
parallel or task parallel implementation.
A more concrete example is a multi-resolution weather/climate
model being constructed within NPACI as a collaboration between
UCLA, UVa, SDSC, and UCB. The application consists of a coupled
atmosphere/ocean general circulation model, a meso-scale weather
model of the western United States, and smaller scale models of
bays and estuaries within the western United States.
Characteristics.
Component characteristics.
Meta-application components are often written by, and maintained
by, geographically separate research groups. Further, the components
may be written in different languages, or perhaps using different
parallel dialects of the same language, e.g., Fortran components
using MPI, PVM, and HPF. The components often represent valuable
intellectual property, and their owners may want to keep the code
in-house to protect it; frequently, too, the code is really
understood only by its authors. The challenge to the metasystem
is to facilitate inter-component communication when the components
are in different languages, and to transparently migrate binaries
from site to site.
Geographically distributed databases.
Data usually resides physically close to the research group that
collects and maintains the data. Thus, before a coupled model execution
may begin in today's environment, all of the data must first be
copied to a single location, where, unfortunately, it may not
fit. The challenge to the metasystem is to determine when it makes
sense to move the computation to the data, and when it makes sense
to move the data to the computation.
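A first-order way to frame the decision is sketched below; the symbols are introduced here purely for illustration. Let $S_d$ be the size of the data, $S_p$ the size of the program and its state, $B$ the wide-area bandwidth, and $T_{\mathrm{compute}}$ and $T_{\mathrm{data}}$ the execution times at the compute site and at the data's site, respectively. Moving the data wins roughly when

$$ \frac{S_d}{B} + T_{\mathrm{compute}} \;<\; \frac{S_p}{B} + T_{\mathrm{data}}, $$

and moving the computation wins otherwise; in practice the decision must also respect storage capacity at the receiving site (the data may simply not fit) and any access-control constraints on the data.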
Fault-tolerance. As the
number of hosts and processors participating in the computation
increases, the probability of a failure increases and the mean
time to failure decreases. When the mean time to failure becomes
less than the time to complete a run of the application, the application
will never finish. It will start, fail, re-start, fail, and so
on. Fault tolerance is a necessity under these circumstances.
Security. Security is
a word that covers a range of issues: authentication, access control,
encryption, and others. Clearly, different users have different
requirements. Suppose, though, that the meta-application is using
geographically distributed databases. Then those databases must
be protected against unauthorized access and update. Similarly,
the distributed components' results must be protected from tampering
or destruction.
Scheduling. Scheduling
of meta-applications presents a number of challenges. First, the
general scheduling problem of mapping an arbitrary task graph
onto more than three processors is known to be NP-hard. This means
that no known optimal algorithm runs in better than exponential time,
so optimal algorithms are impractical. Instead, application-class-specific
heuristics must be used that exploit knowledge about the application
(the "shape" of the graph). Even then the scheduling
problem is difficult. Consider scheduling a simple meta-application
with three data-parallel components onto a single distributed
memory MPP. First, the number of processors allocated to each
component must be chosen to ensure that each component progresses
at the same rate; that may require giving each component a different
number of processors. Second, the component tasks must be mapped
to the processors in such a manner as to reduce the communication
load; random placement may lead to communication bottlenecks.
Finally, the computational requirements of the components may
vary over time, requiring dynamic re-partitioning of the available
resources.
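The "equal progress" requirement can be made concrete with a simple model (the symbols are illustrative). If component $i$ performs $W_i$ units of work per coupling step and achieves a per-processor rate $r_i$ on the target machine, then on $p_i$ processors its time per step is $T_i = W_i / (p_i r_i)$. Keeping the components in lock-step means equalizing the $T_i$, which, ignoring integrality and communication costs, gives

$$ p_i = P \cdot \frac{W_i / r_i}{\sum_j W_j / r_j}, $$

where $P$ is the total number of processors; components with more work, or poorer per-processor performance, receive proportionally more processors.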
Visualization. During
the course of the computation the user may want to see what is
happening, both at the global scale and at smaller scales. Further,
he or she may wish to "look inside" individual models
at particular aspects of that model, and potentially steer the
computation by modifying some values. This is complicated by the
fact that the user may be physically far removed from the computation.
Thus, a meta-application may be composed of multiple models, running at different scales, written by different research groups using different languages or parallel processing tools, using proprietary geographically distributed databases, on faulty hardware. The requirement is for a system that cleanly and easily supports interoperability between components, the "plug and play" composition of components into a running program, and the scheduling of components onto processing resources (both within a large MPP and between hosts). Further, the system must support transparent, secure, and efficient access to remote databases, while providing a robust integrated visualization capability.
The next several years will see the widespread deployment of metasystems software. Both NSF PACIs (NPACI and NCSA) are committed to deploying metasystems. Similarly, NASA is seriously considering a major commitment to metacomputing via the Information Power Grid (IPG) project. While the architecture and capabilities of the metasystems of the next five years will vary significantly, certain minimum capabilities can be taken for granted.
Beyond the above there is a significant difference
in the architectures of the two largest ongoing metasystems projects,
Legion (University of Virginia) and Globus (USC and ANL). In
a nutshell, Legion is a complete solution based on a unifying
object model and a commitment to site autonomy and user flexibility,
whereas Globus is a sum-of-services architecture that began with
a communication service (Nexus) and added one service at a time.
Below I have included short descriptions of each project.
Legion is a reflective object-based metasystem developed
at the University of Virginia. From the project's beginning in
late 1993, the Legion Research Group's goal has been a highly usable,
open, efficient, and scalable system founded on solid principles.
We have been guided by our own work in object-oriented parallel
processing, distributed computing, and security, as well as by
decades of research in distributed computing systems. Legion provides
a single, coherent, virtual machine that addresses key issues
such as scalability, programming ease, fault tolerance, security,
site autonomy, etc.
We have laid out ten design objectives that are central
to the project's success: site autonomy; an extensible core; scalability;
an easy-to-use, seamless computational environment; high performance
via parallelism; a single persistent name space; security for
both users and resource providers; management and exploitation
of resource heterogeneity; multi-language support and interoperability;
and fault tolerance.
In addition to these goals, three constraints restrict
our design.
Legion is a reflective object-based system that empowers
classes and meta-classes with system-level responsibility. Legion
users will require a wide range of services in many different
dimensions, including security, performance, and functionality.
No single policy or static set of policies will satisfy every
user, so, whenever possible, users must be able to decide what
trade-offs are necessary and desirable. Several characteristics
of the Legion architecture reflect and support this philosophy.
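As a loose illustration of what "classes with system-level responsibility" can mean (a sketch in C++, not Legion's actual interfaces; all names are hypothetical), a class object, rather than a fixed kernel, can own the placement, location, and fault-handling policies for its instances, and a user can substitute a different class to obtain different trade-offs.

    // Loose illustration only; the names are hypothetical, not Legion's API.
    #include <string>

    struct ObjectRef { std::string name; std::string host; };  // location-independent handle

    // In this sketch, a class object decides where its instances are placed,
    // how they are located, and how their faults are handled.
    class ClassObject {
    public:
        virtual ~ClassObject() = default;
        virtual ObjectRef createInstance() = 0;                      // placement policy lives here
        virtual ObjectRef locateInstance(const std::string& name) = 0;
        virtual void restartInstance(const ObjectRef& failed) = 0;   // class-chosen fault policy
    };

    // A user (or a new metaclass) can subclass to trade off placement,
    // fault handling, and security differently for different object types,
    // rather than accepting one system-wide policy.
    class ReplicatedClass : public ClassObject {
    public:
        ObjectRef createInstance() override {
            // e.g., create a primary and a backup on different hosts (sketch only)
            return ObjectRef{"obj-01", "hostA"};
        }
        ObjectRef locateInstance(const std::string& name) override {
            return ObjectRef{name, "hostA"};
        }
        void restartInstance(const ObjectRef&) override {
            // fail over to the backup copy (omitted)
        }
    };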
A central goal of the Globus project is the development and deployment of a distributed supercomputing infrastructure toolkit providing an integrated set of services in five areas: communication, information, resource location/allocation, security, and data access. For each component, we provide an interface specification defining the functionality provided by the service, a set of implementation notes explaining how the service is implemented in different environments, and one or more reference implementations. Brief descriptions of each service follow:
1. Globus communication services (the Nexus communication library) provide unicast and multicast message delivery services, designed to permit the efficient implementation of a wide range of communication methods.
2. Globus information services (the Metacomputing Directory Service) provide a uniform mechanism for obtaining real-time information about system structure and status.
3. Globus resource location and allocation services provide mechanisms for locating appropriate computational resources and allocating them to computations.
4. Globus data access services provide high-speed remote access to persistent storage such as files.
5. Globus security services provide authentication and authorization.