Metasystems - The Next Five Years

Andrew Grimshaw

University of Virginia

grimshaw@cs.virginia.edu

Abstract:

The following white paper is in draft form. It offers an introduction to metasystems, as well as a sense of their practical use in both current and future computing. The reader who already has a good handle on metasystems may wish to skip ahead to the section "The Near-Term Future".

What Is A Metasystem?

I am often asked by applications programmers "what exactly is a metasystem, anyway?" Physically, a metasystem is a collection of geographically separated resources (people, hosts, instruments, databases) connected by a high-speed interconnect. One might think that we already have a metasystem of sorts in the Internet, or in clusters of computers; a metasystem, however, is far more than computers connected by fiber. What distinguishes a metasystem from a collection of computers is the software layer, often called middleware, that transforms a collection of independent hosts into a single, coherent virtual machine. To the user, that virtual machine is a single entity that executes applications, schedules application components, detects and recovers from faults, provides the necessary protection and security to both users and resource owners, and brings users together through enhanced collaboration and sharing.

The astute reader will note that I did not use the word "high-performance." A system need not be high-performance to be considered a metasystem (although to be useful to scientific users it must be). A metasystem's importance lies in its single interface, its transparency to faults, and the shared nature of the machine. Having said that, for the remainder of this white paper we will assume that we are discussing high-performance metasystems.



So why don't we have metasystems today? As usual, the fundamental difficulty in constructing a metasystem is software - specifically, an inadequate conceptual model for metasystem software design. In the face of the onrush of hardware, the computing community has tried to stretch an existing paradigm - interacting autonomous hosts - into a regime that it was not designed to handle. The result is a collection of partial solutions - some good in isolation, but lacking coherence and scalability - that makes the development of even a single wide-area application demanding at best, and nearly impossible at worst.

Thus, the challenge to the computer science community is to provide a solid, integrated foundation on which to build applications that can unleash the potential of so many diverse resources. The foundation must hide the underlying physical infrastructure from users and from the vast majority of programmers; support access, location, and fault transparency; enable interoperability of components; support the construction of larger integrated components from existing components; provide a secure environment for both resource owners and users; and scale to millions of autonomous hosts.

The technology to meet this challenge largely exists.

Our vision of a metasystem is a system consisting of millions of hosts and trillions of objects co-existing in a loose confederation tied together with high-speed links. The user will have the illusion of a very powerful computer on her desk. She will sit at her terminal and manipulate objects. The objects she manipulates will represent data resources such as digital libraries or video streams, applications such as teleconferencing or physical simulations, and physical devices such as cameras, telescopes, or linear accelerators. Naturally the objects being manipulated may be shared with other users, allowing the construction of shared virtual workspaces.

It is the metasystem's responsibility to support the abstraction presented to the user: to transparently schedule application components on processors; to manage data migration, caching, transfer, and coercion; to detect and manage faults; and to ensure that the user's data and physical resources are adequately protected.

The potential benefits of a metasystem to the scientific community are enormous.

What Can I Do With a Metasystem?

There are many ways a metasystem may be used, ranging from services that are relatively simple to imagine and implement to capabilities that make currently impossible problems tractable. Below I briefly sketch four potential uses of a metasystem that span this spectrum and illustrate the possibilities.

Shared Persistent Object (file) Space

The simplest service a metasystem provides is location transparency for file and object access. This is a well-understood capability, known to most as a distributed file system. For example, a user program running on site A can access a file at site B using the same mechanisms (open, close, read, write) used to access files at site A. Further, the same file may be shared by more than one user and by more than one task. Issues that must be addressed in such a distributed file system include naming, caching, replication, coherence of caches and replicas, and so on.

A more powerful model is a shared object space. Instead of just files, all entities - files as well as executing tasks - can be named and shared between users (subject to access-control restrictions). This merging of "files" and "objects" is driven by the observation that the traditional distinction between files and other objects is something of an anachronism. Files really are objects; they happen to live on disk, so as a consequence they are slower to access, and they persist when the computer is turned off. In this model, a file-object is a typed object with an interface. The interface can define object properties such as persistence, fault, synchronization, and performance characteristics. Thus, not all files need be the same, eliminating, for example, the need to provide Unix synchronization semantics for all files even when many applications simply do not require those semantics. Instead, the right semantics along many dimensions can be selected on a file-by-file basis, and even changed at run time.
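
As a concrete illustration, below is a minimal sketch of what such a typed file-object interface might look like. The class and method names are hypothetical - an illustration of the idea, not the interface of Legion, ELFS, or any other existing system.

    // Sketch of a typed file-object whose semantics are selected per object
    // rather than imposed system-wide.  All names are illustrative only.
    #include <cstddef>
    #include <string>

    enum class Coherence { None, Unix, ReleaseConsistency };

    class FileObject {
    public:
        virtual ~FileObject() = default;

        // Traditional byte-stream operations.  Location transparency means
        // the object may live on any host in the metasystem.
        virtual std::size_t read(void* buf, std::size_t n) = 0;
        virtual std::size_t write(const void* buf, std::size_t n) = 0;

        // Properties chosen on a file-by-file basis, even at run time: pay
        // for Unix coherence semantics only when the application needs them.
        virtual void setCoherence(Coherence c) = 0;
        virtual void setPersistent(bool onDisk) = 0;
    };

    // Open by metasystem-wide name; whether the bytes are local, remote,
    // cached, or replicated is hidden behind the interface.
    FileObject* openFileObject(const std::string& name);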

The notion of file objects can be extended beyond traditional Unix-like files. ELFS, an ExtensibLe File System, provides language and system support for a new set of file abstractions. ELFS supports:

Wide Area Queuing System

Queuing systems are a part of everyday life at production supercomputer centers. The basic idea is simple: rather than interactively starting a job, the user submits a description of the job to a program (the queuing system), and that program schedules the job to execute at some point in the future. The purpose of the queuing system is to optimize some objective function, such as system utilization, minimum average delay, or a priority scheme. There are over twenty-five different queuing systems in use today, including NQS, PBS, Condor, and Codine.

Just as contemporary queuing systems can be used to manage homogeneous resources at a particular site, future queuing systems will be able to manage heterogeneous resources at multiple sites. This will enable the user to submit a job from any site and, subject to access control, have the job automatically scheduled somewhere in the metasystem. Further, if binaries for the application are available for multiple architectures, the queuing system will have a choice of platforms on which to schedule the job (see the sketch below). The advantage of this over the current state of practice is significant. Today a user with multiple accounts at different sites must often "shop around" among the sites, looking for the shortest queue in which to run a job. This consumes the most precious resource of all - user time - which could be better spent doing science.
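
As a concrete, simplified illustration, the sketch below shows the kind of information a wide-area job description might carry. The structure and field names are hypothetical, not those of any existing queuing system.

    // Hypothetical job description for a multi-site queuing system.  Beyond
    // the usual per-site fields, it lists the architectures for which
    // binaries exist, giving the scheduler a choice of target platforms.
    #include <map>
    #include <string>
    #include <vector>

    struct JobDescription {
        std::map<std::string, std::string> binaries;  // architecture -> binary
        int         processors     = 16;   // processors requested
        double      estimatedHours = 4.0;  // expected run time
        std::string inputData;             // input name in the shared space
        std::vector<std::string> allowedSites;  // access-control restrictions
    };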

A multi-site queuing system also has the opportunity to provide better user satisfaction. Queuing theory tells us that a single M/M/k queue provides better average delay than k independent M/M/1 queues - analogous to the difference between a single line at the bank and one line per teller with no opportunity to switch lines. Multi-site queuing systems thus offer the promise of improved average delay.
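
In standard queueing notation the comparison can be made precise. Below is a sketch of the textbook argument, assuming Poisson arrivals at total rate lambda, exponential service at rate mu per server, and stability (lambda < k mu):

    \[
      T_{k \times M/M/1} \;=\; \frac{1}{\mu - \lambda/k} \;=\; \frac{k}{k\mu - \lambda},
      \qquad
      T_{M/M/k} \;=\; \frac{1}{\mu} + \frac{C(k, \lambda/\mu)}{k\mu - \lambda},
    \]

where C(k, a) is the Erlang-C probability that an arriving job must wait. Since C(k, a) <= a for all k >= 1 (pooling servers at a fixed offered load a = lambda/mu can only reduce the chance of waiting), the M/M/k delay is never worse, and its advantage grows as the load approaches saturation.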

Wide Area Parallel Processing

Another opportunity presented by metasystems is connecting multiple resources together for the purpose of executing a single application, thereby providing the opportunity to execute mission-critical problems of much larger scale than would otherwise be possible. Not all problems will be able to exploit this capability: to exploit the metasystem in this manner an application must be latency tolerant - after all, it is at least 30 milliseconds from California to Virginia. Applications that fall into this category include parameter-space studies, Monte Carlo simulations, and many regular finite-difference methods.

Consider a 2D finite-difference application, such as PPM or an ocean model, that is to use two distributed-memory MPPs at different sites connected by a thick communication pipe running at OC-3 or OC-12 rates. Further suppose that the first host has twice as many processors as the second. Balancing the load then requires that the problem be decomposed so that the first host holds twice as much of the data as the second. (In general the scheduling problem can become quite complex; a simple example is used here to illustrate the point.)

Given the decomposition shown below and information on the size of the mesh points, we can easily compute the bandwidth requirements of the communication channel. Suppose the problem is 10,000 x 10,000 and each mesh point carries twelve double-precision floating-point numbers. Then there are 96 bytes per mesh point, and just under one megabyte of data must be transferred over the wire on each iteration (represented by the thick line in the figure below). Assuming coast-to-coast communication, an OC-3 channel (155 Mb/s), and 66% channel utilization, the time to transmit the boundary layer is approximately 130 ms. For some applications, such as those with an embedded implicit solver, that is likely to be too long. For others, however, 130 ms is acceptable - particularly if asynchronous send-ahead is used and the communication can overlap with computation.

[Figure: the 2:1 decomposition of the 10,000 x 10,000 mesh across the two MPPs; the thick line marks the boundary data exchanged on each iteration.]
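
The arithmetic is easy to make explicit. Below is a back-of-the-envelope sketch using the figures from the text; how much protocol overhead to charge, and whether both hosts' ghost rows share the pipe, are assumptions on my part, which is why the computed range brackets rather than reproduces the 130 ms estimate.

    // Back-of-the-envelope check of the boundary-exchange cost, using the
    // constants from the example above.
    #include <cstdio>

    int main() {
        const double pointsPerRow = 10000.0;        // mesh points in one boundary row
        const double bytesPerPt   = 96.0;           // 12 doubles per mesh point
        const double oc3Bps       = 155e6;          // OC-3 line rate, bits/second
        const double usableBps    = 0.66 * oc3Bps;  // 66% channel utilization

        const double rowBits = pointsPerRow * bytesPerPt * 8.0;  // ~7.7 Mbits

        const double oneWay = rowBits / usableBps;  // one row, one direction
        std::printf("one row, one way: %6.1f ms\n", oneWay * 1e3);  // ~75 ms
        // Exchanging ghost rows in both directions over the shared pipe, plus
        // roughly 30 ms of coast-to-coast latency, pushes the per-iteration
        // cost toward 180 ms; the text's 130 ms estimate sits in this regime.
        std::printf("both ways + RTT : %6.1f ms\n", (2.0 * oneWay + 0.030) * 1e3);
        return 0;
    }
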
Meta-Applications

The most challenging class of applications for the metasystem is meta-applications. A meta-application is a multi-component application in which many of the components were previously executed as stand-alone applications.

Consider a generic example in which three previously stand-alone applications are connected together to form a larger, single application. Each of the components may have hardware affinities, such as a vectorizable code that "wants" to run on a vector machine such as a Cray T90. Components may also have parallel implementations, such as a data-parallel or task-parallel implementation.

A more concrete example is the multi-resolution weather/climate model being constructed within NPACI as a collaboration among UCLA, UVa, SDSC, and UCB. The application consists of a coupled atmosphere/ocean general circulation model, a meso-scale weather model of the western United States, and smaller-scale models of bays and estuaries within the western United States.

Characteristics.

Component characteristics. Meta-application components are often written, and maintained, by geographically separate research groups. Further, the components may be written in different languages, or perhaps in different parallel dialects of the same language, e.g., Fortran components using MPI, PVM, and HPF. The components often represent valuable intellectual property, and their owners may want to keep the code in-house to protect it; frequently, too, the code is really understood only by its authors. The challenge to the metasystem is to facilitate inter-component communication when the components are written in different languages, and to transparently migrate binaries from site to site.

Geographically distributed databases. Data usually resides physically close to the research group that collects and maintains it. Thus, before a coupled-model execution may begin in today's environment, all of the data must first be copied to a single location - where, unfortunately, it may not fit. The challenge to the metasystem is to determine when it makes sense to move the computation to the data, and when it makes sense to move the data to the computation.
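
To first order, the decision reduces to comparing transfer costs. The inequality below is a simple model of my own construction, ignoring contention and assuming the quantities can be estimated:

    \[
      \frac{S_{\mathrm{data}}}{B} + T_{\mathrm{stage}}
      \;\lessgtr\;
      \frac{S_{\mathrm{code}}}{B} + T_{\mathrm{remote}},
    \]

where S_data and S_code are the bytes that must be shipped in each case, B is the available wide-area bandwidth, T_stage is the cost of staging the data (effectively infinite if it does not fit), and T_remote is any slowdown from running the computation at the data's site. When the left-hand side is smaller, copy the data; otherwise send the computation. Since S_data is commonly orders of magnitude larger than S_code, moving the computation often wins.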

Fault tolerance. As the number of hosts and processors participating in the computation increases, the probability of a failure increases and the mean time to failure decreases. When the mean time to failure becomes less than the time to complete a run of the application, the application will never finish: it will start, fail, restart, fail, and so on. Fault tolerance is a necessity under these circumstances.
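
The argument is easy to quantify with a standard reliability model. A sketch, assuming independent hosts with exponentially distributed times to failure, where any single failure kills an uncheckpointed run (the numbers below are illustrative, not measurements):

    \[
      \mathrm{MTTF}_{\mathrm{system}} \;=\; \frac{\mathrm{MTTF}_{\mathrm{host}}}{N},
      \qquad
      \Pr[\,\text{a run of length } T \text{ completes}\,] \;=\; e^{-NT/\mathrm{MTTF}_{\mathrm{host}}}.
    \]

For example, with N = 256 hosts whose individual MTTF is 30 days (720 hours), the system fails on average every 2.8 hours, and a 10-hour run finishes without interruption with probability e^{-3.6}, i.e., less than 3% of the time - hence the need for checkpointing and restart.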

Security. Security is a word that covers a range of issues: authentication, access control, encryption, and others. Clearly, different users have different requirements. Suppose, though, that the meta-application uses geographically distributed databases. Then those databases must be protected against unauthorized access and update. Similarly, the distributed components' results must be protected from tampering or destruction.

Scheduling. Scheduling of meta-applications presents a number of challenges. First, the general scheduling problem of mapping an arbitrary task graph onto more than three processors is known to be NP-hard, so optimal algorithms are computationally intractable in practice. Instead, application-class-specific heuristics must be used that exploit knowledge about the application (the "shape" of the graph). Even then the scheduling problem is difficult. Consider scheduling a simple meta-application with three data-parallel components onto a single distributed-memory MPP. First, the number of processors allocated to each component must be chosen to ensure that all components progress at the same rate; that may require giving each component a different number of processors. Second, the component tasks must be mapped to the processors so as to reduce the communication load; random placement may lead to communication bottlenecks. Finally, the computational requirements of the components may vary over time, requiring dynamic re-partitioning of the available resources.
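
For the first of these steps - choosing processor counts so that the components progress at the same rate - a minimal sketch follows. The component work estimates are hypothetical; in a real scheduler they would come from performance models or measurement.

    // Allocate processors to data-parallel components in proportion to each
    // component's per-iteration work, so that all components advance at
    // roughly the same rate.
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    int main() {
        const int totalProcs = 512;                  // processors in the MPP
        std::vector<double> work = {6.0, 3.0, 1.0};  // relative work per iteration

        double sum = 0.0;
        for (double w : work) sum += w;

        for (std::size_t i = 0; i < work.size(); ++i) {
            int procs = static_cast<int>(totalProcs * work[i] / sum + 0.5);
            std::printf("component %zu: %d processors\n", i, procs);
        }
        // Rounding can over- or under-allocate by a processor or two, and the
        // work estimates drift over time, so a real scheduler must also
        // re-partition dynamically (the third challenge above).
        return 0;
    }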

Visualization. During the course of the computation the user may want to see what is happening, both at the global scale and at smaller scales. Further, he or she may wish to "look inside" individual models at particular aspects of that model, and potentially steer the computation by modifying some values. This is complicated by the fact that the user may be physically far removed from the computation.

Thus, a meta-application may be composed of multiple models, running at different scales, written by different research groups using different languages or parallel processing tools, using proprietary geographically distributed databases, on faulty hardware. The requirement is for a system that cleanly and easily supports interoperability between components, the "plug and play" composition of components into a running program, and the scheduling of components onto processing resources (both within a large MPP and between hosts). Further, the system must support transparent, secure, and efficient access to remote databases, while providing a robust integrated visualization capability.

The Near-Term Future

The next several years will see the widespread deployment of metasystems software. Both NSF PACIs (NPACI and NCSA) are committed to deploying metasystems. Similarly, NASA is seriously considering a major commitment to metacomputing via the Information Power Grid (IPG) project. While the architectures and capabilities of the metasystems of the next five years will vary significantly, certain minimum capabilities are assured. These include:

Beyond the above, there is a significant difference in the architectures of the two largest ongoing metasystems projects, Legion (University of Virginia) and Globus (USC and ANL). In a nutshell, Legion is a complete solution based on a unifying object model and a commitment to site autonomy and user flexibility, whereas Globus is a sum-of-services architecture that began with a communication service (Nexus) and has added one service at a time. Below I have included a short description of each project.

Appendix

Legion

Legion is a reflective, object-based metasystem developed at the University of Virginia. Since the project's beginning in late 1993, the Legion Research Group's goal has been a highly usable, open, efficient, and scalable system founded on solid principles. We have been guided by our own work in object-oriented parallel processing, distributed computing, and security, as well as by decades of research in distributed computing systems. Legion provides a single, coherent virtual machine that addresses key issues such as scalability, programming ease, fault tolerance, security, and site autonomy.

We have laid out ten design objectives that are central to the project's success: site autonomy; an extensible core; scalability; an easy-to-use, seamless computational environment; high performance via parallelism; a single persistent name space; security for both users and resource providers; management and exploitation of resource heterogeneity; multi-language support and inter-operability; and fault tolerance.



In addition to these goals, three constraints restrict our design:

Legion System Architecture

Legion is a reflective object-based system that empowers classes and meta-classes with system-level responsibility. Legion users will require a wide range of services in many different dimensions, including security, performance, and functionality. No single policy or static set of policies will satisfy every user, so, whenever possible, users must be able to decide what trade-offs are necessary and desirable. Several characteristics of the Legion architecture reflect and support this philosophy.

Globus [This section taken from http://www.globus.org/globus/information.htm]

Project Goals and Approach

A central goal of the Globus project is the development and deployment of a distributed supercomputing infrastructure toolkit providing an integrated set of services in five areas: communication, information, resource location/allocation, security, and data access. For each component we provide an interface specification defining the functionality of the service, a set of implementation notes explaining how the service is implemented in different environments, and one or more reference implementations. The five services are described briefly below:

1. Globus communication services (the Nexus communication library) provide unicast and multicast message delivery services, designed to permit the efficient implementation of a wide range of communication methods.

2. Globus information services (the Metacomputing Directory Service) provide a uniform mechanism for obtaining real-time information about system structure and status.

3. Globus resource location and allocation services provide mechanisms for expressing application resource requirements, for identifying resources that meet these requirements, for scheduling resources once they have been located, and for initiating and managing computation on these resources.

4. Globus data access services provide high-speed remote access to persistent storage such as files.

5. Globus security services provide authentication and authorization services.