Metasystems - The Next Five Years

Andrew Grimshaw

University of Virginia

grimshaw@cs.virginia.edu

Abstract:

The following white paper is in draft form. It offers an introduction to metasystems, as well as a sense of their practical use in both current and future computing. The reader who already has a good handle on metasystems may wish to skip ahead to the section "The Near-Term Future".

What Is A Metasystem?

I am often asked by applications programmers "what exactly is a metasystem, anyway?" Physically, a metasystem is a collection of geographically separated resources (people, hosts, instruments, databases) connected by a high-speed interconnect. One might think that we already have a metasystem of sorts in the Internet, or in clusters of computers; a metasystem, however, is far more than computers connected by fiber. What distinguishes a metasystem from a collection of computers is the software layer, often called middleware, that transforms a collection of independent hosts into a single, coherent virtual machine. To the user, that virtual machine is a single entity that executes applications, schedules application components, detects and recovers from faults, provides the necessary protection and security to both users and resource owners, and brings users together through enhanced collaboration and sharing.

The astute reader will note that I did not use the word "high-performance." A system need not be high-performance to be considered a metasystem (although to be useful to scientific users it must be). A metasystem's importance lies in its single interface, its transparency to faults, and the shared nature of the machine. Having said that, for the remainder of this white paper we will assume that we are discussing high-performance metasystems.



So why don't we have metasystems today? As usual, the fundamental difficulty in constructing a metasystem is software - specifically, an inadequate conceptual model for metasystem software design. In the face of the onrush of hardware, the computing community has tried to stretch an existing paradigm - interacting autonomous hosts - into a regime that it was not designed to handle. The result is a collection of partial solutions - some good in isolation, but lacking coherence and scalability - that makes the development of even a single wide-area application demanding at best, and nearly impossible at worst.

Thus, the challenge to the computer science community is to provide a solid, integrated foundation on which to build applications that can unleash the potential of so many diverse resources. The foundation must hide the underlying physical infrastructure from users and from the vast majority of programmers; support access, location, and fault transparency; enable interoperability of components; support the construction of larger integrated components from existing components; provide a secure environment for both resource owners and users; and scale to millions of autonomous hosts.

The technology to meet this challenge largely exists.

Our vision of a metasystem is a system consisting of millions of hosts and trillions of objects co-existing in a loose confederation tied together with high-speed links. The user will have the illusion of a very powerful computer on her desk. She will sit at her terminal and manipulate objects. The objects she manipulates will represent data resources such as digital libraries or video streams, applications such as teleconferencing or physical simulations, and physical devices such as cameras, telescopes, or linear accelerators. Naturally the objects being manipulated may be shared with other users, allowing the construction of shared virtual workspaces.

It is the metasystem's responsibility to support the abstraction presented to the user: to transparently schedule application components on processors; to manage data migration, caching, transfer, and coercion; to detect and manage faults; and to ensure that the user's data and physical resources are adequately protected.

The potential benefits of a metasystem to the scientific community are enormous.

What Can I Do With a Metasystem?

There are many ways a metasystem may be used, ranging from services that are relatively simple to imagine and implement to capabilities that make currently impossible problems tractable. Below I briefly sketch four potential uses of a metasystem that span this spectrum and illustrate the possibilities.

Shared Persistent Object (file) Space

The simplest service a metasystem provides is location transparency for file and object access. This is a well-understood capability, known to most as a distributed file system. For example, a user program running on site A can access a file at site B using the same mechanisms (open, close, read, write) used to access files at site A. Further, the same file may be shared by more than one user and by more than one task. Issues that must be addressed in such a distributed file system include naming, caching, replication, coherence of caches and replicas, and so on.

A more powerful model is a shared object space. Instead of just files, all entities - files as well as executing tasks - can be named and shared between users (subject to access-control restrictions). This merging of "files" and "objects" is driven by the observation that the traditional distinction between files and other objects is something of an anachronism. Files really are objects; they happen to live on disk, so as a consequence they are slower to access, and they persist when the computer is turned off. In this model, a file-object is a typed object with an interface. The interface can define object properties such as persistence, fault, synchronization, and performance characteristics. Thus, not all files need be the same, eliminating, for example, the need to provide Unix synchronization semantics for all files even when many applications simply do not require those semantics. Instead, the right semantics along many dimensions can be selected on a file-by-file basis, and even changed at run time.
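
As a concrete illustration, below is a minimal sketch of what such a typed file-object interface might look like. The class and method names are hypothetical - an illustration of the idea, not the interface of Legion, ELFS, or any other existing system.

    // Sketch of a typed file-object whose semantics are selected per object
    // rather than imposed system-wide.  All names are illustrative only.
    #include <cstddef>
    #include <string>

    enum class Coherence { None, Unix, ReleaseConsistency };

    class FileObject {
    public:
        virtual ~FileObject() = default;

        // Traditional byte-stream operations.  Location transparency means
        // the object may live on any host in the metasystem.
        virtual std::size_t read(void* buf, std::size_t n) = 0;
        virtual std::size_t write(const void* buf, std::size_t n) = 0;

        // Properties chosen on a file-by-file basis, even at run time: pay
        // for Unix coherence semantics only when the application needs them.
        virtual void setCoherence(Coherence c) = 0;
        virtual void setPersistent(bool onDisk) = 0;
    };

    // Open by metasystem-wide name; whether the bytes are local, remote,
    // cached, or replicated is hidden behind the interface.
    FileObject* openFileObject(const std::string& name);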

The notion of file objects can be extended beyond traditional Unix-like files. ELFS, an ExtensibLe File System, provides language and system support for a new set of file abstractions. ELFS supports:

Wide Area Queuing System

Queuing systems are a part of everyday life at production supercomputer centers. The basic idea is simple: rather than interactively starting a job, the user submits a description of the job to a program (the queuing system), and that program schedules the job to execute at some point in the future. The purpose of the queuing system is to optimize some objective function, such as system utilization, minimum average delay, or a priority scheme. There are over twenty-five different queuing systems in use today, including NQS, PBS, Condor, and Codine.

Just as contemporary queuing systems can be used to manage homogeneous resources at a particular site, future queuing systems will be able to manage heterogeneous resources at multiple sites. This will enable the user to submit a job from any site and, subject to access control, have the job automatically scheduled somewhere in the metasystem. Further, if binaries for the application are available for multiple architectures, the queuing system will have a choice of platforms on which to schedule the job (see the sketch below). The advantage of this over the current state of practice is significant. Today a user with multiple accounts at different sites must often "shop around" among the sites, looking for the shortest queue in which to run a job. This consumes the most precious resource of all - user time - which could be better spent doing science.
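
As a concrete, simplified illustration, the sketch below shows the kind of information a wide-area job description might carry. The structure and field names are hypothetical, not those of any existing queuing system.

    // Hypothetical job description for a multi-site queuing system.  Beyond
    // the usual per-site fields, it lists the architectures for which
    // binaries exist, giving the scheduler a choice of target platforms.
    #include <map>
    #include <string>
    #include <vector>

    struct JobDescription {
        std::map<std::string, std::string> binaries;  // architecture -> binary
        int         processors     = 16;   // processors requested
        double      estimatedHours = 4.0;  // expected run time
        std::string inputData;             // input name in the shared space
        std::vector<std::string> allowedSites;  // access-control restrictions
    };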

A multi-site queuing system also has the opportunity to provide better user satisfaction. Queuing theory tells us that a single M/M/k queue provides better average delay than k independent M/M/1 queues - analogous to the difference between a single line at the bank and one line per teller with no opportunity to switch lines. Multi-site queuing systems thus offer the promise of improved average delay.
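
In standard queueing notation the comparison can be made precise. Below is a sketch of the textbook argument, assuming Poisson arrivals at total rate lambda, exponential service at rate mu per server, and stability (lambda < k mu):

    \[
      T_{k \times M/M/1} \;=\; \frac{1}{\mu - \lambda/k} \;=\; \frac{k}{k\mu - \lambda},
      \qquad
      T_{M/M/k} \;=\; \frac{1}{\mu} + \frac{C(k, \lambda/\mu)}{k\mu - \lambda},
    \]

where C(k, a) is the Erlang-C probability that an arriving job must wait. Since C(k, a) <= a for all k >= 1 (pooling servers at a fixed offered load a = lambda/mu can only reduce the chance of waiting), the M/M/k delay is never worse, and its advantage grows as the load approaches saturation.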

Wide Area Parallel Processing

Another opportunity presented by metasystems is connecting multiple resources together for the purpose of executing a single application, thereby providing the opportunity to execute mission-critical problems of much larger scale than would otherwise be possible. Not all problems will be able to exploit this capability: to exploit the metasystem in this manner an application must be latency tolerant - after all, it is at least 30 milliseconds from California to Virginia. Applications that fall into this category include parameter-space studies, Monte Carlo simulations, and many regular finite-difference methods.

Consider a 2D finite-difference application, such as PPM or an ocean model, that is to use two distributed-memory MPPs at different sites connected by a thick communication pipe running at OC-3 or OC-12 rates. Further suppose that the first host has twice as many processors as the second. Balancing the load then requires that the problem be decomposed so that the first host holds twice as much of the data as the second. (In general the scheduling problem can become quite complex; a simple example is used here to illustrate the point.)

Given the decomposition shown below and information on the size of the mesh points, we can easily compute the bandwidth requirements of the communication channel. Suppose the problem is 10,000 x 10,000 and each mesh point carries twelve double-precision floating-point numbers. Then there are 96 bytes per mesh point, and just under one megabyte of data must be transferred over the wire on each iteration (represented by the thick line in the figure below). Assuming coast-to-coast communication, an OC-3 channel (155 Mb/s), and 66% channel utilization, the time to transmit the boundary layer is approximately 130 ms. For some applications, such as those with an embedded implicit solver, that is likely to be too long. For others, however, 130 ms is acceptable - particularly if asynchronous send-ahead is used and the communication can overlap with computation.

[Figure: the 2:1 decomposition of the 10,000 x 10,000 mesh across the two MPPs; the thick line marks the boundary data exchanged on each iteration.]
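
The arithmetic is easy to make explicit. Below is a back-of-the-envelope sketch using the figures from the text; how much protocol overhead to charge, and whether both hosts' ghost rows share the pipe, are assumptions on my part, which is why the computed range brackets rather than reproduces the 130 ms estimate.

    // Back-of-the-envelope check of the boundary-exchange cost, using the
    // constants from the example above.
    #include <cstdio>

    int main() {
        const double pointsPerRow = 10000.0;        // mesh points in one boundary row
        const double bytesPerPt   = 96.0;           // 12 doubles per mesh point
        const double oc3Bps       = 155e6;          // OC-3 line rate, bits/second
        const double usableBps    = 0.66 * oc3Bps;  // 66% channel utilization

        const double rowBits = pointsPerRow * bytesPerPt * 8.0;  // ~7.7 Mbits

        const double oneWay = rowBits / usableBps;  // one row, one direction
        std::printf("one row, one way: %6.1f ms\n", oneWay * 1e3);  // ~75 ms
        // Exchanging ghost rows in both directions over the shared pipe, plus
        // roughly 30 ms of coast-to-coast latency, pushes the per-iteration
        // cost toward 180 ms; the text's 130 ms estimate sits in this regime.
        std::printf("both ways + RTT : %6.1f ms\n", (2.0 * oneWay + 0.030) * 1e3);
        return 0;
    }
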
Meta-Applications

The most challenging class of applications for the metasystem is meta-applications. A meta-application is a multi-component application in which many of the components were previously executed as stand-alone applications.

Consider a generic example in which three previously stand-alone applications are connected together to form a larger, single application. Each of the components may have hardware affinities, such as a vectorizable code that "wants" to run on a vector machine such as a Cray T90. Components may also have parallel implementations, such as a data-parallel or task-parallel implementation.

A more concrete example is the multi-resolution weather/climate model being constructed within NPACI as a collaboration among UCLA, UVa, SDSC, and UCB. The application consists of a coupled atmosphere/ocean general circulation model, a meso-scale weather model of the western United States, and smaller-scale models of bays and estuaries within the western United States.

Characteristics.

Component characteristics. Meta-application components are often written, and maintained, by geographically separate research groups. Further, the components may be written in different languages, or perhaps in different parallel dialects of the same language, e.g., Fortran components using MPI, PVM, and HPF. The components often represent valuable intellectual property, and their owners may want to keep the code in-house to protect it; frequently, too, the code is really understood only by its authors. The challenge to the metasystem is to facilitate inter-component communication when the components are written in different languages, and to transparently migrate binaries from site to site.

Geographically distributed databases. Data usually resides physically close to the research group that collects and maintains it. Thus, before a coupled-model execution may begin in today's environment, all of the data must first be copied to a single location - where, unfortunately, it may not fit. The challenge to the metasystem is to determine when it makes sense to move the computation to the data, and when it makes sense to move the data to the computation.
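
To first order, the decision reduces to comparing transfer costs. The inequality below is a simple model of my own construction, ignoring contention and assuming the quantities can be estimated:

    \[
      \frac{S_{\mathrm{data}}}{B} + T_{\mathrm{stage}}
      \;\lessgtr\;
      \frac{S_{\mathrm{code}}}{B} + T_{\mathrm{remote}},
    \]

where S_data and S_code are the bytes that must be shipped in each case, B is the available wide-area bandwidth, T_stage is the cost of staging the data (effectively infinite if it does not fit), and T_remote is any slowdown from running the computation at the data's site. When the left-hand side is smaller, copy the data; otherwise send the computation. Since S_data is commonly orders of magnitude larger than S_code, moving the computation often wins.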

Fault tolerance. As the number of hosts and processors participating in the computation increases, the probability of a failure increases and the mean time to failure decreases. When the mean time to failure becomes less than the time to complete a run of the application, the application will never finish: it will start, fail, restart, fail, and so on. Fault tolerance is a necessity under these circumstances.
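
The argument is easy to quantify with a standard reliability model. A sketch, assuming independent hosts with exponentially distributed times to failure, where any single failure kills an uncheckpointed run (the numbers below are illustrative, not measurements):

    \[
      \mathrm{MTTF}_{\mathrm{system}} \;=\; \frac{\mathrm{MTTF}_{\mathrm{host}}}{N},
      \qquad
      \Pr[\,\text{a run of length } T \text{ completes}\,] \;=\; e^{-NT/\mathrm{MTTF}_{\mathrm{host}}}.
    \]

For example, with N = 256 hosts whose individual MTTF is 30 days (720 hours), the system fails on average every 2.8 hours, and a 10-hour run finishes without interruption with probability e^{-3.6}, i.e., less than 3% of the time - hence the need for checkpointing and restart.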

Security. Security is a word that covers a range of issues: authentication, access control, encryption, and others. Clearly, different users have different requirements. Suppose, though, that the meta-application uses geographically distributed databases. Then those databases must be protected against unauthorized access and update. Similarly, the distributed components' results must be protected from tampering or destruction.

Scheduling. Scheduling of meta-applications presents a number of challenges. First, the general scheduling problem of mapping an arbitrary task graph onto more than three processors is known to be NP-hard, so optimal algorithms are computationally intractable in practice. Instead, application-class-specific heuristics must be used that exploit knowledge about the application (the "shape" of the graph). Even then the scheduling problem is difficult. Consider scheduling a simple meta-application with three data-parallel components onto a single distributed-memory MPP. First, the number of processors allocated to each component must be chosen to ensure that all components progress at the same rate; that may require giving each component a different number of processors. Second, the component tasks must be mapped to the processors so as to reduce the communication load; random placement may lead to communication bottlenecks. Finally, the computational requirements of the components may vary over time, requiring dynamic re-partitioning of the available resources.
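
For the first of these steps - choosing processor counts so that the components progress at the same rate - a minimal sketch follows. The component work estimates are hypothetical; in a real scheduler they would come from performance models or measurement.

    // Allocate processors to data-parallel components in proportion to each
    // component's per-iteration work, so that all components advance at
    // roughly the same rate.
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    int main() {
        const int totalProcs = 512;                  // processors in the MPP
        std::vector<double> work = {6.0, 3.0, 1.0};  // relative work per iteration

        double sum = 0.0;
        for (double w : work) sum += w;

        for (std::size_t i = 0; i < work.size(); ++i) {
            int procs = static_cast<int>(totalProcs * work[i] / sum + 0.5);
            std::printf("component %zu: %d processors\n", i, procs);
        }
        // Rounding can over- or under-allocate by a processor or two, and the
        // work estimates drift over time, so a real scheduler must also
        // re-partition dynamically (the third challenge above).
        return 0;
    }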

Visualization. During the course of the computation the user may want to see what is happening, both at the global scale and at smaller scales. Further, he or she may wish to "look inside" individual models at particular aspects of that model, and potentially steer the computation by modifying some values. This is complicated by the fact that the user may be physically far removed from the computation.

Thus, a meta-application may be composed of multiple models, running at different scales, written by different research groups using different languages or parallel processing tools, using proprietary geographically distributed databases, on faulty hardware. The requirement is for a system that cleanly and easily supports interoperability between components, the "plug and play" composition of components into a running program, and the scheduling of components onto processing resources (both within a large MPP and between hosts). Further, the system must support transparent, secure, and efficient access to remote databases, while providing a robust integrated visualization capability.

The Near-Term Future

The next several years will see the widespread deployment of metasystems software. Both NSF PACIs (NPACI and NCSA) are committed to deploying metasystems. Similarly, NASA is seriously considering a major commitment to metacomputing via the Information Power Grid (IPG) project. While the architectures and capabilities of the metasystems of the next five years will vary significantly, certain minimum capabilities are assured. These include:

Beyond the above, there is a significant difference in the architectures of the two largest ongoing metasystems projects, Legion (University of Virginia) and Globus (USC and ANL). In a nutshell, Legion is a complete solution based on a unifying object model and a commitment to site autonomy and user flexibility, whereas Globus is a sum-of-services architecture that began with a communication service (Nexus) and has added one service at a time. Below I have included a short description of each project.

Appendix

Legion

Legion is a reflective, object-based metasystem developed at the University of Virginia. Since the project's beginning in late 1993, the Legion Research Group's goal has been a highly usable, open, efficient, and scalable system founded on solid principles. We have been guided by our own work in object-oriented parallel processing, distributed computing, and security, as well as by decades of research in distributed computing systems. Legion provides a single, coherent virtual machine that addresses key issues such as scalability, programming ease, fault tolerance, security, and site autonomy.

We have laid out ten design objectives that are central to the project's success: site autonomy; an extensible core; scalability; an easy-to-use, seamless computational environment; high performance via parallelism; a single persistent name space; security for both users and resource providers; management and exploitation of resource heterogeneity; multi-language support and inter-operability; and fault tolerance.



In addition to these goals, three constraints restrict our design:

Legion System Architecture

Legion is a reflective object-based system that empowers classes and meta-classes with system-level responsibility. Legion users will require a wide range of services in many different dimensions, including security, performance, and functionality. No single policy or static set of policies will satisfy every user, so, whenever possible, users must be able to decide what trade-offs are necessary and desirable. Several characteristics of the Legion architecture reflect and support this philosophy.

Globus [This section taken from http://www.globus.org/globus/information.htm]

Project Goals and Approach

A central goal of the Globus project is the development and deployment of a distributed supercomputing infrastructure toolkit providing an integrated set of services in five areas: communication, information, resource location/allocation, security, and data access. For each component we provide an interface specification defining the functionality of the service, a set of implementation notes explaining how the service is implemented in different environments, and one or more reference implementations. The five services are described briefly below:

1. Globus communication services (the Nexus communication library) provide unicast and multicast message delivery services, designed to permit the efficient implementation of a wide range of communication methods.

2. Globus information services (the Metacomputing Directory Service) provide a uniform mechanism for obtaining real-time information about system structure and status.

3. Globus resource location and allocation services provide mechanisms for expressing application resource requirements, for identifying resources that meet these requirements, for scheduling resources once they have been located, and for initiating and managing computation on these resources.

4. Globus data access services provide high-speed remote access to persistent storage such as files.

5. Globus security services provide authentication and authorization services.