Tomasz Haupt, Erol Akarsu, Geoffrey Fox, Wojtek Furmanski
Northeast Parallel Architectures Center at Syracuse University
Abstract
Programming tools that are simultaneously sustainable,
highly functional, robust and easy to use have been hard to come by in
the HPCC arena. This is partially due to the difficulty in developing sophisticated
customized systems for what is a relatively small part of the worldwide
computing enterprise. Thus we have developed a new strategy, termed HPcc
(High Performance Commodity Computing) [1], which builds HPCC programming
tools on top of the remarkable new software infrastructure being built
for the commercial Web and distributed object areas.
This leverage of a huge industry investment naturally delivers tools with
the desired properties with the one (albeit critical) exception that high
performance is not guaranteed. Our approach automatically gives the user
access to the full range of commercial capabilities (e.g. databases and
compute servers), pervasive access from all platforms and natural incremental
enhancement as the industry software juggernaut continues to deliver software
systems of rapidly increasing power.
We add high performance to commodity systems using a multi-tier architecture,
with traditional HPCC technologies such as MPI and HPF supported as the
backend tier behind a middle tier of commodity Web and object servers.
Figure 1. Top view of the WebFlow design.
Our research addresses the need for high level programming environments and
tools to support distance computing on heterogeneous distributed commodity
platforms and high-speed networks, spanning labs and facilities.
More specifically, we are developing WebFlow - a scalable, high level,
commodity standards based HPDC system that integrates (cf. Fig. 1) a visual
front end, a middle tier of commodity Web and object servers, and a high
performance backend.
Note that this approach can be applied to either parallel or
metacomputing applications, and provides a uniform, cross-platform, high
level computing environment.
The visual HPDC framework introduced by this project
offers an intuitive Web browser based interface and a uniform point of
interactive control for a variety of computational modules and applications,
running at various labs on different platforms and networks. New applications
can be composed dynamically from reusable components just by clicking on
visual module icons, dragging them into the active WebFlow editor area,
and linking by drawing the required connection lines. The modules are executed
using Globus[2] optimized components combined with the pervasive commodity
services where native high performance versions are not available. For
instance, today one links Globus controlled MPI programs to WebFlow (Java
connected) Windows NT and database executables. Once Globus is extended
to full PC support, the default WebFlow implementation can be replaced by
the high performance code.
Individual modules are typically represented by visualization, control,
steering or collaboration applets, and the system also offers visual monitoring,
debugging and administration of the whole distributed application and
the underlying metacomputing resources. In the future, WebFlow will offer
tools for easy conversion of existing (sequential, parallel or distributed)
applications to visual modules via suitable CORBA[3], COM[4] or JavaBeans[5]
based wrapper/proxy techniques.
New applications created within the WebFlow framework follow a natural
modular design, in which one accumulates a comprehensive problem domain
specific module library in the first phase of a project. Then one would
explore the computational challenges of the project in a visual interactive
mode, trying to compose the optimal solution of a problem in a sequence
of on-the-fly trial applications. The scripting capabilities of WebFlow,
coupled with database support for session journaling, will facilitate playback
and reconstruction of optimal designs discovered during such rapid prototyping
sessions.
For the parallel object and module developers, we will also provide finer
grain visual and scripted parallel software development tools using the
new Unified Modeling Language (UML) [UML], recently accepted as an OMG standard.
UML offers a spectrum of diagrammatic techniques that allow one to address
various stages of the software process and several hierarchy layers of a complex
software system. In this way, WebFlow will combine the features of UML
based visual tools such as Rational Rose with both high performance and
the proven value of data flow based visual programming environments such
as Khoros and AVS.
Our technology goal is to build a high level user friendly commodity software
based visual programming and runtime environment for HPDC. We believe that
such an ambitious and generic framework can be successfully built only
when closely related to some specific large scale application domains which
can provide specification requirements, testbeds and user feedback during
the whole course of the system design, prototyping, development, testing
and deployment. We view the NCSA Alliance and the DoD modernization programs
as attractive application environments for HPcc due to their unique, clear
missions and advanced computational challenges, opportunities and requirements.
WebFlow is a particular programming paradigm implemented over a virtual
Web-accessible metacomputer and given by a dataflow programming model (other
models under experimentation include data parallel, collaboratory, and
televirtual paradigms). A WebFlow application is given by a computational
graph, visually edited by end users using Java applets. Modules are written
by module developers, people who have only limited knowledge of the system
on which the modules will run. They need not concern themselves with issues
such as: allocating and running the modules on various machines; creating
connections among the modules; sending and receiving data across these
connections; or running several modules concurrently on one machine. The
WebFlow system hides these management and coordination functions from the
developers, allowing them to concentrate on the modules being developed.
In our approach we adopt an integrative methodology,
i.e., we set up a multiple-standards based framework in which the best assets
of various approaches accumulate and cooperate rather than compete. We
start the design from the middleware, which offers the core or `bus' of
modern three-tier systems, and we adopt Java as the most efficient implementation
language for the complex control required by the multi-server middleware.
Atomic encapsulation units of WebFlow computations are called modules, and
they communicate by sending objects along channels attached to modules.
Modules can be dynamically created, connected, scheduled, run, relocated
and destroyed.
Our prototype WebFlow system [WebFlow] is given
by a mesh of Java-enhanced Web servers [Apache], running servlets that manage
and coordinate distributed computation. This management is currently implemented
in terms of three servlets: the Session Manager, Module Manager, and Connection
Manager. These servlets are URL addressable and can offer dynamic information
about their services and current state. They can also communicate with one
another through sockets. Servlets are persistent and application
independent.
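To make this concrete, the following minimal sketch shows what a URL-addressable
manager servlet of this kind might look like, assuming the standard Java Servlet
API. The class name, command names, and state kept here are our illustrative
assumptions, not the actual NPAC implementation.

    // Sketch of a URL-addressable WebFlow manager servlet (hypothetical names).
    import java.io.IOException;
    import java.io.PrintWriter;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class SessionManagerServlet extends HttpServlet {

        // Servlets are persistent: state kept here survives across requests.
        private final java.util.Map applications = new java.util.Hashtable();

        // Requests arrive as ordinary HTTP, so the manager is URL addressable,
        // e.g. http://host/servlet/SessionManager?cmd=status
        public void doGet(HttpServletRequest req, HttpServletResponse res)
                throws ServletException, IOException {
            String cmd = req.getParameter("cmd");
            PrintWriter out = res.getWriter();
            if ("status".equals(cmd)) {
                // Dynamic information about the servlet's current state.
                out.println("active applications: " + applications.size());
            } else if ("createModule".equals(cmd)) {
                // Would forward the request to the Module Manager on the
                // target host (socket-based manager-to-manager traffic omitted).
                out.println("module creation request accepted");
            } else {
                out.println("unknown command: " + cmd);
            }
            out.close();
        }
    }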
Session Manager. The Session Manager is the part of the system in charge
of accepting user commands from the front end, and executing them by sending
requests to the rest of the system. The user requests that the Session
Manager honors are creating a new module, connecting two ports, running
the application, and destroying the application. Since the Session Manager
and the front end generally reside on separate machines, the Session Manager
keeps a representation of the application that the user is building, much
like the representation stored in the front end. The difference between
these two representations is that the Session Manager needs to worry about
the machines on which each of the modules has been started, while the front
end worries about the position of the representation of the module on the
screen. The Session Manager acts like a server for the front end, but uses
the services of the Module and Connection Managers. All of the requests
received from the user are satisfied by a series of requests to the Module
and Connection Managers which store the actual modules and ports.
Module Manager. The Module Manager is in charge of running modules on demand.
When creation of a module is requested, that request is sent to the Module
Manager residing on the particular machine on which the module should be
run. The Module Manager creates a separate thread for the module (thus
enabling concurrent execution of multiple modules), and loads the module
code, making the module ready for execution. Upon receipt of a request
for running a module, the Module Manager simply calls a run method which
each module is required to have. That method is written by the module developer,
and implements the module's functionality. Upon receipt of a request for
destroying a module, the Module Manager first stops the thread of execution
of the module, then calls the special destroy method. The destroy method
is again written by the module developer, and it performs all the clean-up
operations deemed necessary by the developer.
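The thread-per-module lifecycle described above can be sketched as follows.
The interface and class names are illustrative, and Thread.interrupt is used
here as a safe stand-in for whatever mechanism the real Module Manager uses
to stop a module's thread of execution.

    // Sketch of the Module Manager's lifecycle handling (hypothetical names).
    public class ModuleManagerSketch {

        // The contract each module is required to fulfil.
        public interface Module {
            void run();      // written by the module developer
            void destroy();  // clean-up written by the module developer
        }

        private final java.util.Map threads = new java.util.Hashtable();

        // Each module gets its own thread, enabling concurrent execution
        // of multiple modules within one Module Manager.
        public void runModule(String id, final Module m) {
            Thread t = new Thread(new Runnable() {
                public void run() { m.run(); }
            });
            threads.put(id, t);
            t.start();
        }

        // Destruction first stops the thread, then calls destroy().
        public void destroyModule(String id, Module m) throws InterruptedException {
            Thread t = (Thread) threads.remove(id);
            if (t != null) {
                t.interrupt();   // ask the module thread to stop
                t.join(1000);    // wait briefly for it to finish
            }
            m.destroy();         // developer-supplied clean-up
        }
    }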
Connection Manager. The Connection Manager is in charge of establishing
connections between modules, or more precisely, between input and output
ports of the modules. As the modules can be executed on different machines,
the Connection Manager is capable of creating connections across the network.
In such a case it serves as a client to the peer Connection Manager on
the remote WebFlow server. The handshaking between the Managers follows
a custom protocol.
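As an illustration only, a cross-network connection request between peer
Connection Managers might look like the sketch below. The host, port, and
one-line handshake message are placeholders for the custom protocol mentioned
above, whose details are not specified here.

    // Sketch of a Connection Manager acting as a client to a remote peer.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.PrintWriter;
    import java.net.Socket;

    public class ConnectionManagerClient {
        public static void main(String[] args) throws Exception {
            // Hypothetical peer Connection Manager on a remote WebFlow server.
            Socket peer = new Socket("remote.webflow.host", 9000);
            PrintWriter out = new PrintWriter(peer.getOutputStream(), true);
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(peer.getInputStream()));

            // Handshake: request a connection between two module ports.
            out.println("CONNECT sourceModule.outPort -> targetModule.inPort");
            String reply = in.readLine();   // e.g. "OK" from the peer manager
            System.out.println("peer replied: " + reply);
            peer.close();
        }
    }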
Although our prototype implementation of WebFlow proved to be very successful,
we are not satisfied with this largely custom solution. Pursuing
HPcc goals, we would prefer to base our implementation on the emerging
standards for distributed objects, and to take full advantage of the
leverage realized by employing commercial technologies. Our research led
us to the following observations.
While the "Java Platform" or "100% Pure Java" philosophy
is being advocated by Sun Microsystems, industry consortium led by the
OMG pursues a multi-language approach built around the CORBA model. It
has been recently observed that Java and CORBA technologies form a perfect
match as two complementary enabling technologies for distributed system
engineering. In such a hybrid approach, referred to as Object Web [ObjectWeb],
CORBA is offering the base language-independent model for distributed objects
and Java offers a language-specific implementation engine for the CORBA
brokers and servers.
Meanwhile, other total solution candidates for distributed objects/components
are emerging such as DCOM by Microsoft or WOM (Web Object Model) by the
World-Wide Web Consortium. However, standards in this area and interoperability
patterns between various approaches are still in the early formation stage.
A closer inspection of the distributed object/component standard candidates
indicates that, while each of the approaches claims to offer the complete
solution, each of them in fact excels only in specific selected aspects
of the required master framework. Indeed, it seems that WOM is the easiest,
DCOM the fastest, pure Java the most elegant and CORBA the most realistic
complete solution.
Consequently, we plan to adopt CORBA as the base distributed object model
at the Intranet level, and the Web as the world-wide distributed object
model. System scalability requires fuzzy, transparent boundaries between
Intranet and Internet domains, which translates into the requirement of
integrating the CORBA and Web technologies. We implement it by building
JWORB [JWORB], a multi-protocol extensible server written in Java.
The base server has HTTP and IIOP protocol support. It can serve documents
as an HTTP server and handle IIOP connections as an Object Request
Broker. As an HTTP server, JWORB supports base Web page services, Servlet
(Java Servlet API) and CGI 1.1 mechanisms. In its CORBA capacity, JWORB
is currently offering the base remote method invocation services via CDR
based IIOP and we are now implementing the Interface Repository, Portable
Object Adapter and selected Common Object Services.
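The multi-protocol dispatch idea behind JWORB can be sketched as follows:
peek at the first bytes of a new connection and route it to the HTTP or IIOP
handler. GIOP messages begin with the 4-byte magic "GIOP"; anything else is
treated as HTTP here. The handler classes named in the comments are our
hypothetical placeholders, not JWORB's actual internals.

    // Sketch of protocol detection in a multi-protocol (HTTP/IIOP) server.
    import java.io.PushbackInputStream;
    import java.net.ServerSocket;
    import java.net.Socket;

    public class MultiProtocolServer {
        public static void main(String[] args) throws Exception {
            ServerSocket server = new ServerSocket(8080);
            while (true) {
                Socket s = server.accept();
                PushbackInputStream in =
                        new PushbackInputStream(s.getInputStream(), 4);
                byte[] magic = new byte[4];
                int n = in.read(magic, 0, 4);
                if (n > 0) in.unread(magic, 0, n);  // put bytes back for the handler

                if (n == 4 && magic[0] == 'G' && magic[1] == 'I'
                           && magic[2] == 'O' && magic[3] == 'P') {
                    System.out.println("IIOP request -> ORB handler");
                    // new IiopHandler(s, in).handle();  // hypothetical
                } else {
                    System.out.println("HTTP request -> Web server handler");
                    // new HttpHandler(s, in).handle();  // hypothetical
                }
                s.close();
            }
        }
    }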
The new JWORB based WebFlow will address the integration of the CORBA
component, DCOM component, and Enterprise JavaBeans models. Armed with
a Java-based CORBA platform such as JWORB, soon to be augmented by a CORBA/COM
bridge, we will be able to freely experiment with and mix and match all
these component standard candidates.
The WebFlow Applet is the front end of the system.
Through it, the users can request new modules to be initialized, their
ports connected, the whole application run, and finally destroyed.
Figure 2. WebFlow Front End Applet
The WebFlow editor provides an intuitive environment to visually compose
(click-drag-and-drop) a chain of data-flow computations from preexisting
modules. In the edit mode, modules can be added to or removed from the
existing network, and connections between the modules can be updated.
Once created, a network can be saved (on the server side) to be restored at
a later time. The workload can be distributed among several WebFlow nodes
(WebFlow servers), with the interprocessor communications taken care of
by the middle-tier services. Moreover, thanks to the interface to the Globus
system in the backend, execution of particular modules can be delegated
to powerful HPCC systems. In the run mode, the visual representation of
the metaapplication is passed to the middle tier by sending a series of
requests (module instantiation, intermodule communications) to the Session
Manager.
Control of module execution is exercised not only by sending relevant
data through the module's input ports. The majority of the modules we have
developed so far require some additional parameters that can be entered via
"Module Controls" (similar to systems such as AVS). The module controls
are Java applets displayed in a card panel of the main WebFlow applet.
The communication channels (sockets) between the backend implementation
of a module and its front-end Module Controls are generated automatically
during the instantiation of the module.
Not all applications follow the data flow paradigm closely. Therefore it
is necessary to define an interface so that different frontend packages
can be plugged into the middle tier, giving the user a chance
to use the front end that best fits the application at hand. Currently we
offer visual editors based on GEF [GEF] and VGJ [VGJ]. In the future, we
will add an editor based on the UML [UML] standard, and we will provide
an API for creating custom ones.
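One way such a pluggable front-end API could look is sketched below: a minimal
interface capturing the requests every editor must be able to send to the
Session Manager. This is our illustration of the idea, not the published
WebFlow API; all names are hypothetical.

    // Hypothetical contract between a pluggable editor and the middle tier.
    public interface WebFlowFrontEnd {
        // Ask the middle tier to instantiate a module on a given host;
        // returns a module identifier.
        String createModule(String host, String moduleClass);

        // Connect an output port to an input port, possibly across hosts.
        void connectPorts(String sourceModule, String outPort,
                          String targetModule, String inPort);

        // Start and tear down the composed application.
        void runApplication();
        void destroyApplication();
    }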
When designing WebFlow we assumed that the most important feature of
the frontend should be the capability to dynamically create many different
networks of modules tailored to the application needs. However, working
with real users and real applications we found out that this assumption
is not always valid. WebFlow can be used just as a development tool, taking
advantage of our graphical authoring tools to create the application (or
a suite of applications). Once created, the same application (i.e., network
of modules) is to be run by the end user, without any changes, over and over
again with different input sets. In such a case the design of the
frontend should be totally different. The expected functionality is to
provide an environment that allows the user to navigate to and choose the
right application and the right data to solve the problem at hand, while any
technical nuances of the application are hidden from the end user.
Another important feature of our design is that we introduce a platform
independent, Web accessible frontend to a high performance metacomputing
environment. Given access to the Internet, the user can create and execute
his or her application using adequate computational resources anywhere and
anytime, even from a laptop personal computer. It is the responsibility
of the middle-tier to identify and allocate resources, and to provide access
to the data.
In the current implementation of WebFlow we ignored the issues of security.
Again, in agreement with our HPcc strategy, we closely watch the development
of industry standards. At this time, the SSL suite of protocols is
clearly the dominant technology for authorization, mutual authentication
and encryption mechanisms. The most recent release of Globus implements
SSL-based security features. In order to access Globus high performance
computational resources, the user must produce an encrypted certificate
digitally signed by the Globus certificate authority, and in return, the
Globus side (more precisely, the GRAM gatekeeper) presents its own certificate
to the user. This mutual authentication is necessary for the exchange of
encrypted messages between the two parties. However, the authorization to
use the resources is granted by the administrators of the system that owns
the resources, and not by Globus. We are experimenting with a similar
implementation for WebFlow.
The module API is very simple. The module implements
a specific WebFlow Java interface, metamodule. In practice, the module
developer has to implement three methods: initialize, run, and destroy.
The initialize method registers the module and its ports with the
Session Manager, and establishes communication between the module and its
frontend applet, the module controls. The run method implements the desired
functionality of the module, while the destroy method performs clean-up
after the processing is completed. In particular, the destroy method closes
all socket connections, which are never reclaimed by the Java garbage collector.
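A toy module against this three-method contract might look as follows. The
interface spells out the methods named in the text; the exact signatures,
the port registration step, and the control-applet socket address are our
assumptions for the sake of a self-contained sketch.

    // The three-method module contract described in the text (sketch).
    interface Metamodule {
        void initialize();  // register module and ports, open control channel
        void run();         // the module's actual functionality
        void destroy();     // clean-up, including closing sockets
    }

    // A toy module implementing the contract (hypothetical details).
    public class EchoModule implements Metamodule {

        private java.net.Socket controls;  // channel to the Module Controls applet

        public void initialize() {
            // In the real system this would also register the module and its
            // ports with the Session Manager; here we only open the control
            // channel, on an assumed local port.
            try {
                controls = new java.net.Socket("localhost", 9001);
            } catch (java.io.IOException e) {
                throw new RuntimeException("cannot reach module controls");
            }
        }

        public void run() {
            System.out.println("EchoModule processing data...");
        }

        public void destroy() {
            // Sockets are never reclaimed by the garbage collector alone,
            // so the destroy method must close them explicitly.
            try {
                if (controls != null) controls.close();
            } catch (java.io.IOException e) { /* ignore on shutdown */ }
        }
    }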
It follows that development of WebFlow modules in Java is straightforward.
Given the availability of more and more Java APIs, such as JDBC,
this makes it possible to create quite powerful, portable applications in
pure Java. To convert existing applications written in languages other than
Java, the Java native interface can be used. Indeed, at Supercomputing
`97 in San Jose, California, we demonstrated an HPF application run under
the control of WebFlow [SC97]. Using the Netscape browser on a desktop machine
we were able not only to launch the application on a remote parallel computer,
but also to extract data at runtime, using the DARP system [DARP] converted
to a WebFlow module, and to process the data with WebFlow modules written
in Java running on the local machine.
For more complex metaapplications, a more sophisticated backend solution
is needed. As usual, we go for a commodity solution. Since commercial solutions
are practically nonexistent, in this case we use technology that comes
from the academic environment: the Globus metacomputing toolkit. The
Globus toolkit provides all the functionality we need. The underlying technology
is a high performance communication library, Nexus. MDS (Metacomputing
Directory Service) allows for resource identification, while GRAM
(Globus Resource Allocation Manager) provides secure mechanisms for allocating
and scheduling resources. The GASS package (Global Access to Secondary
Storage) implements high performance, secure data transfer, augmented
by the RIO (Remote Input/Output) library, which provides access to parallel
data file systems.
WebFlow interacts with Globus via the GRAM gatekeeper. A dedicated WebFlow
module serves as a proxy of the gatekeeper client, which in turn sends
requests to GRAM. Currently, the proxy is implemented using the Java native
interface. However, in collaboration with the Globus development team,
we are working on a pure Java implementation of the gatekeeper client.
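The Java-native-interface proxy pattern mentioned above can be sketched as a
Java class declaring native methods implemented in C on top of the GRAM client
library. The method names, signatures, and library name below are hypothetical
illustrations, not the actual Globus or WebFlow API; only the "&(attribute=value)"
shape of the RSL job specification follows Globus conventions.

    // Sketch of a JNI proxy for the GRAM gatekeeper client (hypothetical API).
    public class GramGatekeeperProxy {

        static {
            // Loads the C wrapper library (e.g. libgramproxy.so), which in
            // turn links against the native GRAM client code.
            System.loadLibrary("gramproxy");
        }

        // Submit a job request to the GRAM gatekeeper on a remote resource.
        public native int submitJob(String gatekeeperContact, String rslSpec);

        // Cancel a previously submitted job.
        public native int cancelJob(int jobId);

        public static void main(String[] args) {
            // Requires the native library at run time; shown for illustration.
            GramGatekeeperProxy proxy = new GramGatekeeperProxy();
            int job = proxy.submitJob("hpc.example.edu",
                    "&(executable=/bin/hostname)(count=4)");
            System.out.println("submitted job " + job);
        }
    }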
At this time GASS supports only the Globus native x-gass protocol, which
restricts its use to systems where Globus is installed. We expect support
for other protocols, notably ftp, soon. This will make it possible to use
the secure GASS mechanisms for data transfer to and from systems outside
the Globus domain, under the control of WebFlow. We also collaborate with
the Globus development team to build support for other protocols, HTTP and
LDAP. In particular, support for HTTP will allow data filtering on-the-fly,
as the URL given to GASS may point not to the data directly, but to a servlet
or CGI script instead.
As a test application for WebFlow we selected
Quantum Simulations [QS]. This application can be characterized as follows
(cf. Fig. 3). A chain of high performance applications (both commercial
packages, such as GAUSSIAN or GAMESS, and custom developed codes) is run
repeatedly for different data sets. Each application can be run on several
different (multiprocessor) platforms, and consequently, input and output
files must be moved between machines. Output files are visually inspected
by the researcher; if necessary, applications are rerun with modified input
parameters. The output file of one application in the chain is the input of
the next one, after a suitable format conversion.
Figure 3. WebFlow application: Quantum Simulations
This example metaapplication demonstrates the strength of our WebFlow approach.
The WebFlow editor provides an intuitive environment to visually compose
the chain of data-flow computations from preexisting modules. The modules
encapsulate many different classes of applications: from massively parallel
codes, to custom developed auxiliary programs, to commodity commercial ones
(such as DBMS or visualization packages). The seamless integration of such
heterogeneous software components is achieved by employing distributed object
technologies in the middle tier. The high performance part of the backend
tier is implemented using the Globus toolkit. In particular, we use MDS
(Metacomputing Directory Service) to identify resources, GRAM (Globus
Resource Allocation Manager) to allocate resources, including mutual,
SSL-based authentication, and GASS (Global Access to Secondary Storage)
for high performance data transfer. The high performance part of the backend
is augmented with a commodity DBMS (servicing the Persistent Object Manager)
and an LDAP-based custom directory service to maintain geographically
distributed data files generated by the Quantum Simulation project.
We have just started using WebFlow for another type of application, within
a DoD modernization project. Here the focus is on providing uniform access,
via a Web-based interface, to specific simulation codes. We will use Globus
to allocate remote resources and to transfer data. The project involves
development of a custom frontend that allows the user to import data sets
coming from various sources and in different formats (satellite images, GIS,
databases, to name a few), as well as to generate new data sets visually.
To summarize, we have developed a platform independent,
three-tier system: the visual authoring tools implemented in the front
end, integrated with a middle tier network of servers based on industry
standards and following the distributed object paradigm, facilitate seamless
integration of commodity software components. In particular, we use
WebFlow as a high level, visual user interface for Globus. This not only
makes construction of a metaapplication a much easier task for the end user,
but also allows this state-of-the-art HPCC environment to be combined with
commercial software, including packages available only on Intel-based
personal computers.