Given by Mark Baker, Geoffrey Fox at Tutorial for CRPC MRA Meeting at Cornell on May 7 1996. Foils prepared May 7 1996
Outside Index
Summary of Material
A Brief History of Scientific Computing |
- Performance of Supercomputers and Networks |
Some Terminology |
The Need for Supercomputing - A Rationale for Metacomputing |
- Some Examples |
- Current Access to Resources |
- Need for Alternatives |
- Computer Architectures |
- Why Use Parallel Computing Techniques |
Parallel versus Distributed Computing |
- Pros and Cons |
- Why Use Distributed Computing Techniques |
- Examples of Communications Performance |
The Challenge |
Infrastructure and Technology |
Features of Distributed Systems
|
The Problem... |
The Reality |
Some Relevant Parallel Programming Languages - Legion - TreadMarks - Linda - HPF - MPI - PVM - JAVA |
Outside Index Summary of Material
Mark Baker and Geoffrey Fox |
Northeast Parallel Architectures Center |
Syracuse University |
111 College Place |
Syracuse, NY 13244-4100, USA |
tel: +1 (315) 443 2083 |
fax: +1 (315) 443 1973 |
email: mab@npac.syr.edu |
URL: http://www.npac.syr.edu/ |
Overview |
A Brief History of Scientific Computing |
- Performance of Supercomputers and Networks |
Some Terminology (1 - 5) |
The Need for Supercomputing - A Rationale for Metacomputing |
- Some Examples |
- Current Access to Resources |
- Need for Alternatives |
- Computer Architectures |
- Why Use Parallel Computing Techniques |
Parallel versus Distributed Computing |
- Pros and Cons |
- Why Use Distributed Computing Techniques |
- Examples of Communications Performance |
The Challenge |
Infrastructure and Technology |
Features of Distributed Systems
|
The Problem... |
The Reality |
Software |
Software (Contd)
|
Late 1970's and early 1980's |
Departmental Systems - VAX's come into use - Time Shared Resources - Relatively high initial capital and maintenance costs - Skilled system administrator |
Early 1980's to Mid 1980's |
Unix workstations became available - and affordable - Limited CPU, memory and disk resources - Skilled system administrator |
Mid 1980's to late 1980's |
Networking - more commonly - Ethernet - Distributed file systems and servers |
Late 1980's to early 1990's |
Desktop workstations start to proliferate - CPU/Memory/Disk more affordable - Parallel Languages - P4, PVM, Linda, etc. - Hippi and FDDI networks |
Early 1990's to Mid 1990's
|
Further Advances in the 1990's
|
Bandwidth - The communications capacity (measured in bits per |
second) of a transmission line or of a specific path through the |
network. |
Clustered Computing - An environment consisting of many |
workstations connected together by a local area network and can |
be generalised to a heterogeneous collection of machines with |
arbitrary architecture and interconnect. |
High-Performance Distributed Computing (HPDC) - The use of |
distributed networked computers to achieve high performance on a |
single problem, i.e., the computers are co-ordinated and |
synchronised to achieve a common goal. |
LAN, MAN, WAN - Local, Metropolitan, and Wide Area Networks can |
be made from any or many of the different physical network media, |
and run the different protocols. LAN's are typically confined to |
departments (less than a kilometre), MAN's to distances of order 10 |
kilometres, and WAN's can extend worldwide. |
Metacomputer - This term describes a collection of heterogeneous |
computers networked by a high-speed wide area network. Such an |
environment would recognise the strengths of each machine in the |
Metacomputer, and use it accordingly to efficiently solve so-called |
Metaproblems. The World Wide Web has the potential to be a |
physical realisation of a Metacomputer. |
Metaproblem - Term describing a class of problem which is outside the |
scope of a single computer architecture, but is instead best run on a |
Metacomputer. These problems consist of many constituent subproblems. |
Ex. the design and manufacture of a modern aircraft, which presents |
problems in geometry and grid generation, fluid flow, acoustics, structural |
analysis, operational research, visualisation, and database management. |
The Metacomputer for such a Metaproblem would be networked |
workstations, array processors, vector supercomputers, MPP, and |
visualisation engines. |
Protocol - A set of conventions and implementation methodologies defining |
the communication between nodes on a network. Ex. seven layer OSI |
standard model going from physical link (optical fiber to satellite) to |
application layer (such as Fortran subroutine calls). |
Supercomputer - Most powerful computer that is available at any given |
time. As performance is roughly proportional to cost, this is not very |
well defined for a scalable parallel computer. Traditionally, computers |
costing some $10 - $30 M are termed supercomputers. |
Symmetric Multiprocessor (SMP) - A Symmetric Multiprocessor |
supports a shared memory programming model -- typically with a UMA |
memory system, and a collection of up to 32 nodes connected with a |
bus. |
Transmission Control Protocol (TCP) - A connection-oriented |
transport protocol used in the DARPA Internet. TCP provides for the |
reliable transfer of data, as well as the out-of-band indication of urgent |
data. |
WWW Clients and Servers - A distributed set of clients (requesters |
and receivers of services) and servers (receiving and satisfying |
requests from clients) using Web Technologies. |
World Wide Web and associated Technologies - A very important |
software model for accessing information on the Internet based on |
hyperlinks supported by Web technologies, such as HTTP, HTML, |
MIME, Java, Applets, and VRML. |
Scientists - Verify or disprove things and get their work done faster... |
Engineers - Simulate things (Car, drugs, aeroplanes...) in ever greater detail before producing expensive prototypes or production systems |
Retailers - Examine sales records to tailor their marketing and production needs to the most lucrative market. |
Airlines - Optimise the carriage of passengers on their aircraft - fill flights, understand route-trends, mine data for valuable information. |
Financiers - Make vast profits by capitalising on the intelligent probabilistic prediction of stock market shares faster than their competitors. |
Consumers - Access to movies, downloaded to the home. |
Current Access to Supercomputing Resources |
1. Community of users for FREE Supercomputing far exceeds the |
available resources. |
2. Computational needs of these users far exceed the |
Supercomputing resources available now or in the near future. |
1. Vast numbers of under-utilised workstations available for use. |
2. Huge numbers of unused processor cycles and resources that |
could be put to good use in a wide variety of application areas. |
3. Reluctance to buy Supercomputers due to their cost and short life |
span. |
4. Distributed compute resources "fit" better into today's funding |
model. |
Architectures of Supercomputers |
Single processor - past/traditional |
Vector |
Multiple processor Shared Memory |
Multiple processors |
Combinations - modern/innovative |
Main Architectures are Either Shared or Distributed Memory |
Vary greatly in complexity: |
small machine with a handful of processors on one circuit board. |
larger machines might have many thousands of complex processors connected by a number of specialised communications networks contained in many cabinets. |
Basically, any parallel computer consists of three main elements: |
Processors |
Memories |
Interconnection network to enable communication between these elements. |
Vary greatly in complexity: |
A typical way to classify these architectures is with Flynn's taxonomy, which labels architectures according to instruction stream and data stream. |
For example, an idealised serial computer would be labeled Single instruction Single Data (SISD) as it executes one instruction at a time on a single piece of data. |
The SIMD architecture consists of: |
Many (typically simple) processors, with some local memory. |
Executing the same instruction in lockstep, on a small piece of data in its local memory, with the instructions issued by the controller processor. |
Such architectures are good for applying algorithms which require the same operation on a large array of elements, however, they suffer badly if the problem results in load imbalances as the processors synchronise after every step. This is not usually a problem for data parallel programs. |
Examples - AMT DAP - Maspar |
MIMD architectures consist of a number of powerful processors which can each execute individual instruction streams. The usual subdivision of this class is by the relationship between the processor and memory. |
1. MIMD - Shared Memory |
2. MIMD - Distributed Memory |
3. MIMD - Virtual Shared Memory |
All processors associated with the same shared memory structure access the exact same storage location |
Synchronisation is achieved by controlling tasks' reading from and writing to the shared memory. |
A shared memory location must not be changed by one task while another, concurrent task is accessing it. |
Data sharing among tasks is fast (speed of memory access). |
An attractive feature of shared memory is that the time to communicate among the tasks is effectively a single fixed value: the time it takes a single task to read a single location. |
Disadvantage: scalability is limited by number of access pathways to memory. |
If there are more tasks than connections to memory, you have contention for access to the desired locations, and this amounts to increased latencies while all tasks obtain the required values. So the degree to which you can effectively scale a shared memory system is limited by the characteristics of the communication network coupling the processors to the memory units. |
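As an illustration of this synchronisation requirement (a minimal sketch, not from the original foils, assuming POSIX threads on a single shared-memory machine), a mutex ensures that no task changes a shared location while another concurrent task is accessing it: |
#include <stdio.h> |
#include <pthread.h> |
#define NTASKS 4 |
static long counter = 0;                         /* the shared memory location          */ |
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER; |
/* Each task increments the shared counter; the mutex guarantees that no task        */ |
/* changes the location while another concurrent task is accessing it.               */ |
static void *task(void *arg) |
{ |
    int i; |
    for (i = 0; i < 100000; i++) { |
        pthread_mutex_lock(&lock); |
        counter++;                               /* critical section                    */ |
        pthread_mutex_unlock(&lock); |
    } |
    return NULL; |
} |
int main(void) |
{ |
    pthread_t t[NTASKS]; |
    int i; |
    for (i = 0; i < NTASKS; i++) |
        pthread_create(&t[i], NULL, task, NULL); |
    for (i = 0; i < NTASKS; i++) |
        pthread_join(t[i], NULL); |
    printf("counter = %ld\n", counter);          /* expect NTASKS * 100000              */ |
    return 0; |
} |
(Compiled with a thread library, e.g. cc sum.c -lpthread; without the mutex the updates could interleave and the final count would be unpredictable.) |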
Examples - SM can be either via hardware - SMP's or emulated through software. |
Software - HPF (various sources), Craft (Cray), Legion (Virginia), TreadMarks (Rice). |
Hardware - Cray -C90, SGI - power series, Convex - Exemplar. |
Memory is physically distributed among processors; each local memory is directly accessible only by its processor. |
Similar to buying an ordinary workstation, each component of a distributed memory parallel system is, in most cases, a self-contained environment, capable of acting independently of all other processors in the system. |
To achieve the true benefits of this system, there must be a way for all of the processors to act in concert, which means "control" |
Synchronisation is achieved by moving data between processors (communication). |
The only link among these distributed processors is the traffic along the communications network that couples them; therefore, any "control" must take the form of data moving along that network to the processors. |
A Major Concern |
A major concern is data decomposition -- how to divide arrays among local CPUs to minimise communication |
Here is a major distinction between shared- and distributed-memory: |
In SM the processors don't need to worry about communicating with their peers, only with the central memory, while in the DM there really isn't anything but the processors. |
A single large regular data structure, such as an array, can be left intact within SM, and each co-operating processor is simply told which ranges of indices it is to deal with; |
In DM, once the decision as to index-ranges has been made, the data structure has to be decomposed, i.e., the data within a given set of ranges assigned to a particular processor must be physically sent to that processor in order for the processing to be done, and then any results must be sent back to whichever processor has responsibility for co-ordinating the final result. |
And, to make matters even more interesting, it's very common in these types of cases for the boundary values, the values along each "outer" side of each section, to be relevant to the processor which shares that boundary. |
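A minimal sketch (not from the original foils) of this index-range decision for a one-dimensional block decomposition; the function name and the printed ranges are illustrative only, and the comments note where the boundary ("halo") values would have to be exchanged: |
#include <stdio.h> |
/* Compute the index range [lo, hi) owned by processor `rank` when an                */ |
/* N-element array is block-decomposed over `nprocs` processors.                     */ |
static void block_range(int N, int nprocs, int rank, int *lo, int *hi) |
{ |
    int chunk = N / nprocs; |
    int rem   = N % nprocs;            /* the first `rem` ranks get one extra element */ |
    *lo = rank * chunk + (rank < rem ? rank : rem); |
    *hi = *lo + chunk + (rank < rem ? 1 : 0); |
} |
int main(void) |
{ |
    int N = 10, nprocs = 4, rank, lo, hi; |
    for (rank = 0; rank < nprocs; rank++) { |
        block_range(N, nprocs, rank, &lo, &hi); |
        /* In a distributed-memory code, elements lo-1 and hi (the boundary or       */ |
        /* "halo" values) live on neighbouring processors and must be communicated   */ |
        /* before each step.                                                          */ |
        printf("rank %d owns [%d, %d)\n", rank, lo, hi); |
    } |
    return 0; |
} |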
Distributed memory is, for all intents and purposes, virtually synonymous with message-passing, although the actual characteristics of the particular communication schemes used by different systems may hide that fact. |
Message-passing approach: |
Tasks communicate by sending data packets to each other. |
Messages are discrete units of information and can be distinguished from all other messages (theory). |
Parallel tasks use these messages to send information, and requests for the same, to their peers. |
Message-passing approach: |
The overhead is proportional to size and number of packets (more communication means greater costs; sending data is slower than accessing shared memory.). |
Message-passing is not cheap: every one of those messages has to be individually constructed, addressed, sent, delivered, and read, all before the information it contains can be acted upon. |
In the general case, message-passing will take more time and effort than shared-memory. |
Message-passing approach: |
Shared memory scales less well than message passing, and, once past its maximum effective bandwidth utilisation, the latency associated with message-passing may actually be lower than that encountered on an over-extended shared memory communications network. |
There are ways to decrease the overhead associated with message-passing, the most significant being to somehow arrange to do as much valuable computation as possible while communication is occurring. |
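One minimal sketch of this overlap, written with the non-blocking calls of MPI (a library covered later in these foils); the ring exchange and the dummy work loop are purely illustrative assumptions: |
#include <stdio.h> |
#include "mpi.h" |
int main(int argc, char *argv[]) |
{ |
    int rank, size, left, right, i; |
    double mine, from_left, local = 0.0; |
    MPI_Request reqs[2]; |
    MPI_Status  stats[2]; |
    MPI_Init(&argc, &argv); |
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); |
    MPI_Comm_size(MPI_COMM_WORLD, &size); |
    left  = (rank - 1 + size) % size;            /* neighbours in a ring                */ |
    right = (rank + 1) % size; |
    mine  = (double) rank; |
    /* Post the communication first ...                                                 */ |
    MPI_Irecv(&from_left, 1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]); |
    MPI_Isend(&mine,      1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]); |
    /* ... do useful local computation while the messages are in flight ...             */ |
    for (i = 0; i < 1000000; i++) |
        local += 1.0e-6; |
    /* ... and wait only when the remote data is actually needed.                       */ |
    MPI_Waitall(2, reqs, stats); |
    printf("rank %d received %g from rank %d (local work = %g)\n", rank, from_left, left, local); |
    MPI_Finalize(); |
    return 0; |
} |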
hypercube machines (e.g., nCube) |
Hypercube architectures typically utilise off-the-shelf processors coupled via proprietary networks (both hardware and message-passing software); |
IBM SPx |
Utilising standard RS6000-type processors coupled via a very high speed, high bandwidth switch networking architecture, the nodes within an SPx system use a special very low-level message-passing library for the lowest-level (i.e., most efficient, fastest) mechanism for communicating with one another. |
An important class of these hybrids is the so-called virtual shared memory architecture. |
Each processor has its own local memory (like a distributed machine), however, direct remote memory access (like a shared memory machine) is possible via a global address space. |
Relies on special hardware or software to deal with the communication, while the processors continue computing. |
Such a system benefits from fast communications and a scalable architecture. |
Why use parallel computing techniques ? |
Limitations of traditional Supercomputers
|
Existing and emerging network technologies are making the practical use of distributed computing resources on a range of user applications a reality. |
Parallel/Distributed computing is the natural |
evolution of the techniques and methods that will |
fulfil our computational needs in the future. |
Parallel Computing |
Communication must be high bandwidth and low latency. |
Low flexibility (and overhead) in messages (point-to-point). |
Distributed Computing |
Communication can be high or low bandwidth. |
Latency typically high ---> can be very flexible messages involving fault tolerance, sophisticated routing, etc. |
Parallel Computing should, on technical grounds, define standards as it is more restrictive (demanding)? However, it is the smallest field. |
Distributed Computing |
World Wide Web revolutionising both parallel and distributed approaches and standards |
Why use Distributed Computing Techniques ? |
Expense of buying, maintaining and using traditional MPP systems. |
Rapid increase in commodity processor performance. |
Commodity networking technology (ATM/FCS/SCI) of greater than 200 Mbps at present with expected Gbps performance in the very near future. |
The pervasive nature of workstations in academia and industry. |
Price/Performance of using existing hardware/software |
Comms1 - From ParkBench Suite |
To fully utilise a heterogeneous computing environment where different types of processing elements and inter-connection technologies are effectively and efficiently used. |
The use of distributed resources in this framework is known as Metacomputing and such an environment has the potential to maximise performance and cost effectiveness of a wide range of scientific and distributed applications. |
Features of Distributed Systems |
The Pros and Cons of distributed systems - examine the following |
areas: |
- Performance - Distributed File System - User Name Space - Service Availability - Scheduling Capability - System Management - Security. |
Decreased turnaround time |
- Time-critical applications... - Size of workload increased - More effective use of users' time o Short turnaround - rapid iteration/refinement o Visualisation and data analysis |
Solving Larger problems than would fit on a single system |
- Easing of memory restrictions - out-of-core calculations... - Improved cache utilisation |
Utilise Pooled Resources more effectively |
- "fat" and "thin" systems - MPP/Vector/... |
Better fault tolerance |
- Replication of data - Replication of service |
Ideally a single environment-wide namespace is needed for all files |
- Provides high levels of security - Reduces the replication of data and programs by users - Simplifies needs for tape-type media - Simplifies backup and archive - Identical access paths from any system - Reduces data-movement by user - Simplifies distributed applications - Common environment customised for individual systems |
Single version of password entry across all systems
|
Reference Password File
|
Crossing Administrative Boundaries
|
High Availability
|
System independent resource naming access mechanisms
|
Name service to maintain resource information
|
Single job acceptance system for all resources (CMS)
|
Single Environment for execution
|
Replicated Scheduler (CMS)
|
Interface to Security
|
Authentication
|
Access Control
|
Auditing
|
Accounting
|
High Initial and Maintenance Costs
|
Applications Development
|
General
|
The net result is that there is not a universally used or mandated environment with which to implement, manage and run a Metacomputer at this time. |
At present there are only relatively small integrated local systems. These are customised for local circumstances and conditions. |
To create the infrastructure for a Metacomputer you would ideally like to call up your local computer vendor and buy a package that does everything that you want! |
The reality is that you will need to: |
- Buy/acquire the hardware. |
- Buy/acquire/develop or create the necessary software. |
- Integrate the hardware and software into a coherent system. |
- Edit/debug/tune/optimise your application before you can run it. |
- Support, maintain and "port" as time goes by. |
Sequential - Traditional codes - "dusty deck" FORTRAN, COBOL or C++ etc., typically "single threaded". |
Data Parallel - Synchronous: Tightly coupled, software needs to exploit features of problem structure to get good performance. Comparatively easy as different data elements are essentially identical. |
Data Parallel - Loosely Synchronous: As above but data elements are not identical. Still parallelises due to macroscopic time synchronisation. |
Asynchronous - Functional (or data) parallelism that is irregular in space and time. Often loosely coupled and so need not worry about optimal decompositions to minimise communication. Hard to parallelise (massively) unless .... |
Embarrassingly parallel - Essentially independent execution of disconnected components. |
Metaproblems - Asynchronous collection of (loosely) synchronous components where these programs themselves can be parallelised (coarse grain task parallelism - each data component data parallel). |
Simple, but general and extensible to many more nodes is domain decomposition |
Most successful concurrent machines with many nodes have obtained good performance from "Data Parallelism" or "Domain Decomposition" |
Problem is an algorithm applied to data set
|
The three architectures considered here differ as follows:
|
Metaproblems are of growing importance in general HPCC community |
One important Example is: |
Manufacturing and Design including Multi-disciplinary Optimisation which combines many fields to look at total product design, manufacturability etc.
|
Example of an application that has been designed to use the NII from the outset. |
ASOP consortium is made up of:
|
Aims to examine the problems of inter-disciplinary collaborative design and manufacture. |
Motivation is the complexity of designing modern aircraft |
- Hundreds of contractors- Thousands of engineers- Use of thousands of computer programs |
Web technology is proposed to bind the management of large projects |
- Provide uniform interface to all computer platforms- Allow the management of remote executions |
For example: |
- A database engine on an enterprise server |
- A CFD calculation on an MPP |
- CAD/CAM visualisation |
Problem - Machine |
Synchronous - SIMD, MIMD |
Loosely Synchronous - MIMD |
Asynchronous - Unclear |
Metaproblems - Heterogeneous network |
Embarrassingly Parallel - NOW, SIMD, MIMD |
Information Production |
PVM, MPI, Linda, HPF, TreadMarks, Legion, JAVA |
Parallel / Distributed Computing Runtime Tools - debuggers, profilers, and performance visualisation |
Image processing software - AVS and Extensions |
Parallel Operating Systems
|
Information Analysis, Access and Integration |
Parallel (Relational) Database e.g. Oracle 7.0 |
Object database |
High Speed Networks |
Multilevel Mass Storage |
Integration Software ("glue") |
Integration of Parallel and Distributed Computing |
Compression |
Parallel Rendering |
Linkage Analysis (between records of database) |
Sorting (large databases), etc... |
ATM networks have rapidly transitioned from research Gigabit networks to commercial deployment
|
Computer Hardware trends imply that all computers (PC's -> supercomputers) will be parallel by the year 2000 |
- Up to 1993, parallel computers came from small start-up companies (except the Intel Supercomputer Division) |
-- Now Cray, Convex (HP), Digital and IBM have massively parallel computing systems and Silicon |
Graphics is becoming a powerful force among high performance computing vendors
|
Software is the challenge and could prevent/delay the hardware trend that suggests parallelism will be a mainline computer architecture
|
Ideally, one should be able to use any Programming Paradigm. |
The Programming Spectrum:
|
Legion (University of Virginia) |
TreadMarks (Rice University - Tx) |
Network Linda (Yale University) |
HPF |
MPI |
PVM (ORNL/UTK) |
Java (Sun) |
The Legion project is an attempt to design and build system services that provide the illusion of a single virtual machine. |
Legion targets wide-area assemblies of workstations and supercomputers |
It aims to provide - |
- Shared-object and shared-name spaces, - Application adjustable fault-tolerance, - Improved response time, - Greater throughput, - Wide-area network support, - Management and exploitation of heterogeneity, - Protection, - Security, - Scheduling, - Resource management, - Parallel processing- Object inter-operability. |
Legion is an object-oriented system - designed around C++ |
The principles of the object-oriented paradigm are the foundation for the construction of Legion; the following features are exploited: |
Underlying theme: one cannot design a system that will satisfy every |
user's needs, so one must design an extensible system. |
For example: |
Security - Trade-off between security and performance (due to the cost of authentication, encryption, etc.). Rather than providing a fixed level of security, users may choose their own trade-offs by implementing their own policies or by using existing policies via inheritance. |
Fault-Tolerance - Users select the level of fault-tolerance needed, and pay for only what they use. By allowing the user to implement their own services or to inherit them from library classes, flexibility is provided, complemented by a menu of existing choices. |
Campus Wide Virtual Computer (CWVC) |
The Campus Wide Virtual Computer is a heterogeneous distributed computing system built on top of Mentat, an object-oriented parallel processing system. |
The CWVC is used to demonstrate the benefits of a Legion-like system, providing departments across the UVa campus with an interface to high performance distributed computing. |
The CWVC allows researchers at the University to share resources and to develop applications that will be usable in a true Legion setting. |
Organisations |
The CWVC is currently used by, and based on resources owned by, |
the following organisations: |
- The Department of Computer Science, UVa |
- Information Technology and Communication, UVa |
- The School of Medicine, UVa |
- The Department of Electrical Engineering, UVa |
- The Department of Engineering Physics, UVa |
- NASA Langley Research Center, Hampton, Va |
Hardware Resources |
CWVC contains more than 100 workstations of varying types, including: |
- IBM RS6000 workstations, and a small 8 node SP-2- Sun Microsystems SparcStations, including several small multiprocessor workstations- Hewlett Packard PA-RISC workstations- Silicon Graphics Indy, Indigo, and Onyx workstations, including a two processor Onyx |
Mixture of network technologies, including ATM, FDDI, and Ethernet. |
The CWVC is an integrated interface to heterogeneous campus-wide distributed computing. It provides a set of tools that facilitate application development, debugging, and resource management. |
These tools include: |
Parallel C++ : The Mentat Programming Language |
Mentat is an object-oriented parallel processing system designed to directly address the difficulty of developing architecture- independent parallel programs. |
The fundamental objectives of Mentat are to |
(1) Provide easy-to-use parallelism, |
(2) Achieve high performance via parallel execution |
(3) Facilitate the execution of applications across a wide range of platforms. |
The Mentat approach exploits the object-oriented paradigm to provide high-level abstractions that mask the complex aspects of parallel programming, including communication, synchronisation, and scheduling, from the programmer. Instead of managing these details, the programmer concentrates on the application. |
The programmer uses application domain knowledge to specify those object classes that are of sufficient computational complexity to warrant parallel execution. |
Federated File System |
Objects that execute on hosts in the CWVC are presented with a single unified file system abstraction called the "Federated File System" (FFS). |
The FFS is necessary because the hosts made available by participating departments have their own local file systems, NFS mount structures, etc., so files visible on one host may not be available on another. |
The FFS allows objects to view a single, unified file name space, and thus execute in a location independent manner. |
In the CWVC FFS, files are simply objects which persist. |
Library routines to access FFS files are provided for C, C++, |
the Mentat Programming Language (MPL) and Fortran; these interfaces are similar to the Unix C standard library file system interface; therefore, utilisation of the CWVC-FFS typically requires little change to existing code. |
Thermostat |
Thermostat provides a GUI to manage resources. |
It allows a resource owner to schedule the times of day and the days of the week that hosts will be available for CWVC use. |
It also allows the resource owner to specify the percent of the CPU time and memory that may be used by the CWVC during individual available time slots. |
Resource Accounting Services |
Resource utilisation is logged on a per-user basis, while resource availability (controlled via the Thermostat) is logged on a per machine basis. |
Usage and machine availability "credits" are scaled based on the computing power of the hosts involved - a minute of time on an SP-2 node is clearly worth more than a minute on a Sun IPC. A report generation tool is provided to extract and summarise usage statistics. |
MAD |
MAD is a set of tools that enables programmers to debug their Mentat/CWVC applications using the debugger of their choice. |
MAD supports post-mortem debugging: CWVC programs run to completion, or until an error occurs, and are only examined after the fact. |
MAD is based on the record and replay technique, and consists of two phases. |
In the recording phase, objects of a CWVC application log their incoming messages to a file. |
In the playback phase, objects can reproduce their original execution by extracting the appropriate messages from the log file. |
The programmer can then replay a specific object, i.e. reproduce its execution, under the control of a sequential debugger, and can use the traditional cyclic debugging technique. |
Data Parallel Computation Scheduler |
Prophet is an automatic run-time scheduling system for SPMD computations designed for heterogeneous workstation networks. |
Prophet solves three problems associated with scheduling data parallel computations in this environment:
|
Data domain decomposition decomposes the data domain to provide processor load balance. |
Placement assigns tasks to processors to limit communication overhead. All of these factors contribute to reduced completion time. |
Prophet has been integrated into the Mentat-Legion system. |
Scientific Applications |
Genome Library Comparison |
Atmospheric Simulation |
Automatic Test Pattern Generation for Integrated Circuits |
The Quest group in the University of Virginia department of Electrical Engineering has developed a parallel Automatic Test Pattern Generation (ATPG) application. |
URL http://www.cs.virginia.edu/~legion |
TreadMarks is a software Distributed Shared Memory system built at Rice University. |
It is an efficient user-level DSM system that runs on commonly available Unix systems. |
URL http://www.cs.rice.edu/~willy/TreadMarks/Overview.html |
TreadMarks provides primitives similar to those used in hardware shared memory machines. |
Application processes synchronise via two primitives: barriers and mutex locks. |
Barriers |
The routine Tmk_barrier(i) stalls the calling process until all processes in the system have arrived at the same barrier. |
Locks |
Locks are used to control access to critical sections. |
The routine Tmk_lock_acquire(i) acquires a lock for the calling processor, and the routine Tmk_lock_release(i) releases it. |
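A minimal sketch (not from the TreadMarks documentation) of how these primitives might be combined: each process adds its id into a shared counter inside a lock-protected critical section; the shared-allocation calls Tmk_malloc and Tmk_distribute are assumptions not described in these foils. |
#include <stdio.h> |
#include "Tmk.h" |
int *counter;                           /* pointer into shared (DSM) memory             */ |
int main(int argc, char **argv) |
{ |
    Tmk_startup(argc, argv); |
    if (Tmk_proc_id == 0) { |
        counter = (int *) Tmk_malloc(sizeof(int));           /* assumed shared allocator */ |
        *counter = 0; |
        Tmk_distribute((char *) &counter, sizeof(counter));  /* assumed pointer broadcast */ |
    } |
    Tmk_barrier(0);                     /* wait until the shared counter exists          */ |
    Tmk_lock_acquire(0);                /* critical section: one process at a time       */ |
    *counter += Tmk_proc_id; |
    Tmk_lock_release(0); |
    Tmk_barrier(0); |
    if (Tmk_proc_id == 0) |
        printf("sum of process ids = %d\n", *counter); |
    Tmk_exit(0); |
    return 0; |
} |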
TreadMarks uses a lazy invalidate version of release consistency and a multiple-writer protocol to reduce the amount of communication involved in implementing the SM abstraction. |
The virtual memory hardware is used to detect accesses to shared memory. |
#include <stdio.h> |
#include <unistd.h> |
#include <errno.h> |
#include <sys/signal.h> |
#include "Tmk.h" |
void main(int argc, char **argv) |
{ |
int c ; |
while ((c = getopt(argc, argv, "")) != -1) ; |
Tmk_startup(argc, argv); |
printf("hello from process %d\n", Tmk_proc_id) ; |
/* All processes wait for keyboard input on local host. */ |
if(Tmk_proc_id == 0) |
do { |
c = getchar() ; |
} while (c < 0 && errno == EINTR); |
Tmk_barrier(0); |
Tmk_exit(0); |
} |
It is easier to program using TreadMarks than using MP. |
Although there is little difference in programmability for simple programs, for those with complicated communication patterns, such as 3-D FFT, it takes more effort to figure out what to send and whom to send it to. |
Results show that, with the use of release consistency and the multiple-writer protocol, TreadMarks performs comparably with MP on a variety of problems. |
The separation of synchronisation and data transfer, and the request-response nature of data communication in TreadMarks, are responsible for lower performance for all the TreadMarks programs. |
In MP, data communication and synchronisation are integrated together. The send and receive operations not only exchange data, but also regulate the progress of the processors. |
In TreadMarks, synchronisation is through locks/barriers, which do not communicate data. Moreover, data movement is triggered by expensive page faults, and a diff request is sent out in order to get the modifications. |
In addition, MP benefits from the ability to aggregate scattered data in a single message, an access pattern that would result in several miss messages in the invalidate-based TreadMarks protocol. |
Although the multiple-writer protocol addresses the problem of simultaneous writes to the same page, false sharing still affects the performance of TreadMarks. |
Overview |
- What is Linda? - The Linda Model - Master/Worker Model using Virtual Shared Memory |
Linda Basics |
- Definitions - Operations - Templates - Template Matching Rules |
URL http://www.cs.yale.edu/HTML/YALE/CS/Linda/linda.html |
What is Linda? |
Parallel programming language based on C (C-Linda) and FORTRAN (Fortran-Linda) |
Combines co-ordination language of Linda with programming languages of C and FORTRAN |
Enables users to create parallel programs that perform on wide range of computing platforms |
Ease of use |
Based on logically global, associative object memory called tuple space. |
Tuple space provides inter-process communication and synchronisation logically independent of the underlying computer or network |
Implements parallelism with a small number of simple operations on tuple space to create and co-ordinate parallel processes |
Commercially available - Scientific Computing Associates Inc. |
Virtual Shared Memory |
Different parts of the data can reside on different processors. |
Looks like one single global memory space to component processes. |
Linda's Virtual Shared Memory is known as tuple space |
Can be used to implement many different types of algorithms |
Lends itself well to master/worker distributed data structure algorithms |
Tasks and workers are independent of each other |
Master divides work into discrete tasks and puts into global space |
Workers repeatedly retrieve tasks and put results back into global space |
Workers notified of work completion by having met some condition, receiving an end-signal or terminated by some other means |
Master gathers results from global space |
Possible ways that tasks can be distributed:
|
Tuple Space |
Linda's name for its shared data space. Tuple space contains tuples. |
Tuples |
The fundamental data structure of tuple space. |
Tuples are represented by a list of up to 16 fields, separated by commas and enclosed in parentheses. |
Examples: - ('arraydata', dim1, 13, 2) - (var1, var2) |
Associative Memory Model |
A tuple is accessed by specifying its contents. |
From the programmer's point of view, there is no address associated with a tuple. |
Operations - There are four basic operations |
Tuple Generation |
out |
- Generates a data (passive) tuple - Each field is evaluated and put into tuple space - Control is then returned to the invoking program |
eval |
- Generates a process (active) tuple - Control is immediately returned to the invoking program - Logically, each field is evaluated concurrently by a separate process and then placed into tuple space |
Tuple Extraction |
in |
- Uses a template to retrieve tuple from tuple space. - Once retrieved, it is taken out of tuple space and no longer available for other retrievals. - If no matching tuple is found process will block.- Provides for synchronisation between processes. |
rd |
- Uses a template to copy data without removing it from tuple space. - Once read it is still available for others. - If no matching tuple is found process will block. |
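For example (an illustrative sketch, not from the original foils), a tuple deposited with out can later be retrieved purely by matching its contents; fields prefixed with ? are formals that receive the matched values: |
void real_main() |
{ |
    int dim1 = 64, d, n; |
    out("arraydata", dim1, 13, 2);   /* deposit a four-field tuple in tuple space     */ |
    in("arraydata", ?d, 13, ?n);     /* match on the string, the literal 13 and the   */ |
                                     /* field count; d receives 64 and n receives 2   */ |
} |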
Name of program has an .fl or .cl extension |
Top level routine must be named real_main but has the same argc and argv parameters as the C main does - C |
Top level routine must be a parameter-less subroutine named real_main - Fortran |
Bulk of program is pure FORTRAN/C |
Linda operations, eval, in, out |
Program is asynchronous - no guarantee that any particular process will execute before any other |
#define NNODES 4 |
#include <stdio.h> |
int node(int taskId) { |
printf("Hello from task %d.\n", taskId) ; |
return 0 ; |
} |
void real_main() |
{ |
int i ; |
for (i = 0 ; i < NNODES ; i++) |
eval("worker", node(i)) ; |
} |
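Building on the example above, the following minimal sketch (illustrative only, not from the original foils) shows the master/worker model described earlier: real_main deposits task tuples, workers repeatedly withdraw them with in and return results with out, and a -1 task acts as the end-signal. |
#include <stdio.h> |
#define NWORKERS 4 |
#define NTASKS   16 |
int worker() |
{ |
    int id, result; |
    while (1) { |
        in("task", ?id);                /* block until a task tuple is available       */ |
        if (id < 0) break;              /* -1 is the end-signal: no more work          */ |
        result = id * id;               /* stand-in for the real computation           */ |
        out("result", id, result);      /* return the answer to tuple space            */ |
    } |
    return 0; |
} |
void real_main() |
{ |
    int i, id, result, sum = 0; |
    for (i = 0; i < NWORKERS; i++)      /* create the worker processes                 */ |
        eval("worker", worker()); |
    for (i = 0; i < NTASKS; i++)        /* deposit the tasks into tuple space          */ |
        out("task", i); |
    for (i = 0; i < NTASKS; i++) {      /* gather the results in any order             */ |
        in("result", ?id, ?result); |
        sum += result; |
    } |
    for (i = 0; i < NWORKERS; i++)      /* tell each worker to stop                    */ |
        out("task", -1); |
    printf("sum of squares = %d\n", sum); |
} |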
The discussions on an HPF standard were started at SC'91 where DEC had organised a discussion meeting for interested parties. |
This meeting led to the formation of the HPF Forum (HPFF). |
The first meeting of the HPFF was held in Houston, Texas, in Jan `92 and was attended by 130 people. |
Detailed work began in March when a working group of 40 people started fleshing out a standard with an intention to finish it by Jan `93. A further 8 meetings were held during `92 leading to the publication of the HPF Specification v1.0 in May `93. |
The working group - members from industry, universities and (US) Labs |
HPFF goals were to define a language that offers: |
Data parallel programming (Single threaded, global name space, loosely synchronous parallel computation). |
Top performance with SIMD and MIMD machines with non-uniform memory access costs. |
Code tuning for various architectures. |
These goals were addressed in the form of new intrinsics and first class language constructs. |
Subsidiary goals were also established, including: |
- Portability of existing code - Efficient portability of new code - Maintained compatibility with existing standards (particularly Fortran 90), - Simplicity and ease of implementation - Open interface to other languages or programming styles. - Availability of compilers in the near future |
HPF is based upon the Fortran 90 programming language - Fortran 90 includes F77 as a subset and provides many new features including: |
- Array operations - Improved facilities for numeric computation - User defined data types - New control constructs - Facilities for modular data and procedure definition - New source form - Optional procedure arguments and recursion - Dynamic storage allocation - Pointers |
HPF adds the following features to those in Fortran 90. |
Support for controlling the alignment and distribution of data on a parallel machine. |
New data parallel constructs. |
An extended set of intrinsic functions and standard library providing much useful functionality at a high level of abstraction. |
EXTRINSIC procedures which standardise the interface with the other languages or styles |
Directives to address some sequence and storage association issues. |
In principle, a sequential algorithm is portable to any architecture supporting the sequential paradigm. |
However, programmers require more than this: they want their realisation of the algorithm in the form of a particular program to be portable - source-code portability. |
The same is true for message-passing programs and forms the motivation behind MPI. |
URL http://www.mcs.anl.gov/MPI |
MPI provides source-code portability of message-passing programs written in C or Fortran across a variety of architectures. |
Just as for the sequential case, this has many benefits, including protecting investment in a program, allowing development of the code on one architecture (e.g. a network of workstations) before running it on the target machine (e.g. fast specialist parallel hardware) |
Although the basic concept of processes communicating by sending messages to one another has been understood for a number of years, it is only relatively recently that message-passing systems have been developed which allow source-code portability. |
MPI was the first effort to produce a message-passing interface standard across the whole parallel processing community. |
Sixty people representing forty different organisations -- users and vendors of parallel systems from both the US and Europe -- collectively formed the "MPI Forum". |
The discussion was open to the whole community and was led by a working group with in-depth experience of the use and design of message-passing systems (including PVM, PARMACS, and P4). |
The two-year process of proposals, meetings and review resulted in a document specifying a standard Message Passing Interface (MPI). |
To provide source-code portability |
To allow efficient implementation across a range of architectures |
It also offers: |
A great deal of functionality |
Support for heterogeneous parallel architectures |
Deliberately outside the scope of MPI is any explicit support for:
|
#include <stdio.h> |
#include "mpi.h" |
int main(int argc, char* argv []) { |
int rank ; |
MPI_Init(&argc, &argv) ; |
MPI_Comm_rank(MPI_COMM_WORLD, &rank) ; |
printf("hello from processor %d\n", rank) ; |
MPI_Finalize() ; |
} |
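The example above starts processes but passes no messages; a minimal point-to-point sketch (an illustrative assumption, requiring at least two processes) using MPI_Send and MPI_Recv might look like this: |
#include <stdio.h> |
#include "mpi.h" |
int main(int argc, char* argv []) { |
    int rank, value ; |
    MPI_Status status ; |
    MPI_Init(&argc, &argv) ; |
    MPI_Comm_rank(MPI_COMM_WORLD, &rank) ; |
    if (rank == 0) { |
        value = 42 ;                                             /* data to send       */ |
        MPI_Send(&value, 1, MPI_INT, 1, 99, MPI_COMM_WORLD) ;   /* to rank 1, tag 99  */ |
    } else if (rank == 1) { |
        MPI_Recv(&value, 1, MPI_INT, 0, 99, MPI_COMM_WORLD, &status) ; |
        printf("rank 1 received %d from rank 0\n", value) ; |
    } |
    MPI_Finalize() ; |
    return 0 ; |
} |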
PVM - What is it? |
- Background Information - Advantages/Disadvantages - Components o Deamon o Libraries |
URL http://www.epm.ornl.gov/pvm/ |
PVM (Parallel Virtual Machine) is a software system to link clusters of machines. |
It enables a heterogeneous collection of computers (the virtual machine) to be programmed as a single machine. |
It provides the user with process control and message passing calls for communication between tasks running on different hosts. |
The individual computers in the virtual machine may be shared- or local-memory multiprocessors, vector supercomputers, specialised graphics engines, or scalar workstations that may be interconnected by a variety of networks, such as ethernet or FDDI. |
User programs written in C, C++ or Fortran access PVM through library routines. |
The PVM system provides a set of user interface routines that may be incorporated into existing C or Fortran programs. |
Routines exist for process control, message transmission and reception, broadcasting, synchronisation via barriers, mutual exclusion, shared memory, dynamic process groups and some forms of error detection and recovery. |
Processes may be initiated synchronously or asynchronously, and may be conditioned upon the initiation or termination of another process, or upon the availability of data values. |
Message transmission is governed by routines that ensure data is transmitted in a machine independent form. |
Overall, these routines allow for parallel implementation of the most appropriate computational model and for division into tasks that can be assigned to the most appropriate computing resources. |
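As an illustration of these routines, here is a minimal sketch modelled on the classic PVM 'hello' pair of programs (the executable name hello_other and the message text are illustrative assumptions): a master task spawns a slave, which packs a greeting and sends it back. |
/* hello.c - the master task                                                          */ |
#include <stdio.h> |
#include "pvm3.h" |
int main() |
{ |
    int tid ;                                   /* task id of the spawned slave       */ |
    char buf[100] ; |
    pvm_mytid() ;                               /* enrol this process in PVM          */ |
    if (pvm_spawn("hello_other", (char**)0, PvmTaskDefault, "", 1, &tid) == 1) { |
        pvm_recv(tid, 1) ;                      /* wait for the reply, tag 1          */ |
        pvm_upkstr(buf) ;                       /* unpack the greeting string         */ |
        printf("from t%x: %s\n", tid, buf) ; |
    } else |
        printf("can't start hello_other\n") ; |
    pvm_exit() ;                                /* leave PVM before exiting           */ |
    return 0 ; |
} |
/* hello_other.c - the slave task                                                     */ |
#include <string.h> |
#include "pvm3.h" |
int main() |
{ |
    int ptid = pvm_parent() ;                   /* tid of the task that spawned us    */ |
    char buf[100] ; |
    strcpy(buf, "hello from a PVM slave") ; |
    pvm_initsend(PvmDataDefault) ;              /* XDR encoding by default            */ |
    pvm_pkstr(buf) ; |
    pvm_send(ptid, 1) ;                         /* send back to the parent, tag 1     */ |
    pvm_exit() ; |
    return 0 ; |
} |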
The first version of PVM was written during Summer of `89 at Oak Ridge National Laboratory. It was not released, but was used in applications at the lab. |
Version 2 was written from scratch during Feb `91 at the University of Tennessee, Knoxville, and released in March of that year. It was intended to clean up and stabilise the system so that external users could benefit. |
Version 3 was redesigned from scratch, and a complete rewrite started in Sep `92, with first release of the software in March `93. While similar in spirit to version 2, version 3 includes features that did not fit the old framework - most importantly fault tolerance, better portability and scalability. |
Advantages |
It is fairly small, with a limited user routine library, yet it is sufficient to implement any parallel programming scheme. |
There is a large and growing user base. |
There are tools (XPVM) for debugging, performance analysis and visualisation. |
Also, continued upgrades are expected in the areas of visualisation, fault tolerance, multimedia applications and performance optimisation for specific architectures. |
Advantages |
Portability: Probably the most portable message passing library available. |
Promises of distributed computing for the future's computing requirements |
Scalable Parallelism |
Easy to install and use |
Public domain software |
Popular - probably the most widely used parallel MP library |
Local, wide-area or combination of networks |
Flexible |
Advantages |
Easy definition and modification of your parallel virtual machine |
Arbitrary control and dependency structures - your application "decides": |
- where and when tasks are spawned and terminated |
- which machines to add or delete from your parallel virtual machine |
- how tasks communicate and/or synchronise with each other. |
Disadvantages |
Performance: Even with recent advancements, PVM may be slower than other message passing languages |
There is no real PVM standard |
It is deficient in some of the message passing functionality. |
Limited fault tolerance. The user must monitor the processes and hosts to ensure that no hosts have gone down during execution. In this event, the user must reset the virtual machine and restart the processes. |
Disadvantages |
Network latencies can cause problems in applications that have a high communication to computation ratio. |
There is as yet no automation of the compilation of different object modules on different architectures, though this may be added in the future. |
Components are the parts of your Parallel Virtual Machine responsible for things such as: |
- Communication - Process control - The programming interface for the user. |
There are two - the PVM Daemon and the PVM Libraries. |
The PVM daemon pvmd3 is a Unix process which oversees the operation of user processes within a PVM application and co-ordinates inter-machine PVM communications. |
One daemon runs on each machine configured into your parallel virtual machine; each host has its own pvmd3 running. |
The "master" PVM daemon is the first PVM daemon started and is responsible for starting all other PVM daemons in your parallel virtual machine. Typically started on your login machine. |
Each daemon maintains a table of configuration and process information relative to your parallel virtual machine. |
User processes communicate with each other through the daemons:
|
The PVM daemon software must reside on every machine which you will be using:
|
The three PVM libraries are:
|
Contain simple calls that the application programmer embeds in concurrent or parallel application code. Provide the ability to:
|
Library routines do not directly communicate to other processes. Instead, they send commands to the local daemon and receive status information back. |
Can be installed in user filespace - default location is $HOME/pvm3/lib. |
External Data Representation (XDR) format conversion is performed automatically between hosts of different architectures. This is the default. |
HotJava is a Web browser that supports dynamically downloadable interactive content. |
Arbitrarily sophisticated dynamic multimedia application inserts called Applets can be embedded in regular HTML pages and activated on each exposure of a given page. |
Applet constructs are implemented in terms of a special HTML tag: |
<APPLET codebase="URL directory path" code="Java class file name" width=".." height=".." > |
Where the URL and class file name point to a chunk of server-side software that is downloaded and executed at the client side on each presentation of a page containing this applet; the applet executes in a window whose size is specified by width and height in pixels. |
Applets are written in Java - a new general purpose object-oriented programming language from Sun Microsystems. |
Started in 1991 by Project Green --- a group in Sun focused on operating software for consumer electronic devices such as smart set-top boxes. Gosling (creator of Sun NeWS, which had a major conceptual impact on both the current Java and Telescript models) realised that C++ was not adequate and initiated development of a new language, Oak, later renamed Java. |
A PDA (Personal Digital Assistant -- codename *7) based on Oak/Java ready in 1993. Green Team incorporates as FirstPerson, Inc. *7 proposal to Time-Warner rejected in 1993. 3DO deal falls through in 1994. FirstPerson, Inc. dissolves. |
Small group (~30 people, now Java Team) continues development and decides to adapt Oak as a Web technology. |
The document 'Java: A White Paper' by Sun Microsystems -- October 1995 draft by Gosling & McGilton -- lists the key features of Java:
|
Java omits several rarely used, poorly understood and confusing features of C++ including operator overloading, multiple inheritance, pointers and automatic type coercions. |
It adds automatic garbage collection which makes dynamic programming easier in Java than in C or C++. |
No more mallocs! |
It also adds the 'Interface' construct, similar to the Objective C concept, which often compensates for the lack of multiple inheritance by allowing method calling syntax to be "inherited". |
The resulting language is familiar as it looks like C++ but is simpler and hence easier to program in. |
It also results in a much smaller kernel which is suitable for planned Java ports to consumer electronic devices. The base (alpha) interpreter is ~40Kb; libraries and threads add an additional 175Kb. |
Java model can be viewed as a C++ subset, with some dynamic elements inherited from Objective-C (method overloading, garbage collection). |
Structures, Unions and Functions are absorbed into data and methods of Java classes -- Java is Simple! |
The strength of the Java object-oriented model is not in its sophistication but in its simplicity and the extensive class library associated with the system (some 250 public classes were released in both alpha and beta). |
The Java class also plays the role of a communication atom in the Web embedding model. Applet classes identify themselves by names in the HTML applet tag. |
The applet downloads other classes present in the applet source. |
Hence, the Java class names play the role of addressing mode for the distributed Java code database. |
C/C++ programming in a heterogeneous network environment requires use and compatibility across several vendor platforms and the corresponding compilers. This problem is solved in Java by designing platform-independent binary representation called Java bytecode (or opcode). |
The Java compiler (written in Java and platform-independent) reads Java source and produces Java bytecode. |
These bytecodes are shipped to client machines upon browser requests. |
Each client machine must run a Java interpreter which performs runtime execution of Java bytecodes. |
Java interpreter is written in POSIX compliant ANSI C and needs to be ported to and conventionally compiled (once) on each individual platform. |
Once the interpreter is ported, application developers don't need to worry at all about platform specificity and differences between native compilers. |
They write portable Java which is compiled by platform independent Java compiler and executed on Java interpreter, ported to various platforms. |
Java language offers a uniform abstract (virtual) machine model which is identical for all platforms. SUN owns the Java Virtual Machine (see online report) -- it is universal while classes can be added by any user. |
Unlike in C/C++, where various integers match the architecture of the physical machine at hand, Java byte, char, short, int and long are always of the same size: 8, 16 (Unicode), 16, 32 and 64 bits, respectively. |
No header files, preprocessors, #define etc. |
Floating point is always IEEE 754 |
Differences between vendor specific windowing environments (X Windows, MS Windows, Macintosh) are removed in terms of the Abstract Windowing Toolkit (AWT) metaphor. |
AWT is given by ~60 Java classes (alpha) which offer a universal GUI programming model, portable between UNIX, PC and Mac, and translated automatically to native windowing systems on individual platforms by Java interpreters. |
Popular TCP/IP based protocols such as FTP or HTTP are supported in terms of network protocol classes. |
This facilitates various forms of distributed processing. New protocols (e.g. PVM etc.) can be added and dynamically installed. |
Distributed computing model of Java is mainly client-server, with Java compiler preparing the opcodes at the server side, and Java interpreter executing it at the client side. |
One can expect more dynamic uses of Java with Java threads on both Server and Client side communicating with each other. |
Java binaries are shipped across the network and executed on client machines. Security is therefore a critical issue and strongly enforced in Java. |
Java contains its own networking classes which are designed to be secure |
Modifications of the C++ model such as eliminating pointer arithmetic and coercion were dictated mainly by the security requirements. |
Most viruses are based on acquiring access to private/protected sectors of computer memory which is impossible in Java. |
Java opcodes are executed at the client side by Java interpreter which operates exclusively on the virtual memory. Hence, unless there are security bugs in the Java interpreter itself, the model is safe and users cannot create security holes by incorrectly or maliciously written applets. |
The byte codes sent across network are verified at the client which prevents evil/corrupted classes from causing problems. |
Java model offers preemptive multithreading, implemented in terms of the Thread class. |
Thread methods offer a set of synchronization primitives based on the monitor and condition variable paradigm by C.A.R. Hoare. |
Java threads inherit some features from the pioneering Cedar/Mesa System by Xerox PARC that gave birth to Macintosh and object-oriented programming. |
A typical use of Java multithreading in applet programming is to have several independent but related simulations (e.g. various sorting algorithms), running concurrently in an applet window. |
Multithreading is also used internally by the HotJava browser to handle multiple document dynamics. |
Another interesting application domain is multi-HotJava environments to come, such as collaboratories or gaming. |
Java threads don't have built-in point-to-point communication primitives. Various thread communication environments can be provided by coupling the thread and network protocol objects. |
Source code of a Java program consists of one or more compilation units, implemented as files with .java extension. |
Each compilation unit can contain: |
- A package statement |
- Import statements |
- Class declarations |
- Interface declarations |
Java compiler (called javac) reads java source and produces a set of binary bytecode files with .class extensions, one for each class declared in the source file. For example, if Foo.java implements Foo and Fred classes, then javac Foo.java will generate Foo.class and Fred.class files. |
Suppose that Foo implements an applet and Fred is an auxiliary class used by Foo. If HotJava/Netscape encounters a tag <APPLET code="Foo.class">, it will download Foo.class and Fred.class files and it will start interpreting bytecodes in Foo.class. |
This prints the classic Hello World string in the applet window! |
import java.awt.Graphics; |
public class HelloWorld extends java.applet.Applet |
{ |
public void init() { |
} |
public void paint(Graphics g) { |
g.drawString("Hello world!", 50, 25); |
} |
} |
class HelloWorldApp { |
public static void main (String args[]) { |
System.out.println("Hello World!"); |
} |
} |
Note that args are of class String and are needed even though HelloWorldApp has no command line arguments