Parallel Machines

Two theories of parallel execution: "Two heads are better than one" and "Too many cooks spoil the soup." Both can happen in practice. Keeping all the processors busy and accessing local memory means a good collaboration; synchronization and accessing nonlocal memory mean the chefs are fighting over the pots.

Generalities about modern parallel processors:
- Individual processors are reasonably fast
- Local memory access costs one to ten cycles
- Nonlocal access costs tens to hundreds of cycles
- Synchronization costs hundreds to thousands of cycles

Parallel computers allow several CPUs to contribute to a computation simultaneously. For our purposes, a parallel computer has three types of parts:
- Processors
- Memory modules
- Communication / synchronization network

Key points:
- All processors must be busy for peak speed.
- Local memory is directly connected to each processor.
- Accessing local memory is much faster than accessing other memory.
- Synchronization is expensive, but necessary for correctness.

Distributed Memory Machines

Conceptually, the Paragon is very similar to the CM-5, except that:
- Bandwidth is higher
- Latency is longer
- The network topology is a two-dimensional torus

To program these machines (a message-passing sketch appears after the Parallel Algorithms discussion below):
- Divide the problem to minimize the number of messages while retaining parallelism
- Convert all references to global structures into references to local pieces
- Optimization: pack messages together to reduce fixed overhead (almost always needed)
- Optimization: carefully schedule messages (usually done by the library)

Every processor has a memory that the others cannot access.

Advantages:
- Can be scalable
- Can hide the latency of communication

Disadvantages:
- Hard to program
- Program and O/S (and sometimes data) must be replicated

Shared-Memory Machines

The interconnection network shown in the accompanying figure is actually for the BBN Butterfly, but the C-90 is in the same spirit.

- These machines share data by direct access.
- Potentially conflicting accesses must be protected by synchronization (illustrated in the second sketch below).
- Simultaneous access to the same memory bank will cause contention, degrading performance.
- Some access patterns will collide in the network (or bus), also causing contention.
- Many machines have caches at the processors.
- All these features make it profitable for each processor to concentrate on one area of memory that the others access infrequently.

All processors access the same memory.

Advantages:
- Easy to program (correctly)
- Can share code and data among processors

Disadvantages:
- Hard to program (optimally)
- Not scalable, due to bandwidth limitations

Distributed Shared Memory Machines

These combine the advantages of shared and distributed memory, and many hierarchical designs are appearing:
- Typically, "nodes" with 4 to 32 processors
- Each processor has a local cache
- Processors within a node access shared memory
- Nodes can get data from or put data to other nodes' memories

Parallel Algorithms

A parallel algorithm is a collection of tasks and a partial ordering between them. This is just one view of parallel programs (and algorithms).

Note: the tasks mentioned here may not be the tasks you are used to. Here, "task" means "a sequence of operations that is scheduled atomically and can run to completion once started." That is, a task is the computation between synchronization points. For example, an MPI process is a collection of tasks.

Design goals:
- Match tasks to the available processors (exploit parallelism).
- Minimize ordering (avoid unnecessary synchronizations).
- Utilize the remaining ordering (exploit locality).
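To make the distributed-memory programming steps above concrete, here is a minimal sketch of one relaxation step on a block-distributed 1-D array, written in C with MPI. The decomposition, the array names, and the local size (NLOCAL, u, unew) are illustrative choices, not taken from these notes; the point is that every reference is to a local piece and that boundary values travel in one packed message per neighbor rather than one message per element.

```c
/* Sketch: one step of a 1-D relaxation on a distributed array.
 * Each rank owns a local slice plus two "ghost" cells that mirror
 * its neighbors' boundary values.  NLOCAL, u, and unew are
 * illustrative names, not from the notes. */
#include <mpi.h>

#define NLOCAL 1000             /* points owned by this processor */

int main(int argc, char **argv)
{
    int rank, size;
    double u[NLOCAL + 2], unew[NLOCAL + 2];   /* +2 ghost cells */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    for (int i = 0; i < NLOCAL + 2; i++) u[i] = (double)i;

    /* Exchange boundary values: one message per neighbor per step
     * (packing reduces the fixed per-message overhead). */
    MPI_Sendrecv(&u[NLOCAL], 1, MPI_DOUBLE, right, 0,
                 &u[0],      1, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[1],          1, MPI_DOUBLE, left,  1,
                 &u[NLOCAL + 1], 1, MPI_DOUBLE, right, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Update interior points using only local (and ghost) data. */
    for (int i = 1; i <= NLOCAL; i++)
        unew[i] = 0.5 * (u[i - 1] + u[i + 1]);

    MPI_Finalize();
    return 0;
}
```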
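For the shared-memory case, the sketch below (C with POSIX threads) shows the discipline described earlier: each thread works mostly on its own slice of a shared array, and acquires a lock only for the single update that genuinely conflicts. The thread count, array size, and function names are illustrative, not part of the original notes.

```c
/* Sketch: several threads accumulate into one shared total.
 * Private work needs no synchronization; the one shared update is
 * protected by a mutex and done once per thread, not per element.
 * NTHREADS, N, worker, etc. are illustrative names. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N        1000000

static double data[N];
static double total = 0.0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    long id = (long)arg;
    double partial = 0.0;

    /* Private work on a contiguous slice: good locality, no locks. */
    for (long i = id * (N / NTHREADS); i < (id + 1) * (N / NTHREADS); i++)
        partial += data[i];

    /* Shared update: potentially conflicting, so it is protected. */
    pthread_mutex_lock(&lock);
    total += partial;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];

    for (long i = 0; i < N; i++) data[i] = 1.0;
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, worker, (void *)t);
    for (long t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);

    printf("total = %f\n", total);
    return 0;
}
```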
Sources of parallelism:
- Data parallelism: updating array elements simultaneously.
- Functional parallelism: conceptually different tasks which combine to solve the problem.
- Speculative parallelism: temporarily ignore the partial ordering, and repair later if this causes problems.
This list is not exhaustive.

Data Parallelism in Algorithms

Data-parallel algorithms exploit the parallelism inherent in many large data structures: many elements in the structure mean many update operations.

Example: red-black relaxation, where all red points can be updated in parallel (a sketch appears at the end of these notes).

Analysis:
- Scalable parallelism
- Synchronization and locality may be suspect

Functional Parallelism in Algorithms

Functional parallelism exploits the parallelism between the parts of many systems: many pieces to work on mean many independent operations.

Example: aeroelasticity (wing design), where CFD and CSM (and others, if need be) can be evaluated in parallel.

Analysis:
- Parallelism limited
- Synchronization probably good

Parallel Languages

A parallel language provides an executable notation for implementing a parallel algorithm.

Design criteria:
- How are parallel operations defined? Static tasks vs. dynamic tasks vs. implicit operations
- How is data shared between tasks? Explicit communication/synchronization vs. shared memory
- How is the language implemented? Low-overhead runtime systems vs. optimizing compilers

Usually a language reflects a particular type of parallelism.

Data-Parallel Languages

Data parallelism originated on SIMD machines, but the concept has since been generalized and implemented on many architectures; the definition is still somewhat fuzzy. HPF is largely based on data parallelism. Other data-parallel languages of note: C*, Dataparallel C (derived from C*), NESL, Fortran D, Fortran 90D, and Vienna Fortran.

Data-parallel languages provide an abstract, machine-independent model of parallelism:
- Fine-grain parallel operations, such as element-wise operations on arrays
- Shared data in large, global arrays with mapping "hints"
- Implicit synchronization between operations
- Partially explicit communication from operation definitions

Advantages:
- Global operations are conceptually simple
- Easy to program (particularly for certain scientific applications)

Disadvantages:
- Unproven compilers

Data-Parallel Languages, cont.

Data parallelism is (an attempt at) a high-level language. High-level languages have inherent advantages and disadvantages:
- It's easier to concentrate on the high-level ideas. This means you can write the program faster, and you're more likely to get it right.
- The underlying system takes over the low-level details. This means you can't micro-manage the execution, and you can't always tell what will happen.

Abstractions like data parallelism split the work between the programmer and the compiler.

Programmer's task: solve the problem in this model.
- Concentrate on high-level structure and concepts
- Aggregate operations on large data structures
- Data in global arrays with mapping information

Compiler's task: map the conceptual (massive) parallelism onto the physical (finite) machine.
- Fill in the grungy details
- Guided by user "hints" such as mapping information
- Optimize computation and communication

Message-Passing Systems

For comparison to data-parallel languages:
- The program is based on relatively coarse-grain tasks
- Each task has a separate address space and a processor number
- Data is shared by explicit messages
- Point-to-point and collective communication patterns
- Examples: MPI, PVM, Occam

Advantages: close to the hardware.
Disadvantages: many low-level details.
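As a small illustration of the two communication patterns just named, the following C sketch uses one MPI point-to-point exchange and one collective reduction. It assumes at least two processes; the variable names and the tag value are arbitrary choices made for the example.

```c
/* Sketch: point-to-point vs. collective communication in MPI.
 * Rank 0 sends one value to rank 1, then all ranks combine their
 * local results with a collective reduction.  Requires >= 2 ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double local = 1.0, global = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Point-to-point: explicit sender, receiver, and message tag. */
    if (rank == 0)
        MPI_Send(&local, 1, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(&local, 1, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    /* Collective: every rank participates; the library schedules
     * the underlying messages. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global);

    MPI_Finalize();
    return 0;
}
```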
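Finally, returning to the red-black relaxation example from the Data Parallelism in Algorithms discussion: the sketch below expresses one sweep in C with an OpenMP parallel loop rather than in a data-parallel language. The grid size, array name, and coloring convention are illustrative assumptions; the essential point is that every point of one color can be updated independently.

```c
/* Sketch: one sweep of red-black relaxation on an N x N grid.
 * Points are colored like a checkerboard; each point of one color
 * reads only points of the other color, so every half-sweep can be
 * updated fully in parallel.  N and u are illustrative names. */
#include <omp.h>

#define N 512

static double u[N][N];

void red_black_sweep(void)
{
    /* color 0 = one parity class ("red"), color 1 = the other ("black") */
    for (int color = 0; color < 2; color++) {
        #pragma omp parallel for
        for (int i = 1; i < N - 1; i++)
            for (int j = 1 + (i + color) % 2; j < N - 1; j += 2)
                u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] +
                                  u[i][j - 1] + u[i][j + 1]);
    }
}
```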