
Unstructured Communication

In distributed memory MIMD architectures, there is typically a non-trivial communication latency or startup cost. Hence, it is attractive to vectorize messages to reduce the number of startups. For unstructured communication, this optimization can be achieved by performing the entire preprocessing loop before the communication so that the schedule routine can combine messages to the maximum extent. The preprocessing loop is also called the ``inspector'' loop [20,28].

Example 1 (precomp_read) Consider the statement:


         FORALL(I=1:N) A(I)=B(2*I+1)

The array B is marked as precomp_read since the distributed dimension subscript is written as f(I) = 2*I+1, which is invertible as g(J) = (J-1)/2.
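
For concreteness, the following is a minimal sketch of these two mappings as they are used by the inspector below; the names f and g match the generated code, but these function bodies are illustrative rather than compiler output:

      ! f maps an iteration to the global index of B that it reads;
      ! g is its inverse and recovers the iteration from a global
      ! index of B (illustrative definitions for this example only).
      INTEGER FUNCTION f(I)
        INTEGER I
        f = 2*I + 1              ! subscript function of B(2*I+1)
      END FUNCTION f

      INTEGER FUNCTION g(J)
        INTEGER J
        g = (J - 1)/2            ! inverse: g(f(I)) = I
      END FUNCTION g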


 1  count=1
 2  call set_BOUND(lb,ub,st,1,N,1)   ! compute local bounds of the FORALL
 3  DO I=1, N/P                      ! sweep the entire local array
 4    receive_list(count)=global_to_proc(f(I))
 5    send_list(count)= global_to_proc(g(I))
 6    local_list(count) = global_to_local(g(I))
 7    count=count+1
 8  END DO
 9  isch=schedule1(receive_list, send_list, local_list, count)
10  call precomp_read(isch, tmp, B)  ! vectorized communication
11  count=1
12  DO I=1, N/P
13    if((I.ge.lb).and.(I.le.ub).and.(mod(I-lb,st).eq.0))
   &      A(I) = tmp(count)
14    count=count+1
15  END DO

The preprocessing loop is given in lines 1-9. Note that this preprocessing loop executes concurrently on each processor. The loop covers the entire local array bounds, since each processor has to calculate the receive_list as well as the send_list of processors. Each processor also records the local indices of the array elements that it needs.

The schedule1 routine does not need to communicate; it only constructs the scheduling data structure isch. The schedule isch can be used to carry out identical patterns of data exchange on several different but identically distributed arrays or array sections, and the same schedule can be reused repeatedly to carry out a particular pattern of data exchange on a single distributed array. In these cases, the cost of generating the schedule is amortized by executing the scheduling code only once. This analysis can be performed at compile time; hence, if the compiler recognizes that the same schedule can be reused, it does not generate code for scheduling but passes a pointer to the already existing schedule. Furthermore, the preprocessing computation can be hoisted as early as possible by analyzing definition-use chains [56]. The reduction in communication overhead can be significant if this analysis moves the scheduling code out of one or more nested loops. In the above example, local_list (line 6) stores the indices of a one-dimensional array. In general, however, local_list stores indices from a multi-dimensional Fortran array, using the usual column-major subscript calculation to map them to a one-dimensional index.
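
As a hypothetical illustration of this reuse, consider the inspector of lines 1-9 hoisted out of an enclosing time-step loop (the outer loop and the name nsteps are assumed here, not part of the example above); the schedule is built once and its cost is amortized over all iterations:

      ! Schedule built once, outside the time-step loop ...
      isch = schedule1(receive_list, send_list, local_list, count)
      DO step = 1, nsteps
        ! ... and reused for the same exchange pattern every step.
        call precomp_read(isch, tmp, B)
        ! (the executor loop of lines 11-15 goes here)
      END DO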

The precomp_read primitive performs the actual communication using the schedule. Once the communication is performed, the data are ordered in a one-dimensional array, and the computation (lines 12-15) uses this one-dimensional array. The precomp_read primitive brings an element into tmp for each local array element, since the preprocessing loop covers the entire local array. The if statement masks the assignment to preserve the semantics of the original loop; for example, for FORALL(I=2:N:2) only every other local element would satisfy the mask.

Example 2 (gather) Consider the statement:


         FORALL(I=1:N) A(I)=B(V(I))

The array B is marked as requiring gather communication since the subscript V(I) is known only at run time. The receiving processors know what non-local data they need from other processors, but a processor may not know what local data it needs to send to other processors. For simplicity, in this example we assume that the indirection array V is replicated. If V is not replicated, it must also be communicated in order to compute the receive list on each processor.
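
A minimal sketch of that case, reusing the gather machinery shown below: since the subscript of V is the loop index itself, the needed values of V can be fetched into a temporary buffer before the inspector runs (the names v_recv_list, v_local_list, isch_v, and Vtmp are illustrative):

      count=1
      DO I=lb,ub,st
        v_recv_list(count) = global_to_proc(I)    ! owner of V(I)
        v_local_list(count) = global_to_local(I)
        count=count+1
      END DO
      isch_v = schedule2(v_recv_list, v_local_list, count)
      call gather(isch_v, Vtmp, V)   ! the inspector then reads Vtmp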


 1  count=1
 2  call set_BOUND(lb,ub,st,1,N,1) 
 3  DO I=lb,ub,st        
 4    receive_list(count)=global_to_proc(V(I))
 5    local_list(count) = global_to_local(V(I))
 6    count=count+1
 7  END DO
 8  isch = schedule2(receive_list, local_list, count)
 9  call gather(isch, tmp, B)
10  count=1
11  DO I=lb,ub,st
12    A(I) = tmp(count)
13    count=count+1
14  END DO

Once scheduling is completed, every processor knows exactly which non-local data elements it needs to send to and receive from other processors. Recall that the task of schedule2 is to determine exactly which send and receive communications must be carried out by each processor. The scheduler first figures out how many messages each processor will have to send and receive during the data exchange. Each processor computes the number of elements (receive_list) and the local index of each element (local_list) that it needs from every other processor. Within the schedule2 routine, processors communicate to combine these lists (a fan-in type of communication). At the end of this step, each processor holds its send and receive lists. Each processor then transmits the list of required array elements (local_list) to the appropriate processors, after which every processor has the information required to set up the send and receive messages of the scheduled communication. The communication itself is carried out by the gather primitive. Note that, unlike schedule2, schedule1 does not need to communicate to build its schedule.
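
The count-exchange phase inside schedule2 can be pictured with the following sketch. PARTI's internal implementation is not shown in this paper, so this is illustrative only; it assumes an MPI-style message-passing layer and realizes the combine step as an all-to-all exchange of counts:

      INCLUDE 'mpif.h'
      ! P (the number of processors) is assumed to be a named constant.
      INTEGER n_need(0:P-1), n_send(0:P-1), ierr
      ! n_need(p) holds the number of elements this processor requires
      ! from processor p, tallied from receive_list in the inspector.
      call MPI_ALLTOALL(n_need, 1, MPI_INTEGER,
     &                  n_send, 1, MPI_INTEGER,
     &                  MPI_COMM_WORLD, ierr)
      ! On return, n_send(p) is the number of indices processor p will
      ! request from this processor; the requested local_list entries
      ! are then exchanged and stored as each processor's send list.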

Example 3 (scatter) Consider the statement:


         FORALL(I=1:N) A(U(I))=B(I)

The array A is marked as requiring the scatter primitive since the subscript U(I) is known only at run time. Note that the owner-computes rule is not applied here: the processor performing the computation knows the processor and the corresponding local offset where the resultant element must be written.


 1  count=1
 2  call set_BOUND(lb,ub,st,1,N,1) 
 3  DO I=lb,ub,st        
 4    send_list(count)=global_to_proc(U(I))
 5    local_list(count) = global_to_local(U(I))
 6    count=count+1
 7  END DO
 8  isch = schedule3(send_list, local_list, count)
 9  call scatter(isch, A, B)

Unlike with the gather primitive, each processor computes a send_list containing processor ids and a local_list containing the local indices at which the communicated data must be stored. The schedule3 routine is similar to schedule2 of the gather primitive, except that schedule3 does not need to send the local indices in a separate communication step.
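
A sketch of the message packing this implies (the names elems_for_p, index_buf, and data_buf are illustrative): because the sender already knows the destination local offset of each element, index and value can travel together in one vectorized message:

      ! Pack, for one destination processor p, the destination local
      ! index next to the value itself; the receiver can then store
      ! each value directly, with no prior index exchange.
      DO k = 1, nsend_p                 ! elements destined for p
        i = elems_for_p(k)              ! hypothetical iteration list
        index_buf(k) = local_list(i)    ! local offset of A on receiver
        data_buf(k)  = B(i)             ! value to be written there
      END DO
      ! ... index_buf and data_buf are sent to p in a single message.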

The gather and scatter operations are powerful enough to provide the ability to read and write distributed arrays with vectorized communication. These two primitives are available in PARTI (Parallel Automatic Runtime Toolkit at ICASE) [28], which is designed to support irregular patterns of distributed array accesses efficiently. The PARTI primitives, together with the other communication primitives and intrinsic functions, form the unstructured run-time support system of our Fortran 90D/HPF compiler.


