When a block of source data is replicated, any or all of
the processors owning it can take part in the communication.
In the simplest case, one of the source processors is chosen to send to all
processors owning the destination block. Ideally, the sends should be
spread over as many source processors as possible so that the
available communication bandwidth is fully utilized (shown in Figure ).
This was observed by Mark Young [64].
The basic idea is to divide the set of destination processors among the set of source processors, so that each source multicasts to its assigned subset of destinations. Both the source set and the destination set can be computed from information in the template and processor data structures.
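The text does not specify how the destinations are divided, so the following is only a minimal sketch of one possible scheme: round-robin assignment. The function name `assign_destinations` and the use of plain integer processor IDs are assumptions made for illustration. In practice the two input lists would be derived from the template and processor data structures.

```python
def assign_destinations(sources, destinations):
    """Divide destination processors among source processors.

    Hypothetical helper: uses round-robin assignment, which is one
    possible policy and not necessarily the one used in the original
    system. Assumes the source IDs are distinct.
    Returns a dict mapping each source ID to the list of destination
    IDs it should multicast to.
    """
    if not sources:
        raise ValueError("at least one source processor is required")
    plan = {src: [] for src in sources}
    for i, dst in enumerate(destinations):
        # Destination i goes to source (i mod number of sources), so
        # subset sizes differ by at most one.
        plan[sources[i % len(sources)]].append(dst)
    return plan


# Three processors hold replicas of the source block; seven
# processors own the destination block.
plan = assign_destinations([0, 1, 2], [10, 11, 12, 13, 14, 15, 16])
for src, dsts in plan.items():
    print(f"source {src} multicasts to {dsts}")
```

With round-robin assignment the subset sizes differ by at most one, so no single source carries a disproportionate share of the sends. A topology-aware policy that pairs each destination with a nearby source would be a natural refinement, but it needs information about the network that this sketch does not model.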