This program is harder for the compiler.
- Allocate portion of array on each processor based on DISTRIBUTE
- Apply owner-computes rule analytically based on left-hand side
- Detect communication from dependence analysis & intrinsics
- Here, it really pays to transform the program!
- Reorder computation to always precompute the next pivot column
- Rearrange communication to pipeline the series of updates
- Do broadcasts asynchronously
- Net result: 2× speedup
- Use standard numbering for processors
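
The first three compiler steps above (data allocation, owner-computes, communication detection) can be illustrated with a small serial sketch. It is not how any real HPF compiler is implemented: the function names (`block_owner`, `cyclic_owner`, `owner_computes`, `needs_communication`) and the sample statement `a[i] = b[i + 1]` are assumptions made for illustration, and processors are simply numbered 0..p-1.

```python
def block_owner(i, n, p):
    """Processor owning global index i when n elements are BLOCK-distributed over p processors."""
    block = -(-n // p)  # ceil(n / p): elements per processor
    return i // block


def cyclic_owner(i, p):
    """Processor owning global index i under a CYCLIC distribution over p processors."""
    return i % p


def owner_computes(n, p, owner):
    """Assign each iteration of `a[i] = ...` to the processor that owns a[i] (the left-hand side)."""
    work = {q: [] for q in range(p)}
    for i in range(n):
        work[owner(i)].append(i)
    return work


def needs_communication(n, owner):
    """Iterations of `a[i] = b[i + 1]` whose right-hand side lives on another processor.

    Assumes a and b are distributed identically (hypothetical example statement).
    """
    return [i for i in range(n - 1) if owner(i) != owner(i + 1)]


n, p = 8, 4
block_work = owner_computes(n, p, lambda i: block_owner(i, n, p))
cyclic_work = owner_computes(n, p, lambda i: cyclic_owner(i, p))
block_comm = needs_communication(n, lambda i: block_owner(i, n, p))

print(block_work)   # iterations per processor under BLOCK
print(cyclic_work)  # iterations per processor under CYCLIC
print(block_comm)   # iterations that read a remote element under BLOCK
```

Under BLOCK, only the iterations at block boundaries read a remote element, which is the kind of pattern a compiler can derive analytically from the distribution and the subscripts rather than by testing every iteration.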
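
The reordering described above, precomputing the next pivot column before the rest of the update, can be sketched serially as follows. This is a minimal illustration under stated assumptions, not the transformed program from the slides: it uses LU factorization without pivoting, the names `lu_standard` and `lu_lookahead` are hypothetical, and the point where a distributed version could start an asynchronous broadcast is marked only by a comment.

```python
def lu_standard(a):
    """Right-looking LU without pivoting; returns a copy holding L below the diagonal and U on/above it."""
    n = len(a)
    a = [row[:] for row in a]
    for k in range(n - 1):
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]                  # form pivot column k (the multipliers)
        for j in range(k + 1, n):               # update every trailing column
            for i in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]
    return a


def lu_lookahead(a):
    """Same factorization, reordered so pivot column k+1 is ready before the bulk of update k."""
    n = len(a)
    a = [row[:] for row in a]
    for i in range(1, n):                       # prologue: form pivot column 0
        a[i][0] /= a[0][0]
    for k in range(n - 1):
        nxt = k + 1
        for i in range(k + 1, n):               # 1. update only column k+1 ...
            a[i][nxt] -= a[i][k] * a[k][nxt]
        for i in range(nxt + 1, n):             # ... and scale it into the next pivot column
            a[i][nxt] /= a[nxt][nxt]
        # A distributed version could start broadcasting column k+1 here,
        # overlapping the transfer with the remaining updates below.
        for j in range(k + 2, n):               # 2. update the remaining trailing columns
            for i in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]
    return a


matrix = [[4.0, 3.0, 2.0],
          [2.0, 4.0, 1.0],
          [1.0, 2.0, 5.0]]
standard = lu_standard(matrix)
lookahead = lu_lookahead(matrix)
max_diff = max(abs(x - y) for r, s in zip(standard, lookahead) for x, y in zip(r, s))
print(max_diff)
```

Both versions apply the same updates to each element in the same order, so they produce the same factors; only the schedule changes, which is what lets communication of the next pivot column overlap with computation.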