"Back of the envelope" where you get good intuition for why the performance is what it is |
Detailed simulations |
Communication Overhead in classical data parallel (edge over area) with NO memory hierarchy NO latency where dim is geometric dimension |
tcomm |
tfloat |
(grain size) |
1/dim |
a |