1 | "Back of the envelope" where you get good intuition for why the performance is what it is |
2 | Detailed simulations |
3 | Communication Overhead in classical data parallel (edge over area) with NO memory hierarchy NO latency where dim is geometric dimension |
4 | tcomm |
5 | tfloat |
6 | (grain size) |
7 | 1/dim |
8 | a |