1 |
Code is iteratively generated and timed until the optimal case is found. We try:
- Differing NBs (block sizes)
- Breaking false dependencies
- M, N and K loop unrolling
|
2 |
Cache-based multiply optimizes for:
- TLB access
- L1 cache reuse
- FP unit usage
- Memory fetch
- Register reuse
- Loop overhead minimization
|
3 |
Takes a couple of hours to run.
|
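The generate-and-time search in row 1 can be illustrated with a minimal sketch. It assumes a cache-blocked matrix multiply whose only tunable parameter is the block size NB. The names `blocked_matmul`, `time_once` and `search_block_size` are hypothetical and not taken from the original. A real tuner would emit and compile many code variants, including different unrollings, which is why a full run takes hours. Pure Python timings will not show the cache effects that compiled code would, so the sketch shows only the shape of the search loop.

```python
import time


def blocked_matmul(a, b, n, nb):
    """Multiply two n x n matrices (lists of lists) using nb x nb blocks."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, nb):
        i_end = min(ii + nb, n)
        for kk in range(0, n, nb):
            k_end = min(kk + nb, n)
            for jj in range(0, n, nb):
                j_end = min(jj + nb, n)
                # Work on one block of A, B and C at a time.
                for i in range(ii, i_end):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, k_end):
                        aik = row_a[k]
                        row_b = b[k]
                        for j in range(jj, j_end):
                            row_c[j] += aik * row_b[j]
    return c


def time_once(fn, *args):
    """Return the wall-clock seconds taken by a single call to fn."""
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start


def search_block_size(n, candidates, repeats=3):
    """Time every candidate NB and return (fastest NB, all timings)."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for nb in candidates:
        # Keep the best of several runs to reduce timing noise.
        timings[nb] = min(
            time_once(blocked_matmul, a, b, n, nb) for _ in range(repeats)
        )
    best = min(timings, key=timings.get)
    return best, timings


if __name__ == "__main__":
    best, timings = search_block_size(64, [4, 8, 16, 32])
    for nb, seconds in timings.items():
        print(f"NB={nb:3d}  {seconds:.4f} s")
    print("fastest NB:", best)
```

The search keeps the minimum of several runs per candidate because a single timing is easily disturbed by other activity on the machine. The fastest NB will vary between machines and between runs.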