Choosing the Processing Macro

Today’s “conventional wisdom”:
- Complex memory hierarchy driving superscalar, superpipelined designs with branch prediction, fast TLBs, multiple function units, multi-ported register files, ...
Does that make sense in a PIM environment?
- Large bandwidth from direct row buffer access
- Reduced latency (no chip crossings)
- Naturally closely coupled parallelism
Answer: No! Better choice: design for:
- Maximum performance “per transistor”
- Minimum power per MIPS
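
The two design targets above can be read as simple figures of merit. The following is a minimal sketch of how they might be computed and compared, not a measurement of any real processor: the `CoreDesign` type, the `better_for_pim` helper, and every number in the example are invented placeholders chosen only to make the arithmetic easy to follow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CoreDesign:
    """Hypothetical description of a processing core (illustrative only)."""
    name: str
    mips: float         # sustained millions of instructions per second
    transistors: float  # transistor count for the core
    watts: float        # power drawn at that performance level

    def mips_per_million_transistors(self) -> float:
        # "Performance per transistor", scaled to millions of transistors.
        return self.mips / (self.transistors / 1e6)

    def milliwatts_per_mips(self) -> float:
        # "Power per MIPS", expressed in milliwatts.
        return self.watts * 1e3 / self.mips


def better_for_pim(a: CoreDesign, b: CoreDesign) -> Optional[CoreDesign]:
    """Return the design that wins on both metrics, or None if the result is mixed."""
    a_perf, b_perf = a.mips_per_million_transistors(), b.mips_per_million_transistors()
    a_pow, b_pow = a.milliwatts_per_mips(), b.milliwatts_per_mips()
    if a_perf > b_perf and a_pow < b_pow:
        return a
    if b_perf > a_perf and b_pow < a_pow:
        return b
    return None


# Placeholder numbers, NOT data from the source or from any real chip.
complex_core = CoreDesign("complex superscalar", mips=400, transistors=10e6, watts=20)
simple_core = CoreDesign("simple PIM core", mips=100, transistors=0.5e6, watts=0.5)

for core in (complex_core, simple_core):
    print(core.name,
          core.mips_per_million_transistors(), "MIPS per million transistors,",
          core.milliwatts_per_mips(), "mW per MIPS")
```

With these made-up inputs the simpler core has lower absolute performance but comes out ahead on both ratios, which is the kind of trade-off the slide argues for when many such cores sit next to the memory.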