TIL: Processing-Using-SRAM - When the Cache Becomes the Processor
Published on 2026-06-27
We have seen enough magic happen in DRAM via RowClone, Ambit, and SIMDRAM. As I progressed through the later chapters of the primer, I came across Processing-Using-SRAM, which asks a different question:
If DRAM can compute, why not experiment with caches?
Modern CPUs contain tens of megabytes of SRAM spread across their cache hierarchies. Normally these SRAM arrays simply store data and serve it to the CPU. Processing-Using-SRAM techniques instead exploit the internal behavior of SRAM arrays to perform computation directly inside the cache itself.
It's the same story as Ambit. When two SRAM rows are activated simultaneously, the shared bitlines and sensing circuitry naturally perform bitwise operations: because both rows drive the same bitline, the resulting voltage depends on the values stored in both cells. This behavior can be exploited to implement Boolean operations such as AND and NOR directly inside the SRAM array.
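To make the two-row trick concrete, here is a minimal behavioral sketch in Python. It is a toy model under stated assumptions, not a circuit simulation: each column is assumed to have a bitline (BL) and a complementary bitline (BLB), both precharged high, where a cell storing 0 discharges BL and a cell storing 1 discharges BLB. The function name `sense_two_rows` is made up for this illustration.

```python
def sense_two_rows(row_a, row_b):
    """Toy model of activating two SRAM rows at once (hypothetical helper).

    Assumption: BL and BLB start precharged to 1. Any activated cell
    storing 0 pulls BL low; any activated cell storing 1 pulls BLB low.
    """
    and_out, nor_out = [], []
    for a, b in zip(row_a, row_b):
        bl, blb = 1, 1          # both lines precharged high
        for cell in (a, b):
            if cell == 0:
                bl = 0          # a stored 0 discharges the bitline
            else:
                blb = 0         # a stored 1 discharges the complement
        and_out.append(bl)      # BL stays high only if both cells hold 1
        nor_out.append(blb)     # BLB stays high only if both cells hold 0
    return and_out, nor_out


row_a = [1, 1, 0, 0]
row_b = [1, 0, 1, 0]
and_bits, nor_bits = sense_two_rows(row_a, row_b)
print(and_bits)  # [1, 0, 0, 0]
print(nor_bits)  # [0, 0, 0, 1]
```

Every column is evaluated independently, so in this model a single activation yields AND and NOR across the whole row width at once.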
What's even cooler is that every SRAM bitline can act as a tiny compute lane. Instead of moving data from the cache into execution units, the cache itself becomes a massively parallel SIMD engine.
There are several research projects that build on this idea. Compute Cache demonstrates bulk bitwise operations directly inside processor caches. Neural Cache extends the approach to arithmetic operations for neural network inference. Duality Cache pushes things further by supporting general-purpose data-parallel execution using a SIMT-style programming model.
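One way to picture the step from bitwise logic to arithmetic is bit-serial addition: operands are stored transposed, so that row i holds bit i of every lane, and the adder walks the rows from least to most significant bit. The sketch below is my own illustration of that general idea, assuming only AND and NOR as row-wide primitives; it is not taken from any of the papers, and all function names are hypothetical.

```python
def row_and(x, y):
    return [a & b for a, b in zip(x, y)]

def row_nor(x, y):
    return [1 - (a | b) for a, b in zip(x, y)]

def row_xor(x, y):
    # XOR built from the two primitives: NOR(AND(x, y), NOR(x, y))
    return row_nor(row_and(x, y), row_nor(x, y))

def row_or(x, y):
    n = row_nor(x, y)
    return row_nor(n, n)        # NOR of a row with itself is NOT

def to_bit_rows(values, width):
    """Transpose integers: row i holds bit i of every lane."""
    return [[(v >> i) & 1 for v in values] for i in range(width)]

def from_bit_rows(rows):
    lanes = len(rows[0])
    return [sum(rows[i][lane] << i for i in range(len(rows)))
            for lane in range(lanes)]

def bit_serial_add(a_rows, b_rows):
    """Add all lanes in parallel, one bit position per step (LSB first)."""
    carry = [0] * len(a_rows[0])
    out = []
    for a, b in zip(a_rows, b_rows):
        partial = row_xor(a, b)
        out.append(row_xor(partial, carry))                     # sum bit
        carry = row_or(row_and(a, b), row_and(partial, carry))  # carry out
    out.append(carry)           # final carry becomes the top bit
    return out


a = [3, 5, 7, 12]
b = [1, 6, 9, 15]
total = from_bit_rows(bit_serial_add(to_bit_rows(a, 4), to_bit_rows(b, 4)))
print(total)  # [4, 11, 16, 27]
```

The number of steps depends on the operand width, not on the number of lanes, which is why wide arrays are attractive for this style of computation.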
The major limitation of SRAM is capacity. A CPU cache may contain tens of megabytes of SRAM, while main memory offers gigabytes of DRAM. So it's important to understand that Processing-Using-SRAM works best when the data is already present in the cache. For very large datasets, the cost of moving data into the cache can outweigh the benefits of cache-side computation.
This does not mean that Processing-Using-DRAM is better than Processing-Using-SRAM. Rather, the two complement each other. Viewed this way, the future is likely to be a memory-centric system containing both. With computation happening at every level of the memory hierarchy, it suddenly starts looking a lot like a distributed system!
[!TLDR]
Computation does not always need to happen where we traditionally expect it to.