
I remember hearing repeatedly from low-level hardware people that moving data is the thing that really costs power in a computer, so I was wondering what that actually looks like on my Tiger Lake laptop. Apparently, in practice this mostly means that when you access further-out data, power usage goes up a bit while performance goes down drastically (so the energy cost for a given amount of work goes up a lot)?

I wanted to see the cost of pulling cache lines up through the memory hierarchy, so I decided to make one of those charts people sometimes share that show cache characteristics by plotting memory throughput against working set size. I added the performance counter data showing L1D/L2/L3 hit/miss rates (which should more or less sum up to the rate at which I'm reading from memory, modulo a bit of overhead from my measurement harness), and I also added the RAPL psys data, which shows how much power is being used by the platform.
I'm doing independent reads here (so the CPU should be able to issue a bunch of them in parallel).
I've also added power use and read rate from another run with dependent loads and 8K stride (colored gray), so that you can see the power use when the CPU is getting much less work done because it has to wait for data a lot.
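To make the difference between the two access patterns concrete, here is a minimal sketch in C. It is not the harness used for the chart; the function names (`read_independent`, `build_chain`, `read_dependent`) are made up for illustration, and the dependent variant assumes a simple pointer-chasing layout where each slot stores the offset of the next slot.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Independent reads: every address depends only on the loop counter,
 * so the CPU can have many of these loads in flight at once. */
uint64_t read_independent(const volatile uint8_t *buf, size_t size, size_t stride)
{
    uint64_t sum = 0;
    for (size_t off = 0; off < size; off += stride)
        sum += buf[off]; /* single-byte read, one per stride */
    return sum;
}

/* Hypothetical setup for dependent loads: slot i stores the byte offset
 * of slot i+1, and the last slot points back to offset 0.
 * Assumes stride >= sizeof(size_t) and size >= stride. */
void build_chain(uint8_t *buf, size_t size, size_t stride)
{
    size_t n = size / stride;
    for (size_t i = 0; i < n; i++) {
        size_t next = ((i + 1) % n) * stride;
        memcpy(buf + i * stride, &next, sizeof next);
    }
}

/* Dependent reads: the next address is only known once the previous
 * load has returned, so the loads are serialized. */
size_t read_dependent(const uint8_t *buf, size_t steps)
{
    size_t off = 0;
    while (steps--)
        memcpy(&off, buf + off, sizeof off);
    return off;
}
```

With a 64-byte stride, `read_independent` touches one byte per cache line; calling `build_chain` with a stride of 8192 and then `read_dependent` gives the serialized pattern.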

Not super scientific, and I assume the read rate going up at the beginning is because I didn't measure quite right, but still interesting, I think?

At the transition from L1D to L2 (which is very sharp because of L1D's PLRU replacement policy, I think), you can see that the read rate only drops a little bit, and power use goes up by something like 1.4W.
At the transition from L2 to L3, power usage only goes up by another 0.8W, but the read rate tanks by 71%, so the energy cost I'm effectively paying per read is going way up.
At the transition from L3 to main memory, the read rate approximately halves again, and power use kinda fluctuates?
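As a rough way to put a number on "energy per read going way up": energy per read is power divided by read rate, so the factor by which it grows across a transition can be sketched as below. The helper name is made up, and the absolute wattages in the usage note are placeholders, since only the deltas are stated above.

```c
/* Factor by which energy per read changes across a transition.
 * energy_per_read = power / read_rate, so the ratio is
 * (power_after / power_before) / (rate_after / rate_before).
 * rate_drop_fraction is the fractional drop in read rate, e.g. 0.71 for 71%. */
double energy_per_read_ratio(double power_before_w, double power_after_w,
                             double rate_drop_fraction)
{
    double rate_ratio = 1.0 - rate_drop_fraction; /* rate_after / rate_before */
    return (power_after_w / power_before_w) / rate_ratio;
}
```

For the L2 to L3 transition, the 71% drop in read rate alone gives a factor of about 3.4 (1 / 0.29) even if power stayed flat; with a hypothetical 14 W before and 14.8 W after, `energy_per_read_ratio(14.0, 14.8, 0.71)` comes out a bit above 3.6.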

[All memory accesses were to a 1G test region mapped with a single page table entry, so there should be no TLB effects in here. The test ran on a single core, doing single-byte reads with 64-byte stride, with turbo disabled.]
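One plausible way to get a 1G region behind a single page table entry on Linux is a 1 GiB huge page requested through `mmap`. The sketch below assumes that approach (the original setup is not specified) and assumes 1 GiB huge pages have been reserved on the system; `map_test_region` and `REGION_SIZE` are illustrative names.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Fallback definitions in case the libc headers lack the 1 GiB encoding. */
#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT) /* log2(1 GiB) = 30 */
#endif

#define REGION_SIZE ((size_t)1 << 30)

/* Try to map 1 GiB as one huge page. If that fails (for example because
 * no 1 GiB pages are reserved), fall back to ordinary pages, in which
 * case TLB effects will show up in the measurement again.
 * *used_huge is set to 1 if the huge page mapping succeeded, else 0.
 * Returns NULL if both attempts fail. */
void *map_test_region(int *used_huge)
{
    void *p = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                   -1, 0);
    *used_huge = (p != MAP_FAILED);
    if (p == MAP_FAILED)
        p = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

Checking `used_huge` before measuring matters here, because the fallback path silently reintroduces page walks.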
[This was measured in a single pass along the X axis; to do this properly, I probably should have looped around a few times...]
[I added the comparison with dependent loads in an attempt to sorta show how much of the power cost is actually for keeping the core running vs for memory access, but I'm not sure how useful that is... it seems like the platform power with the core running at normal speed, but without it getting any real work done, might be around 12W?]
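For context on where wattage numbers like these come from: RAPL exposes cumulative energy counters, and average power is the energy delta divided by elapsed time. The sketch below assumes the counter is sampled through the Linux powercap sysfs files (`energy_uj` and `max_energy_range_uj` of the zone whose `name` is `psys`); how the data for the chart was actually collected is not stated, and `avg_power_watts` is an illustrative name.

```c
#include <stdint.h>

/* Average power in watts between two samples of a cumulative energy
 * counter given in microjoules. The counter wraps around once it
 * reaches max_range_uj, so a smaller second sample is treated as
 * exactly one wraparound (an approximation). */
double avg_power_watts(uint64_t e0_uj, uint64_t e1_uj,
                       uint64_t max_range_uj, double seconds)
{
    if (seconds <= 0.0)
        return 0.0; /* no meaningful interval */
    uint64_t delta_uj = (e1_uj >= e0_uj)
                            ? e1_uj - e0_uj
                            : (max_range_uj - e0_uj) + e1_uj;
    return (double)delta_uj / 1e6 / seconds;
}
```

For example, two samples taken one second apart that differ by 12,000,000 microjoules correspond to an average of 12 W.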

Jann Horn on Nostr: I remember hearing repeatedly from low-level hardware people that moving data is the ...