I learn something new every day! Thanks for mentioning this. For other readers: Agner Fog documents this in section 22.18, "Mirroring memory operands".

I knew that similar optimizations exist, namely store-to-load forwarding, but I didn't know that AMD had experimented with mapping in-flight writes straight into the register file. It sounds like they've abandoned this approach, though: Zen 3 doesn't feature it, supposedly because it's expensive to implement. So for all intents and purposes, it doesn't exist anymore, and it probably won't be brought back in the same form.

I do still think this is better solved by ISA changes. Doing it at the uarch level will be either flaky or more costly. It is absolutely possible, but only with tradeoffs that may not be acceptable. Intel's APX extension doubles the number of GPRs from 16 to 32 and improves orthogonality, so there's at least work in that direction at the ISA level, and I think that's what we're realistically going to use soon.