Apple M1: Load and Store Queue Measurements

Out-of-order processors have to keep track of multiple in-flight operations at once, and they use a variety of different buffers and queues to do so. I’ve been trying to characterise and measure some of these buffers in the Apple M1 processor’s Firestorm and Icestorm microarchitectures, by timing different instruction patterns.

I’ve measured the size of the load and store queue, discovered that load and store queue entries are allocated when the ops issue from the scheduler, and released once they are non-speculative and all earlier loads and stores have been released. I may have also accidentally found a trick for manipulating load/store alias prediction. And I figured I should write it up, so other people can reproduce it, and/or find mistakes.

Fighting Loads with Loads

The general idea is to time the execution of two independent long-latency operations (instructions, or chains of dependent instructions), with a number of loads or stores between them. Usually these two long-latency operations can run in parallel, but if there are so many loads/stores that some buffer is completely filled, the machine will stall, and will have to wait until space in the buffer is freed to execute subsequent instructions. This method was described first (and much more clearly) in Henry Wong’s blog post Measuring Reorder Buffer Capacity.
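
To make that concrete, here is a minimal, self-contained sketch of such a test. It is not the harness behind the numbers in this post (that code is in the gist linked below), and it takes some liberties: a randomly permuted pointer chase provides the cache-missing long-latency loads, GCC/Clang inline assembly with .rept generates an exact number of filler loads, and it times with clock_gettime rather than performance counters. All names and counts here are mine.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define CHASE_LEN (1 << 23)   /* 64 MiB of uint64_t, so the chase loads miss every cache */
#define FILLERS "128"         /* loads between the two long-latency loads; vary and re-run */

static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

int main(void) {
    /* A random permutation, so following p = chase[p] is a chain of dependent cache misses. */
    uint64_t *chase = malloc(CHASE_LEN * sizeof *chase);
    if (!chase) return 1;
    for (uint64_t i = 0; i < CHASE_LEN; i++) chase[i] = i;
    for (uint64_t i = CHASE_LEN - 1; i > 0; i--) {
        uint64_t j = (((uint64_t)rand() << 16) ^ (uint64_t)rand()) % (i + 1);
        uint64_t t = chase[i]; chase[i] = chase[j]; chase[j] = t;
    }
    static uint64_t scratch[8];
    uint64_t p = chase[0], q = chase[1];   /* two independent chains of long-latency loads */
    uint64_t start = now_ns();
    for (int rep = 0; rep < 100000; rep++) {
        __asm__ volatile(
            "ldr %[p], [%[chase], %[p], lsl #3]\n\t"   /* long-latency load, chain A */
            ".rept " FILLERS "\n\t"
            "ldr x9, [%[buf]]\n\t"                     /* independent filler loads */
            ".endr\n\t"
            "ldr %[q], [%[chase], %[q], lsl #3]\n\t"   /* long-latency load, chain B */
            ".rept " FILLERS "\n\t"
            "ldr x9, [%[buf]]\n\t"
            ".endr\n\t"
            : [p] "+r"(p), [q] "+r"(q)
            : [chase] "r"(chase), [buf] "r"(scratch)
            : "x9", "memory");
    }
    printf("%.1f ns/iteration with " FILLERS " fillers\n",
           (double)(now_ns() - start) / 100000);
    return (int)(p + q);
}

While the buffers involved can hold the first long-latency load, all the fillers, and the second long-latency load at once, consecutive misses overlap and the time per iteration stays roughly flat as the filler count grows; once something fills up, the misses serialise and the time per iteration jumps.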

Initially, in measuring the M1, I used cache-missing loads as the “two independent long-latency operations” (as did everyone else, to my knowledge).

My result (which very roughly looked like AnandTech’s results) was that 128 loads or 107 stores could run, without stopping the final long-latency load from executing in parallel, but adding one more would cause a stall. Since the first load, 128 other loads, and the last load are in flight at the same time, I’d call this a load queue size of 130. I still believe this to be correct. On the other hand, the loads don’t require store buffers, so I incorrectly called this a 107 entry store queue.

Schedulers and Dispatch Queues

Although it is not the topic of this post, I also mapped out the structure of the dispatch queues and schedulers:

If an op can enter a scheduler, it can leave the scheduler and execute when all its dependencies are ready. If it can enter a dispatch queue, then it will enter a scheduler when there’s room, and until then, the frontend can continue. But if the scheduler and dispatch queue are full, and we have an op that needs to go to that queue, we stall. This means no more ops can enter any dispatch queues until there is room in that queue.

It is worth noting that the load/store scheduler has 48 entries, with a 10 entry dispatch queue.

(Very briefly: These were measured by filling the scheduler with some number of loads or stores, with addresses depending on the first long-latency chain, then finding two points. First, I found the last point at which an independent load could still fit into the scheduler and run, to find the scheduler size. Then, I found the number of extra load/stores, once the scheduler was full, that were required to block independent floating-point operations. That gives the dispatch queue size. Mixing operations makes it possible to figure out which schedulers/queues are separate and which are not. Should I write this up?)
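
One way such a pattern can be built (a sketch that drops into the loop and timing of the earlier example, and certainly not the exact harness used here: a and b are doubles initialised to 1.0, ia and ib are uint64_t, scratch is a small uint64_t array, and BLOCKED is a quoted count) is to derive the load addresses from the floating-point chain with FCVTZS, so the loads cannot issue until the chain finishes:

__asm__ volatile(
    ".rept 8\n\t"  "fsqrt %d[a], %d[a]\n\t"  ".endr\n\t"   /* long-latency FP chain A */
    "fcvtzs %[ia], %d[a]\n\t"                              /* ia = 1, but only once chain A resolves */
    ".rept " BLOCKED "\n\t"
    "ldr x9, [%[buf], %[ia], lsl #3]\n\t"                  /* loads stuck in the scheduler, waiting on ia */
    ".endr\n\t"
    ".rept 8\n\t"  "fsqrt %d[b], %d[b]\n\t"  ".endr\n\t"   /* independent FP chain B: does it still overlap? */
    "fcvtzs %[ib], %d[b]\n\t"
    ".rept " BLOCKED "\n\t"
    "ldr x9, [%[buf], %[ib], lsl #3]\n\t"
    ".endr\n\t"
    : [a] "+w"(a), [b] "+w"(b), [ia] "=&r"(ia), [ib] "=&r"(ib)
    : [buf] "r"(scratch)
    : "x9", "memory");

With this shape, once BLOCKED exceeds the scheduler plus dispatch queue (48 + 10 on Firestorm), the frontend stalls and the second FSQRT chain can no longer overlap the first.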

Fighting Stores with Square-Roots

The next idea was to use a chain of long-latency floating point operations instead. Surprisingly, this produces the result that about 329 loads or stores can run between the floating point operations, without forcing the two long-latency chains to run completely serially. I assume this means that load and store queue entries are being released, and reused, before the first long-latency chain has retired, and we’re hitting another limit. Mixing loads and stores confirms it’s the same limit. (It’s not the topic of this post, but I’ve explored it a bit and called it the “coalescing retire queue”. I believe it is their ROB implementation.)
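
The kernel for this variant might look like the following (a sketch only, reusing the loop and timing from the earlier example: a and b are doubles, scratch is a small buffer, and FILLERS is a quoted count):

__asm__ volatile(
    ".rept 8\n\t"  "fsqrt %d[a], %d[a]\n\t"  ".endr\n\t"   /* long-latency FP chain A */
    ".rept " FILLERS "\n\t"
    "str xzr, [%[buf]]\n\t"                                /* independent filler stores */
    ".endr\n\t"
    ".rept 8\n\t"  "fsqrt %d[b], %d[b]\n\t"  ".endr\n\t"   /* long-latency FP chain B */
    ".rept " FILLERS "\n\t"
    "str xzr, [%[buf]]\n\t"
    ".endr\n\t"
    : [a] "+w"(a), [b] "+w"(b)
    : [buf] "r"(scratch)
    : "memory");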

So at this point I’m guessing that loads and stores can complete (free their load/store queue entries, but not their retire queue entries) once they are non-speculative. I believe this completion is in-order relative to other loads and stores. The approach I used to test this was to have a single, initial load or store operation, with its address depending on the result of the first long-latency operation.

However, for this writeup, I got the same result by instead adding a branch, dependent on the result of the first long-latency operation. This ensures the loads and stores cannot become non-speculative, so they keep their queue entries. In this test we see we can run 188 loads or 118 stores without forcing the two long-latency chains to run completely serially. This was initially pretty confusing, since we believe we only have a 130 entry load queue. So, where did the extra 58 entries come from?
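
A sketch of the branch variant (same caveats as the earlier sketches; the FCMP/B.NE pair is never taken, but it cannot resolve until its chain finishes, so the loads after it stay speculative):

__asm__ volatile(
    ".rept 8\n\t"  "fsqrt %d[a], %d[a]\n\t"  ".endr\n\t"   /* long-latency FP chain A */
    "fcmp %d[a], %d[a]\n\t"                                /* flags depend on chain A */
    "b.ne 1f\n\t"                                          /* never taken, but unresolved until chain A finishes */
    ".rept " FILLERS "\n\t"
    "ldr x9, [%[buf]]\n\t"                                 /* these loads stay speculative */
    ".endr\n\t"
    "1:\n\t"
    ".rept 8\n\t"  "fsqrt %d[b], %d[b]\n\t"  ".endr\n\t"   /* long-latency FP chain B */
    "fcmp %d[b], %d[b]\n\t"
    "b.ne 2f\n\t"
    ".rept " FILLERS "\n\t"
    "ldr x9, [%[buf]]\n\t"
    ".endr\n\t"
    "2:\n\t"
    : [a] "+w"(a), [b] "+w"(b)
    : [buf] "r"(scratch)
    : "x9", "cc", "memory");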

The load/store scheduler has 48 entries, with a 10 entry dispatch queue. If the load/store scheduler and dispatch queue are full, the integer and floating point units can continue operating. But if we hit one more load/store, the machine stalls, as it has nowhere to put the instruction. This explains the 58 extra entries.

By this logic (subtracting 58 for the size of the scheduler and dispatch queue), the store queue has only 60 entries. So why did we think it had 47 more? Because if the 48 entry scheduler is almost full, but has one free entry, then a load can enter the scheduler, and then issue from the scheduler and execute (in parallel with the other long-latency load), but if the scheduler is completely full, it cannot.
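
Spelling out the arithmetic, using only the numbers quoted in this post:

#include <stdio.h>

int main(void) {
    int scheduler = 48, dispatch_queue = 10;                 /* load/store scheduler + its dispatch queue */
    int speculative_loads = 188, speculative_stores = 118;   /* measured with the dependent branch */

    int load_queue  = speculative_loads  - scheduler - dispatch_queue;   /* 188 - 58 = 130 */
    int store_queue = speculative_stores - scheduler - dispatch_queue;   /* 118 - 58 =  60 */

    /* The original load-vs-store measurement: 60 store queue entries, plus 47
       more stores parked in the scheduler (leaving one entry free so the final
       long-latency load can still issue), looked like a 107 entry store queue. */
    int misleading_result = store_queue + (scheduler - 1);               /* 60 + 47 = 107 */

    printf("load queue %d, store queue %d, old store figure %d\n",
           load_queue, store_queue, misleading_result);
    return 0;
}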

So those are my current numbers, 130 load queue entries and 60 store queue entries. The same logic works for Icestorm, where we see 30 load queue entries and 18 store queue entries (with an 18 entry scheduler and a 6 entry dispatch queue).

An Uncomfortable, Surprising Discovery

Update (2022-07-22): I’ve since had trouble reproducing the results in this section on an M1 Max. I suspect this is an issue with hardware or software revisions, but take this section with a grain of salt.

In writing this up for this post, I tried to reproduce the “107 entry store queue” result, but the result I got was 60 entries. This is both the best answer I could hope for (it’s the number I currently think is correct), and the worst possible outcome (I’m writing this up, so that people can reproduce my work, and I’m failing to reproduce my own results).

So what went wrong? Bisecting a little, I found this was caused by refactoring to use X29 as the base address for the second long-latency load (with a variable offset), and using X29 or SP as the base address for the store (with a constant offset). Changing either or both registers back to X3 (even though the values did not change) gave me 107 again.

Processors try to execute loads and stores out of order, which makes things fast, as long as no load reads stale data because an earlier (in program order) store to the same address hadn’t executed yet. When that happens it’s called a memory order violation, and the processor typically throws away all its work and starts again, which is very expensive. (Apple provides a performance counter for this. It’s called MEMORY_ORDER_VIOLATION and described as “Incorrect speculation between store and dependent load”.) Because this is so expensive, there are predictors that try to figure out when a load and a store might alias, and run them in order instead. You can read more about how Intel approaches this in Travis Downs’ Memory Disambiguation on Skylake writeup.
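
In code, the hazard is as mundane as this (an illustrative fragment, not taken from the test):

/* If the CPU executes the load of *b before the older store to *a, and a and b
 * turn out to point to the same location, the load has read stale data and the
 * pipeline is flushed: a memory order violation. The predictor's job is to
 * guess, before the addresses are known, whether b might equal a. */
int store_then_load(int *a, int *b) {
    *a = 42;        /* older store, address possibly resolved late */
    return *b;      /* younger load, possibly executed early       */
}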

X29 is typically used as the stack frame base pointer, and I suspect the memory dependency prediction has a special case for this, and makes the load wait until it knows the store doesn’t alias it. The theory is that if we have 60 speculative stores before the load, we can figure out all their addresses, and that the load can go ahead. But if we have 61, we can’t check the last one, so the load will wait.

The same code running on Icestorm also measures the 18 entry store queue.

I think this explanation makes sense, but it was a surprising reminder that there are still a lot of mysteries here, and it’d be good to verify it further. I put the code to reproduce this result in a gist if others want to investigate this.

The Data

So, here’s the data. To get a single number, I find the top of the jump (the first point at which it’s executing completely serially) and subtract one. This is the largest number of instructions that execute a measurable amount faster than completely serially. But however you pick, the results are close enough.
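
Concretely, extracting that number can be as simple as the following (a sketch; the array is a toy curve to show the shape, not real measurements):

#include <stdio.h>

/* cycles[n] = measured time with n loads/stores between the long-latency operations. */
static int largest_faster_than_serial(const double cycles[], int len) {
    double serial = cycles[len - 1];            /* plateau: completely serial        */
    for (int n = 0; n < len; n++)
        if (cycles[n] > 0.99 * serial)          /* top of the jump                   */
            return n - 1;                       /* largest measurably-faster count   */
    return len - 1;
}

int main(void) {
    double toy[] = {100, 100, 100, 100, 150, 200, 200, 200};   /* illustrative shape only */
    printf("%d\n", largest_faster_than_serial(toy, 8));        /* prints 4 */
    return 0;
}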

(Note that the dependent B.cc is chained using FCMP, as I think FMOV might interfere with the memory operations.)

Final Notes

The difference between measuring using the resource itself (loads+loads), and measuring using another resource (fsqrts+loads) is very clear in both graphs. The 58 instruction difference implies that when we do not have the resources to execute more loads, we can continue to move more loads into the scheduler. So I conclude that the resources are allocated late. Similarly, the ~330 limit we can hit implies that this resource can be freed early.

I do not see this kind of pattern when measuring physical register file sizes (e.g. comparing int+int vs int+float), so I believe they are neither allocated late, nor freed early. But there’s a lot of complexity I do not yet understand.

2 thoughts on “Apple M1: Load and Store Queue Measurements”

  1. Very nice work (as always!)

    I tracked the size of the int Dispatch buffers (not carefully, it just came up as a side effect of something else I was looking at) by exploiting the fact that you can fill Dispatch at 8/cycle, but it only drains at 6/cycle.
    The same may be true of FP (and flags) Dispatch, that you can fill them 8-wide but only drain them 4 or 3-wide, so you can force a stutter at the point where they fill up?

    This appears to be the patent for the early release of LSQ entries:
    https://patents.google.com/patent/US9535695B2/
    (Haven’t had a chance to read it carefully, but that seems to be the idea.)

    The hierarchical scheduling queues have evolved over time.
    We start with 2013 https://patents.google.com/patent/US9336003B2 which is the basic two level scheme, where the main point is load-balancing.
    This is augmented with 2015 https://patents.google.com/patent/US20170024205A1, non-shifting Scheduling Queues, which I will describe soon.
    The same technology that allowed the non-shifting Scheduling Queues allows 2016 https://patents.google.com/patent/US10452434B1, which is a cheap timestamp attached to the elements of the Dispatch Buffer so that it’s easy to ensure that it’s the oldest instructions that are released from the Dispatch Buffer to the Scheduling queues, not random instructions.
    Finally we get 2017 https://patents.google.com/patent/US10452434B1 which, under the right conditions (specifically nothing can be found to execute from the primary Scheduling Queue) looks into the Dispatch Buffer to see if there’s anything there that can be sent for execution.

    To understand the point of the 2015 patent, recall that the usual way a Scheduling Queue is implemented is exactly as a physical queue. Using a physical queue means that temporal ordering is trivially preserved — it’s easy to see which instructions are oldest (at the head of the queue) vs youngest (at the tail of the queue). But a queue uses a lot of power because it has to be “collapsing”; that is, every time an instruction is scheduled from the middle of the queue, you have to move all the other values up to remove that vacancy and open up space at the end of the queue.
    Now you can imagine various possible ways to reduce the cost (leave the holes as long as possible, ie only collapse when you have to; maybe use indirection, like a linked list rather than a linear sequence of storage; …). The Apple solution is to use an age matrix, just like the 2019 History File patent ( https://patents.google.com/patent/US20210064376A1/ ), a structure that is large (by the standards of tech many years ago, but transistors are cheap!) but low power and has low complexity wiring. The age matrix tracks the temporal ordering of the instructions while not requiring them to continually be moved from one slot to another.

    Apple is ABSOLUTELY using SP for special prediction purposes.
    I don’t know if this is the particular case you hit, but it IS very interesting:
    If there’s any one thing Apple loves in their CPUs, it is clever tricks that do work at Map/Rename instead of Execute.
    We see this with zero-cycle MOV (copy the RegisterID), zero-cycle MOV# (implemented in a few different ways, M1 version appears to be in transition from two previous versions, and uses about 32 dedicated I-registers), and zero-cycle load (track whether the location being loaded was recently stored to and whether the store’s source register is still available, and if so, just remap that source register to the load destination).

    Once you appreciate the idea, the details are all in how you track earlier stores vs later loads. The original version did the tracking via patterns of (rB+rI or rB+imm)
    Here’s the patent:
    https://patents.google.com/patent/EP2674856A2/

    A later version specializes this idea specifically to stack behavior (essentially saving/restoring non-volatiles across procedure boundaries is one use case; using the stack as extra register storage slots because you’ve run out of logical registers is a second use case).
    https://patents.google.com/patent/US10838729B1
    This second version even zero-cycles LDP!

    So, yeah, work with the specialized predictors, not against them 🙂
    Which means using the stack pointer as a stack pointer, not as yet another register!
    (On the flip side, like I said, under the right conditions, you can store to, then zero-cycle-load from the stack, so the penalty for using a stack temporary is close to non-existent.)

    I believe you are wrong about int register allocation. I see evidence of late int allocation all over the place (though not early release).
    Run the following in a basic loop.
    Delay block (I used 7 chained fsqrt.S, to get 70 cycles)
    370 ADD x0, x5, x5
    N (variable) NOPs

    Compare to
    Delay block (I used 7 chained fsqrt.S, to get 70 cycles)
    380 ADD x0, x5, x5
    N (variable) NOPs

    VERY different!
    The first case blocks with the ADDs all piled up in the int scheduling queues (stalled because they’ve run out of registers) but OUT OF THE WAY OF THE NOPS. So the NOPs can slide past them and keep going. ADDs stall, but not the entire machine.
    The second case has the ADDs spill out of the scheduling queues all the way up to occupying some Rename slots. NOPs can’t slide past them and the machine stalls completely.

    (You can replace NOPs with MOV xn, xm if you like. And probably also with fp instructions. You just need something that keeps executing when the INT pipes are stalled.)
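
    For concreteness, the loop I mean is roughly the following (inline asm so the instruction counts are exact; the register choices, the NOP count, and the iteration count are arbitrary, and you’d change 370 to 380 to compare the two cases):

    #include <stdint.h>
    int main(void) {
        float del = 1.0f;             /* value for the fsqrt.S delay chain */
        uint64_t src = 1, dst = 0;
        for (int i = 0; i < 1000000; i++) {
            __asm__ volatile(
                ".rept 7\n\t"   "fsqrt %s[del], %s[del]\n\t"      ".endr\n\t"  /* ~70 cycle delay */
                ".rept 370\n\t" "add %[dst], %[src], %[src]\n\t"  ".endr\n\t"  /* ADDs each wanting a fresh int register */
                ".rept 512\n\t" "nop\n\t"                         ".endr\n\t"  /* do these slide past, or does the machine stall? */
                : [del] "+w"(del), [dst] "=&r"(dst)
                : [src] "r"(src));
        }
        return (int)dst;
    }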

    I don’t know how to show graphs in this comment, but the effect is very obvious, and once you know to expect it you see it everywhere.
    I haven’t tested it for flags, but I would expect the exact same thing.

    The ROB needs to be thought of as a collection of items that each do a job, not as a long queue. One of those jobs that needs to be done is unwinding register remapping in the event of a misspeculation that’s not associated with a checkpoint.
    This is done via the History File (about 630 entries) which takes a new entry every time the mapping table changes. (So on allocation of new destination registers, but also on all the clever zero-cycle MOV/Immediate/Load tricks. And 2 HF slots allocated when two register mappings change, eg ADDS, or LDP).
    A different task is tracking “Failable” instructions (which seem to be speculated branches and load/stores [because of LS speculation]). This structure, I suspect, is your “Coalesced ROB”.

    I believe that dependencies are not tracked in the OoO machinery via physical registerID (the standard Tomasulo scheme) but by the ROB ID of the instruction producing a particular result. Of course saying ROB ID there is somewhat vague — more precisely the slot in the History File, I would guess. You have to go beyond Tomasulo if you want late register allocation (or virtual load/store queue). One way to do it is Virtual Registers; but using the ROB is another direction to go (which is nice because it uses much the same mechanism for register dependency management and LS dependency management); and the language Apple uses throughout their patents suggests to me the “ROB” is acting as the central locus of this dependency tracking.

    One final warning (because I wasted way too much time on this!)
    I haven’t much probed the details of the ROB yet. I have validated your discovery that NOPs can be flash-retired at 56 (7*8)/cycle. I have suggestions, but wouldn’t yet trust them much, that registers are freed (or perhaps that HF entries are freed?) at 8/cycle. My guess is that NOPs (and NOP variants, like MOV xn, xn) behave as they do as a kind of spandrel, not a design feature but something that falls out of the rest of the design, like every “failable” instruction gets the start of a new ROB “row” that can hold up to 8 instructions (cf classical POWER-style grouping), and non-failables are packed into the other 7 of that row. So ~330 rows, 330*7=2310 NOPs can be held in the ROB — not because so many ROBs are desired, but because ~330 ROB rows are desired (lots of failables!), and each such row can hold up to 7 non-failables.

    I’m guessing Retire of Failables can happen fast (if the failable did not misspeculate, the whole row can go, and up to 8 of these per cycle).
    Somewhat orthogonal to that is the Completion of History File entries. I’m guessing these are also processed 8/cycle. Do they happen in lockstep with the “primary” ROB (ie the rows of failables), so that ROB retire is blocked waiting for HF entries to complete? Or does the HF lag the ROB, so that the ROB marks HF entries in bulk as “safe” (up to 56 in a cycle), then 8/cycle, entries are cleared from the HF and treated appropriately (physical registers freed, entries in checkpoints freed, etc). My guess would be the second, but that’s still in the air.

    Anyway that’s not my point, my point is, if you want to investigate branches (which is the obvious thing to do when you start investigating the ROB) you will hit (at least) two unexpected features.
    First is that non-speculated branches are “executed” at Decode time (by sending the address resteer to fetch, along with a desultory ROB allocation, but not a dispatch down the int pipeline), because of course they are! That’s Apple’s style — anything you can do early you do!
    You can test this by running code of the form (B+4 6x ADD), which will run 7 wide per cycle; the B does not use an int pipe.
    (BTW the B+4 is NOT treated as a NOP. eg you can’t run 8 (or even 2) of them in a single cycle. I’m surprised Apple let this one slip, when they caught MOV xn, xn as a NOP. But there you are — it certainly doesn’t matter as much as the catastrophic treatment of xzr.)

    Second is the following:
    Obviously Apple has a remarkably accurate branch predictor (and requires it, with this degree of OoO).
    • One aspect of keeping a branch predictor accurate, once you’re achieving really high accuracy, is that you must not pollute it with inaccurate information. This means that you must not integrate branch information into the bulk of the predictor until that information is non-speculative.
    • Which means that you’re going to augment your main predictor with side-storage that can hold tentative branch information, and, as that branch information is validated (as the generating branch retires through the ROB) this information is integrated into the primary predictors.
    • The second salient fact is that this integration is stunningly slow! One integration/three cycles! I always skipped over this particular detail when reading the TAGE papers, so I don’t know how intrinsic this slow integration is.
    But the consequences are very obvious…

    Apple tries to limit the fallout in a few different ways.
    The integration of (validated) speculative branch material happens after retire, so the rest of the machine does not depend on it.
    The pool of storage for this material is fairly large. As best I can tell (not 100% sure, but fairly confident) the storage is essentially based on PC%8 (strip the low 4 bits). So we have 8 banks of storage, and each bank can hold about 88 entries.
    Consequences are
    – when the bank that would hold an entry finally fills up, the machine stalls: processing of branches drops from 1 or 2 per cycle (depending on taken vs non-taken status) to one every three cycles.
    – best case you can have ~88*8=704 branches in tentative state
    – you get this best case with a linear stream of B+4
    – WORST CASE (if every branch has the same PC mod 32 [ie 8*sizeof(instruction)], so something like (B+32, 7x NOP)) is that the machine stalls after 88 such branches (see the sketch after this list)
    – enjoy hours of fun exploring as you change number of NOPs, change NOPs to other things, and see constant random results because THE ONLY THING THAT MATTERS is how you are changing the relative PCs of the successive branches 🙂
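
    (A sketch of that worst case, in case it helps anyone reproduce it: each block below is eight instructions, ie 32 bytes, so every branch lands on the same PC mod 32. The repeat count is arbitrary; you’d drop something like this inside a timed loop and vary the block length.)

    __asm__ volatile(
        ".rept 100\n\t"
        "b 1f\n\t"                            /* B+32: branch over the next seven instructions */
        ".rept 7\n\t" "nop\n\t" ".endr\n\t"
        "1:\n\t"
        ".endr\n\t");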

    But, seriously, going forward this is going to be a big problem. Suppose you want to sustain 4-wide execution over thousands of cycles. That means you can’t have a “relevant” branch (which is certainly all taken branches, speculative or not, probably also all misspeculated branches) more frequently than 1 in 12 instructions on average (4 instructions per cycle times one integration every three cycles), which is getting pretty tight. Yet more reason to use predication and CSEL wherever you can.
    How to fix it?
    Honestly I don’t know.
    One partial solution is to use IBM’s trick of treating a B.cc+8 as a predicated instruction, so it’s not seen as a branch, not handled by the prediction machinery. Hell, might even be worth doing that for double instructions so also B.cc+12.
    Another solution might be that, even if there are basic reasons why integrating new data into TAGE is intrinsically slow and sequential, it could be pipelined? Or perhaps a pre-processor could see patterns or structure in the material that’s going to be integrated, so that a single integration still costs 3 (or even 4) cycles, but the unit that is integrated is frequently two or three branch results at a time rather than a single result at a time?

    However the practical fact to note, going forward, is that any machine probing that starts to use branches has to be very careful about this fact. You can hit delays you don’t expect because they’re happening so far out from the “primary” pipeline, and they can change in what looks like random ways as the result of changing the PCs of the branches…

  2. Chester makes the very interesting point that the slowdowns I have been ascribing to integrating speculative history information into the non-speculative history store could alternatively be ascribed to the system having to switch from an L1 to an L2 BTB.

    It’s certainly possible in theory, but I think the numbers are too small. Of course IBM z/ is a very different type of machine, but z14 (a 2018 machine running at 5.5GHz) has an L1 BTB of 8,192 entries (doubled for z15) and an L2 BTB of 128K entries. This shows what’s technically feasible if you are willing to spend the transistors, and suggests where Apple might be placed. Compare that with the situation we are seeing of branch throughput dropping at around 2000 branches.

    Nevertheless this is a possibility worth bearing in mind as one possible part of an explanation; and it might be worth running the test over even larger numbers of successive branches (16K up to even 256 or 512k successive branches!) to see if we detect any further falling off in back-to-back branch throughput.
