@@ -318,65 +318,104 @@ merging must be done as (A+B)+C or A+(B+C) instead.
So merging is always done on two consecutive runs at a time, and in-place,
although this may require some temp memory (more on that later).

- When a run is identified, its base address and length are pushed on a stack
- in the MergeState struct. merge_collapse() is then called to potentially
- merge runs on that stack. We would like to delay merging as long as possible
- in order to exploit patterns that may come up later, but we like even more to
- do merging as soon as possible to exploit that the run just found is still
- high in the memory hierarchy. We also can't delay merging "too long" because
- it consumes memory to remember the runs that are still unmerged, and the
- stack has a fixed size.
-
- What turned out to be a good compromise maintains two invariants on the
- stack entries, where A, B and C are the lengths of the three rightmost not-yet
- merged slices:
-
- 1. A > B+C
- 2. B > C
-
- Note that, by induction, #2 implies the lengths of pending runs form a
- decreasing sequence. #1 implies that, reading the lengths right to left,
- the pending-run lengths grow at least as fast as the Fibonacci numbers.
- Therefore the stack can never grow larger than about log_base_phi(N) entries,
- where phi = (1+sqrt(5))/2 ~= 1.618. Thus a small # of stack slots suffice
- for very large arrays.
-
- If A <= B+C, the smaller of A and C is merged with B (ties favor C, for the
- freshness-in-cache reason), and the new run replaces the A,B or B,C entries;
- e.g., if the last 3 entries are
-
-     A:30  B:20  C:10
-
- then B is merged with C, leaving
-
-     A:30  BC:30
-
- on the stack. Or if they were
-
-     A:500  B:400  C:1000
-
- then A is merged with B, leaving
-
-     AB:900  C:1000
-
- on the stack.
-
- In both examples, the stack configuration after the merge still violates
- invariant #2, and merge_collapse() goes on to continue merging runs until
- both invariants are satisfied. As an extreme case, suppose we didn't do the
- minrun gimmick, and natural runs were of lengths 128, 64, 32, 16, 8, 4, 2,
- and 2. Nothing would get merged until the final 2 was seen, and that would
- trigger 7 perfectly balanced merges.
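-
- Concretely, the invariant-restoring loop can be pictured like the sketch
- below. This is an illustrative sketch only, in C; the Run entries and the
- merge_at() helper are made-up names for this explanation, not the actual
- listobject.c spellings.
-
-     /* stack[0..n-1] holds the pending runs, stack[n-1] the most recent. */
-     while (n > 1) {
-         Py_ssize_t i = n - 2;
-         if (i > 0 && stack[i-1].len <= stack[i].len + stack[i+1].len) {
-             /* Invariant #1 (A > B+C) is violated: merge B with the
-                smaller of A and C, ties favoring C. */
-             if (stack[i-1].len < stack[i+1].len)
-                 --i;
-             merge_at(i);          /* merge stack[i] with stack[i+1] */
-         }
-         else if (stack[i].len <= stack[i+1].len) {
-             /* Invariant #2 (B > C) is violated. */
-             merge_at(i);
-         }
-         else
-             break;                /* both invariants hold */
-     }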
-
- The thrust of these rules when they trigger merging is to balance the run
- lengths as closely as possible, while keeping a low bound on the number of
- runs we have to remember. This is maximally effective for random data,
- where all runs are likely to be of (artificially forced) length minrun, and
- then we get a sequence of perfectly balanced merges (with, perhaps, some
- oddballs at the end).
-
- OTOH, one reason this sort is so good for partly ordered data has to do
- with wildly unbalanced run lengths.
+ When a run is identified, its length is passed to found_new_run() to
+ potentially merge runs on a stack of pending runs. We would like to delay
+ merging as long as possible in order to exploit patterns that may come up
+ later, but we like even more to do merging as soon as possible to exploit
+ that the run just found is still high in the memory hierarchy. We also can't
+ delay merging "too long" because it consumes memory to remember the runs that
+ are still unmerged, and the stack has a fixed size.
+
+ The original version of this code used the first thing I made up that didn't
+ obviously suck ;-) It was loosely based on invariants involving the Fibonacci
+ sequence.
+
+ It worked OK, but it was hard to reason about, and was subtle enough that the
+ intended invariants weren't actually preserved. Researchers discovered that
+ when trying to complete a computer-generated correctness proof. That was
+ easily-enough repaired, but the discovery spurred quite a bit of academic
+ interest in truly good ways to manage incremental merging on the fly.
+
+ At least a dozen different approaches were developed, some provably having
+ near-optimal worst case behavior with respect to the entropy of the
+ distribution of run lengths. Some details can be found in bpo-34561.
+
+ The code now uses the "powersort" merge strategy from:
+
+     "Nearly-Optimal Mergesorts: Fast, Practical Sorting Methods
+      That Optimally Adapt to Existing Runs"
+     J. Ian Munro and Sebastian Wild
+
+ The code is pretty simple, but the justification is quite involved, as it's
+ based on fast approximations to optimal binary search trees, which are
+ substantial topics on their own.
+
+ Here we'll just cover some pragmatic details:
+
+ The `powerloop()` function computes a run's "power". Say two adjacent runs
+ begin at index s1. The first run has length n1, and the second run (starting
+ at index s1+n1, called "s2" below) has length n2. The list has total length n.
+ The "power" of the first run is a small integer, the depth of the node
+ connecting the two runs in an ideal binary merge tree, where power 1 is the
+ root node, and the power increases by 1 for each level deeper in the tree.
+
+ The power is the least integer L such that the "midpoint interval" contains
+ a rational number of the form J/2**L. The midpoint interval is the semi-
+ closed interval:
+
+     ((s1 + n1/2)/n, (s2 + n2/2)/n]
+
+ Yes, that's brain-busting at first ;-) Concretely, if (s1 + n1/2)/n and
+ (s2 + n2/2)/n are computed to infinite precision in binary, the power L is
+ the first position at which the 2**-L bit differs between the expansions.
+ Since the left end of the interval is less than the right end, the first
+ differing bit must be a 0 bit in the left quotient and a 1 bit in the right
+ quotient.
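+
+ A tiny worked example may help (the numbers here are made up for
+ illustration, not taken from the paper or the code). Take n=16, a first run
+ of length n1=4 starting at s1=0, and a second run of length n2=4 starting at
+ s2=4. The midpoint interval is
+
+     ((0 + 4/2)/16, (4 + 4/2)/16] = (0.125, 0.375]
+
+ In binary, 0.125 is 0.001 and 0.375 is 0.011: the expansions first differ at
+ the 2**-2 bit (0 on the left, 1 on the right), so the power is 2.
+ Equivalently, 2 is the least L with some J/2**L in the interval, since
+ 1/4 = 0.25 lies in (0.125, 0.375] while no multiple of 1/2 does.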
+
+ `powerloop()` emulates these divisions, 1 bit at a time, using comparisons,
+ subtractions, and shifts in a loop.
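+
+ In rough C, that loop looks like the sketch below. This is a sketch in the
+ spirit of the real powerloop(), not a verbatim copy of listobject.c (asserts
+ and overflow commentary are omitted); it works with doubled midpoints so
+ everything stays integral:
+
+     static int
+     powerloop(Py_ssize_t s1, Py_ssize_t n1, Py_ssize_t n2, Py_ssize_t n)
+     {
+         int result = 0;
+         Py_ssize_t a = 2*s1 + n1;        /* 2 * (s1 + n1/2) */
+         Py_ssize_t b = a + n1 + n2;      /* 2 * (s2 + n2/2) */
+         /* Emulate the binary expansions of a/(2*n) and b/(2*n), one bit
+            per loop iteration, until the bits differ. */
+         for (;;) {
+             ++result;
+             if (a >= n) {                /* both bits are 1 */
+                 a -= n;
+                 b -= n;
+             }
+             else if (b >= n)             /* a's bit is 0, b's is 1: done */
+                 break;
+             /* else both bits are 0 */
+             a <<= 1;                     /* move on to the next bit */
+             b <<= 1;
+         }
+         return result;
+     }
+
+ The loop produces one bit per iteration, so it runs at most about log2(n)
+ times, and usually far fewer.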
+
+ You'll notice the paper uses an O(1) method instead, but that relies on two
+ things we don't have:
+
+ - An O(1) "count leading zeroes" primitive. We can find such a thing as a C
+   extension on most platforms, but not all, and there's no uniform spelling
+   on the platforms that support it.
+
+ - Integer division on an integer type twice as wide as needed to hold the
+   list length. But the latter is Py_ssize_t for us, and is typically the
+   widest native signed integer type the platform supports.
+
+ But since runs in our algorithm are almost never very short, the once-per-run
+ overhead of `powerloop()` seems lost in the noise.
+
+ Detail: why is Py_ssize_t "wide enough" in `powerloop()`? We do, after all,
+ shift integers of that width left by 1. How do we know that won't spill into
+ the sign bit? The trick is that we have some slop. `n` (the total list
+ length) is the number of list elements, which is at most 4 times (on a 32-bit
+ box, with 4-byte pointers) smaller than the largest size_t. So at least the
+ leading two bits of the integers we're using are clear.
+
+ Since we can't compute a run's power before seeing the run that follows it,
+ the most-recently identified run is never merged by `found_new_run()`.
+ Instead a new run is only used to compute the 2nd-most-recent run's power.
+ Then adjacent runs are merged so long as their saved power (tree depth) is
+ greater than that newly computed power. When found_new_run() returns, only
+ then is the new run pushed on to the stack of pending runs.
+
+ A key invariant is that powers on the run stack are strictly decreasing
+ (starting from the run at the top of the stack).
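+
+ Putting that together, found_new_run()'s flow is roughly the sketch below
+ (the pending-run bookkeeping names here -- pending[], num_pending,
+ merge_at() -- are simplified stand-ins, not the exact listobject.c fields):
+
+     /* A new run of length n2 has just been identified, not yet pushed. */
+     if (num_pending) {
+         /* The run on top of the stack is now the 2nd-most-recent run;
+            the new run lets us compute its power. */
+         Py_ssize_t s1 = pending[num_pending - 1].start;
+         Py_ssize_t n1 = pending[num_pending - 1].len;
+         int power = powerloop(s1, n1, n2, n);   /* n = total list length */
+         /* Merge while the run below the top is deeper in the merge tree
+            than the newly computed power. */
+         while (num_pending > 1 && pending[num_pending - 2].power > power)
+             merge_at(num_pending - 2);  /* merges the top two pending runs,
+                                            shrinking num_pending by 1 */
+         pending[num_pending - 1].power = power;
+     }
+     /* Only after this does the caller push the new run (its own power is
+        not yet known). */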
+
+ Note that even powersort's strategy isn't always truly optimal. It can't be.
+ Computing an optimal merge sequence can be done in time quadratic in the
+ number of runs, which is very much slower, and also requires finding &
+ remembering _all_ the runs' lengths (of which there may be billions) in
+ advance. It's remarkable, though, how close to optimal this strategy gets.
+
+ Curious factoid: of all the alternatives I've seen in the literature,
+ powersort's is the only one that's always truly optimal for a collection of 3
+ run lengths (for three lengths A B C, it's always optimal to first merge the
+ shorter of A and C with B).


Merge Memory