Commit 5cb4c67
bpo-34561: Switch to Munro & Wild "powersort" merge strategy. (#28108)
For list.sort(), replace our ad hoc merge ordering strategy with the principled, elegant, and provably near-optimal one from Munro and Wild's "powersort".
1 parent 19871fc commit 5cb4c67

File tree

3 files changed: +178 −92 lines changed
@@ -0,0 +1 @@
+List sorting now uses the merge-ordering strategy from Munro and Wild's ``powersort()``. Unlike the former strategy, this is provably near-optimal in the entropy of the distribution of run lengths. Most uses of ``list.sort()`` probably won't see a significant time difference, but may see significant improvements in cases where the former strategy was exceptionally poor. However, as these are all fast linear-time approximations to a problem that's inherently at best quadratic-time to solve truly optimally, it's also possible to contrive cases where the former strategy did better.

Objects/listobject.c: +79 −33
@@ -1139,12 +1139,11 @@ sortslice_advance(sortslice *slice, Py_ssize_t n)
         if (k)

 /* The maximum number of entries in a MergeState's pending-runs stack.
- * This is enough to sort arrays of size up to about
- *     32 * phi ** MAX_MERGE_PENDING
- * where phi ~= 1.618.  85 is ridiculously large enough, good for an array
- * with 2**64 elements.
+ * For a list with n elements, this needs at most floor(log2(n)) + 1 entries
+ * even if we didn't force runs to a minimal length.  So the number of bits
+ * in a Py_ssize_t is plenty large enough for all cases.
  */
-#define MAX_MERGE_PENDING 85
+#define MAX_MERGE_PENDING (SIZEOF_SIZE_T * 8)

 /* When we get into galloping mode, we stay there until both runs win less
  * often than MIN_GALLOP consecutive times.  See listsort.txt for more info.
@@ -1159,7 +1158,8 @@ sortslice_advance(sortslice *slice, Py_ssize_t n)
  */
 struct s_slice {
     sortslice base;
-    Py_ssize_t len;
+    Py_ssize_t len;   /* length of run */
+    int power;        /* node "level" for powersort merge strategy */
 };

 typedef struct s_MergeState MergeState;
@@ -1170,6 +1170,9 @@ struct s_MergeState {
      */
     Py_ssize_t min_gallop;

+    Py_ssize_t listlen;     /* len(input_list) - read only */
+    PyObject **basekeys;    /* base address of keys array - read only */
+
     /* 'a' is temp storage to help with merges.  It contains room for
      * alloced entries.
      */
@@ -1513,7 +1516,8 @@ gallop_right(MergeState *ms, PyObject *key, PyObject **a, Py_ssize_t n, Py_ssize

 /* Conceptually a MergeState's constructor. */
 static void
-merge_init(MergeState *ms, Py_ssize_t list_size, int has_keyfunc)
+merge_init(MergeState *ms, Py_ssize_t list_size, int has_keyfunc,
+           sortslice *lo)
 {
     assert(ms != NULL);
     if (has_keyfunc) {
@@ -1538,6 +1542,8 @@ merge_init(MergeState *ms, Py_ssize_t list_size, int has_keyfunc)
     ms->a.keys = ms->temparray;
     ms->n = 0;
     ms->min_gallop = MIN_GALLOP;
+    ms->listlen = list_size;
+    ms->basekeys = lo->keys;
 }

 /* Free all the temp memory owned by the MergeState.  This must be called
/* Free all the temp memory owned by the MergeState. This must be called
@@ -1920,37 +1926,74 @@ merge_at(MergeState *ms, Py_ssize_t i)
     return merge_hi(ms, ssa, na, ssb, nb);
 }

-/* Examine the stack of runs waiting to be merged, merging adjacent runs
- * until the stack invariants are re-established:
- *
- * 1. len[-3] > len[-2] + len[-1]
- * 2. len[-2] > len[-1]
+/* Two adjacent runs begin at index s1.  The first run has length n1, and
+ * the second run (starting at index s1+n1) has length n2.  The list has total
+ * length n.
+ * Compute the "power" of the first run.  See listsort.txt for details.
+ */
+static int
+powerloop(Py_ssize_t s1, Py_ssize_t n1, Py_ssize_t n2, Py_ssize_t n)
+{
+    int result = 0;
+    assert(s1 >= 0);
+    assert(n1 > 0 && n2 > 0);
+    assert(s1 + n1 + n2 <= n);
+    /* midpoints a and b:
+     * a = s1 + n1/2
+     * b = s1 + n1 + n2/2 = a + (n1 + n2)/2
+     *
+     * Those may not be integers, though, because of the "/2".  So we work
+     * with 2*a and 2*b instead, which are necessarily integers.  It makes no
+     * difference to the outcome, since the bits in the expansion of (2*i)/n
+     * are merely shifted one position from those of i/n.
+     */
+    Py_ssize_t a = 2 * s1 + n1;  /* 2*a */
+    Py_ssize_t b = a + n1 + n2;  /* 2*b */
+    /* Emulate a/n and b/n one bit at a time, until the bits differ. */
+    for (;;) {
+        ++result;
+        if (a >= n) {  /* both quotient bits are 1 */
+            assert(b >= a);
+            a -= n;
+            b -= n;
+        }
+        else if (b >= n) {  /* a/n bit is 0, b/n bit is 1 */
+            break;
+        }  /* else both quotient bits are 0 */
+        assert(a < b && b < n);
+        a <<= 1;
+        b <<= 1;
+    }
+    return result;
+}
+
+/* The next run has been identified, of length n2.
+ * If there's already a run on the stack, apply the "powersort" merge strategy:
+ * compute the topmost run's "power" (depth in a conceptual binary merge tree)
+ * and merge adjacent runs on the stack with greater power.  See listsort.txt
+ * for more info.
  *
- * See listsort.txt for more info.
+ * It's the caller's responsibility to push the new run on the stack when this
+ * returns.
  *
  * Returns 0 on success, -1 on error.
  */
 static int
-merge_collapse(MergeState *ms)
+found_new_run(MergeState *ms, Py_ssize_t n2)
 {
-    struct s_slice *p = ms->pending;
-
     assert(ms);
-    while (ms->n > 1) {
-        Py_ssize_t n = ms->n - 2;
-        if ((n > 0 && p[n-1].len <= p[n].len + p[n+1].len) ||
-            (n > 1 && p[n-2].len <= p[n-1].len + p[n].len)) {
-            if (p[n-1].len < p[n+1].len)
-                --n;
-            if (merge_at(ms, n) < 0)
+    if (ms->n) {
+        assert(ms->n > 0);
+        struct s_slice *p = ms->pending;
+        Py_ssize_t s1 = p[ms->n - 1].base.keys - ms->basekeys; /* start index */
+        Py_ssize_t n1 = p[ms->n - 1].len;
+        int power = powerloop(s1, n1, n2, ms->listlen);
+        while (ms->n > 1 && p[ms->n - 2].power > power) {
+            if (merge_at(ms, ms->n - 2) < 0)
                 return -1;
         }
-        else if (p[n].len <= p[n+1].len) {
-            if (merge_at(ms, n) < 0)
-                return -1;
-        }
-        else
-            break;
+        assert(ms->n < 2 || p[ms->n - 2].power < power);
+        p[ms->n - 1].power = power;
     }
     return 0;
 }
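For readers following along, the bit-by-bit division in powerloop() can be sketched in Python. This is purely illustrative (the helper names are hypothetical, not part of CPython); `Fraction` is used only to cross-check the loop against the midpoint-interval definition given in listsort.txt:

```python
from fractions import Fraction

def powerloop(s1, n1, n2, n):
    """Mirror of the C powerloop(): power of the boundary between two
    adjacent runs.  The first run starts at index s1 with length n1; the
    second (starting at s1 + n1) has length n2; n is the total list length.
    """
    a = 2 * s1 + n1          # twice the first run's midpoint
    b = a + n1 + n2          # twice the second run's midpoint
    result = 0
    while True:              # extract quotient bits until they differ
        result += 1
        if a >= n:           # both quotient bits are 1
            a -= n
            b -= n
        elif b >= n:         # a/n bit is 0, b/n bit is 1: they differ
            break
        a <<= 1
        b <<= 1
    return result

def power_by_interval(s1, n1, n2, n):
    """Power straight from the definition: the least L such that the midpoint
    interval ((s1 + n1/2)/n, (s1 + n1 + n2/2)/n] contains some J/2**L.
    """
    lo = Fraction(2 * s1 + n1, 2 * n)
    hi = Fraction(2 * (s1 + n1) + n2, 2 * n)
    L = 1
    while True:
        j = lo * 2 ** L
        j = j.numerator // j.denominator + 1   # least J with J/2**L > lo
        if Fraction(j, 2 ** L) <= hi:
            return L
        L += 1
```

Brute-forcing small cases suggests the two definitions coincide, which is exactly the "first differing bit" argument in listsort.txt.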
@@ -2357,7 +2400,7 @@ list_sort_impl(PyListObject *self, PyObject *keyfunc, int reverse)
     }
     /* End of pre-sort check: ms is now set properly! */

-    merge_init(&ms, saved_ob_size, keys != NULL);
+    merge_init(&ms, saved_ob_size, keys != NULL, &lo);

     nremaining = saved_ob_size;
     if (nremaining < 2)
@@ -2393,13 +2436,16 @@ list_sort_impl(PyListObject *self, PyObject *keyfunc, int reverse)
                 goto fail;
             n = force;
         }
-        /* Push run onto pending-runs stack, and maybe merge. */
+        /* Maybe merge pending runs. */
+        assert(ms.n == 0 || ms.pending[ms.n - 1].base.keys +
+                            ms.pending[ms.n - 1].len == lo.keys);
+        if (found_new_run(&ms, n) < 0)
+            goto fail;
+        /* Push new run on stack. */
         assert(ms.n < MAX_MERGE_PENDING);
         ms.pending[ms.n].base = lo;
         ms.pending[ms.n].len = n;
         ++ms.n;
-        if (merge_collapse(&ms) < 0)
-            goto fail;
         /* Advance to find next run. */
         sortslice_advance(&lo, n);
         nremaining -= n;
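To see how found_new_run() drives merging, here is a small Python model (purely illustrative; the function names are hypothetical). Each new run's length fixes the power of the boundary behind it, and stack entries with greater saved power are merged before the new run is pushed; the final collapse stands in for CPython's merge_force_collapse():

```python
def powerloop(s1, n1, n2, n):
    # Repeated here so the sketch is self-contained; mirrors the C code.
    a = 2 * s1 + n1
    b = a + n1 + n2
    result = 0
    while True:
        result += 1
        if a >= n:
            a -= n
            b -= n
        elif b >= n:
            break
        a <<= 1
        b <<= 1
    return result

def simulate_powersort(run_lengths):
    """Feed run lengths through the powersort merge strategy.

    Returns the (len1, len2) pairs merged, in merge order.  Stack entries
    are [start_index, length, power]; as in the C code, powers on the
    stack stay strictly decreasing from bottom to top.
    """
    n = sum(run_lengths)
    stack = []
    merges = []

    def merge_top_two():
        len1, len2 = stack[-2][1], stack[-1][1]
        merges.append((len1, len2))
        stack[-2][1] = len1 + len2      # merged run keeps the lower slot
        del stack[-1]

    pos = 0
    for n2 in run_lengths:
        if stack:
            s1, n1 = stack[-1][0], stack[-1][1]
            power = powerloop(s1, n1, n2, n)
            while len(stack) > 1 and stack[-2][2] > power:
                merge_top_two()
            # key invariant: remaining powers are strictly decreasing
            assert len(stack) < 2 or stack[-2][2] < power
            stack[-1][2] = power
        stack.append([pos, n2, None])
        pos += n2
    while len(stack) > 1:   # final collapse, as at the end of list.sort()
        merge_top_two()
    return merges
```

Note that, as the comments in the diff say, the newest run is never merged here; it only determines the power of the run behind it.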

Objects/listsort.txt: +98 −59
@@ -318,65 +318,104 @@ merging must be done as (A+B)+C or A+(B+C) instead.
 So merging is always done on two consecutive runs at a time, and in-place,
 although this may require some temp memory (more on that later).

-When a run is identified, its base address and length are pushed on a stack
-in the MergeState struct.  merge_collapse() is then called to potentially
-merge runs on that stack.  We would like to delay merging as long as possible
-in order to exploit patterns that may come up later, but we like even more to
-do merging as soon as possible to exploit that the run just found is still
-high in the memory hierarchy.  We also can't delay merging "too long" because
-it consumes memory to remember the runs that are still unmerged, and the
-stack has a fixed size.
-
-What turned out to be a good compromise maintains two invariants on the
-stack entries, where A, B and C are the lengths of the three rightmost not-yet
-merged slices:
-
-1.  A > B+C
-2.  B > C
-
-Note that, by induction, #2 implies the lengths of pending runs form a
-decreasing sequence.  #1 implies that, reading the lengths right to left,
-the pending-run lengths grow at least as fast as the Fibonacci numbers.
-Therefore the stack can never grow larger than about log_base_phi(N) entries,
-where phi = (1+sqrt(5))/2 ~= 1.618.  Thus a small # of stack slots suffice
-for very large arrays.
-
-If A <= B+C, the smaller of A and C is merged with B (ties favor C, for the
-freshness-in-cache reason), and the new run replaces the A,B or B,C entries;
-e.g., if the last 3 entries are
-
-    A:30  B:20  C:10
-
-then B is merged with C, leaving
-
-    A:30  BC:30
-
-on the stack.  Or if they were
-
-    A:500  B:400  C:1000
-
-then A is merged with B, leaving
-
-    AB:900  C:1000
-
-on the stack.
-
-In both examples, the stack configuration after the merge still violates
-invariant #2, and merge_collapse() goes on to continue merging runs until
-both invariants are satisfied.  As an extreme case, suppose we didn't do the
-minrun gimmick, and natural runs were of lengths 128, 64, 32, 16, 8, 4, 2,
-and 2.  Nothing would get merged until the final 2 was seen, and that would
-trigger 7 perfectly balanced merges.
-
-The thrust of these rules when they trigger merging is to balance the run
-lengths as closely as possible, while keeping a low bound on the number of
-runs we have to remember.  This is maximally effective for random data,
-where all runs are likely to be of (artificially forced) length minrun, and
-then we get a sequence of perfectly balanced merges (with, perhaps, some
-oddballs at the end).
-
-OTOH, one reason this sort is so good for partly ordered data has to do
-with wildly unbalanced run lengths.
+When a run is identified, its length is passed to found_new_run() to
+potentially merge runs on a stack of pending runs.  We would like to delay
+merging as long as possible in order to exploit patterns that may come up
+later, but we like even more to do merging as soon as possible to exploit
+that the run just found is still high in the memory hierarchy.  We also can't
+delay merging "too long" because it consumes memory to remember the runs that
+are still unmerged, and the stack has a fixed size.
+
+The original version of this code used the first thing I made up that didn't
+obviously suck ;-)  It was loosely based on invariants involving the Fibonacci
+sequence.
+
+It worked OK, but it was hard to reason about, and was subtle enough that the
+intended invariants weren't actually preserved.  Researchers discovered that
+when trying to complete a computer-generated correctness proof.  That was
+easily-enough repaired, but the discovery spurred quite a bit of academic
+interest in truly good ways to manage incremental merging on the fly.
+
+At least a dozen different approaches were developed, some provably having
+near-optimal worst case behavior with respect to the entropy of the
+distribution of run lengths.  Some details can be found in bpo-34561.
+
+The code now uses the "powersort" merge strategy from:
+
+    "Nearly-Optimal Mergesorts:  Fast, Practical Sorting Methods
+    That Optimally Adapt to Existing Runs"
+    J. Ian Munro and Sebastian Wild
+
+The code is pretty simple, but the justification is quite involved, as it's
+based on fast approximations to optimal binary search trees, which are
+substantial topics on their own.
+
+Here we'll just cover some pragmatic details:
+
+The `powerloop()` function computes a run's "power".  Say two adjacent runs
+begin at index s1.  The first run has length n1, and the second run (starting
+at index s1+n1, called "s2" below) has length n2.  The list has total length n.
+The "power" of the first run is a small integer, the depth of the node
+connecting the two runs in an ideal binary merge tree, where power 1 is the
+root node, and the power increases by 1 for each level deeper in the tree.
+
+The power is the least integer L such that the "midpoint interval" contains
+a rational number of the form J/2**L.  The midpoint interval is the semi-
+closed interval:
+
+    ((s1 + n1/2)/n, (s2 + n2/2)/n]
+
+Yes, that's brain-busting at first ;-)  Concretely, if (s1 + n1/2)/n and
+(s2 + n2/2)/n are computed to infinite precision in binary, the power L is
+the first position at which the 2**-L bit differs between the expansions.
+Since the left end of the interval is less than the right end, the first
+differing bit must be a 0 bit in the left quotient and a 1 bit in the right
+quotient.
+
+`powerloop()` emulates these divisions, 1 bit at a time, using comparisons,
+subtractions, and shifts in a loop.
+
+You'll notice the paper uses an O(1) method instead, but that relies on two
+things we don't have:
+
+- An O(1) "count leading zeroes" primitive.  We can find such a thing as a C
+  extension on most platforms, but not all, and there's no uniform spelling
+  on the platforms that support it.
+
+- Integer division on an integer type twice as wide as needed to hold the
+  list length.  But the latter is Py_ssize_t for us, and that's typically the
+  widest native signed integer type the platform supports.
+
+But since runs in our algorithm are almost never very short, the once-per-run
+overhead of `powerloop()` seems lost in the noise.
+
+Detail: why is Py_ssize_t "wide enough" in `powerloop()`?  We do, after all,
+shift integers of that width left by 1.  How do we know that won't spill into
+the sign bit?  The trick is that we have some slop.  `n` (the total list
+length) is the number of list elements, which is at most 4 times (on a 32-bit
+box, with 4-byte pointers) smaller than the largest size_t.  So at least the
+leading two bits of the integers we're using are clear.
+
+Since we can't compute a run's power before seeing the run that follows it,
+the most-recently identified run is never merged by `found_new_run()`.
+Instead a new run is only used to compute the 2nd-most-recent run's power.
+Then adjacent runs are merged so long as their saved power (tree depth) is
+greater than that newly computed power.  When found_new_run() returns, only
+then is the new run pushed onto the stack of pending runs.
+
+A key invariant is that the powers on the run stack are strictly decreasing
+(starting from the run at the top of the stack).
+
+Note that even powersort's strategy isn't always truly optimal.  It can't be.
+Computing an optimal merge sequence can be done in time quadratic in the
+number of runs, which is very much slower, and also requires finding &
+remembering _all_ the runs' lengths (of which there may be billions) in
+advance.  It's remarkable, though, how close to optimal this strategy gets.
+
+Curious factoid: of all the alternatives I've seen in the literature,
+powersort's is the only one that's always truly optimal for a collection of 3
+run lengths (for three lengths A B C, it's always optimal to first merge the
+shorter of A and C with B).
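That 3-run factoid can be checked mechanically. In the sketch below (hypothetical helper names, not part of CPython; merge cost is modeled as the sum of the two merged runs' lengths), powersort merges the boundary with the greater power first, and brute force over small triples confirms that choice always matches the cheaper merge order:

```python
def powerloop(s1, n1, n2, n):
    # Repeated here so the sketch is self-contained; mirrors the C code.
    a = 2 * s1 + n1
    b = a + n1 + n2
    result = 0
    while True:
        result += 1
        if a >= n:
            a -= n
            b -= n
        elif b >= n:
            break
        a <<= 1
        b <<= 1
    return result

def powersort_first_merge(a, b, c):
    """Which boundary powersort merges first for runs of lengths a, b, c:
    the one whose tree node is deeper, i.e. has the greater power.
    """
    n = a + b + c
    p_ab = powerloop(0, a, b, n)    # power of the A|B boundary
    p_bc = powerloop(a, b, c, n)    # power of the B|C boundary
    return "AB" if p_ab > p_bc else "BC"

def merge_cost(a, b, c, first):
    # Merging runs of lengths x and y costs x + y elements moved;
    # the second merge always costs the full a + b + c.
    return ((a + b) if first == "AB" else (b + c)) + (a + b + c)
```

Looping over all triples up to a modest bound and asserting that powersort's choice attains the minimum cost exercises the claim (adjacent boundary powers are never equal, so the comparison above is never a tie).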
Merge Memory
