Skip to content

bpo-34561: Switch to Munro & Wild "powersort" merge strategy. #28108

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Sep 6, 2021

Conversation

tim-one
Copy link
Member

@tim-one tim-one commented Sep 1, 2021

First stab at this - work in progress, but so far so good. Still needs changes to listsort.txt, and, in listobject.c, at least reducing the max stack space needed (for remembering pending runs to be merged ).

https://bugs.python.org/issue34561

tim-one and others added 4 commits September 1, 2021 11:01
member recording a run's starting index - that can be computed
cheaply on-the-fly, by storing instead the key array's base
address just once in the MergeState struct.
code, and a new writeup for listsort.txt.
@tim-one tim-one self-assigned this Sep 2, 2021
@tim-one
Copy link
Member Author

tim-one commented Sep 2, 2021

@rhettinger , I requested your review because you're the only active developer I recall who did significant work on fiddling our sort algorithm - although this change is at a deeper level than, e.g., adding a key= argument.

@taleinat , I requested your review because you once said you'd be happy to review a change of mine nobody else wanted to touch. So I'm calling your bluff 😉.

Both, this change isn't expected to make dramatic changes in timing. Indeed, I can barely detect any in ordinary cases (for example, running Lib/test/sortperf.py). It's aimed instead at replacing my long-ago ad hoc made-up merging strategy with one from later academic work that has provably stellar worst-case behavior, and a published paper readers can consult for details and proofs.

To see a clear - but still modest - timing difference, here's a case contrived to be unusually poor for the current merge strategy:

runs = 190000, 180000, 10000, 10000

from time import perf_counter as now
xs = []
for r in runs:
    xs.extend(range(r * 320))

t0 = now()
xs.sort()
t1 = now()
print(len(xs), t1 - t0)
@tim-one
Copy link
Member Author

tim-one commented Sep 2, 2021

Here's a slightly more effective way to contrive a "bad case" for the current code, which also allows specifying (a close approximation to) the desired list length.

N = 100_000_000  # target length of list

MAX_MINRUN = 64
base = N >> 1

runs = base, base - MAX_MINRUN, MAX_MINRUN, 1

xs = []
for r in runs:
    xs.extend(range(r))

from time import perf_counter as now
t0 = now()
xs.sort()
t1 = now()
print(len(xs), t1 - t0)
@tim-one
Copy link
Member Author

tim-one commented Sep 3, 2021

And another bad-case generator. This exaggerates the effects of merging patterns by using only two distinct objects as list elements, so essentially all operations are done on objects sitting in L1 cache.

It can also generate bad cases for powersort. However, note that they're relatively less bad than the timsort bad cases are for timsort. In fact, before I pulled the "only two objects" trick to slash cache misses, it was hard to be sure they were "bad" cases (ya, they were provably so, but not really measurably so 😉).

N = 100_000_000  # target length of list

MAX_MINRUN = 64

if 1:   # bad for timsort
    base = N >> 1
    runs = base, base - MAX_MINRUN, MAX_MINRUN, 1
else:   # bad for powersort
    base = N // (2 * MAX_MINRUN)
    runs = (base - 1)*MAX_MINRUN, MAX_MINRUN, MAX_MINRUN, base * MAX_MINRUN

xs = []
from itertools import repeat
for r in runs:
    xs.append(0)
    xs.extend(repeat(1, r-1))

from time import perf_counter as now
t0 = now()
xs.sort()
t1 = now()
print(len(xs), t1 - t0)
assert xs[: len(runs) + 1] == [0] * len(runs) + [1]
Copy link
Contributor

@cfbolz cfbolz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a super thorough review, but I did read through all the code and noticed a typo and a nitpick. (aside: it's a pity that we can't easily test those implementation-level C functions, it would be super nice to stick a random test for powerloop somewhere and compare it against a high level python implementation).

* b = s1 + n1 + n2/2 = a + (n1 + n2)/2
*/
Py_ssize_t a = 2 * s1 + n1; /* 2*a */
Py_ssize_t b = a + n1 + n2; /* 2*b */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bit of a nitpick, but I found this confusing when reading the code and trying to follow along with the comments: the math a and b from the comments having the same names as the code a and b, despite those storing 2*math a (and b respectively).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added new comments to explain it. Does that help enough?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They definitely help, but what tripped me up is really the reuse of the letter "a" to mean both "a" (in the comment) and "2*a" (in the code). so maybe the comment or the code could instead talk about x and y instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, but I'm disinclined to make more changes. The overwhelmingly important concept here is "midpoint", and I don't want to obscure it under layers of pedantic name proliferation. In my mind, the midpoints are named "a" and "b", period, and the shift in perspective by 1 bit position is a triviality much better overlooked than emphasized 😉

reflect that it never merges the most recent run anymore. Putting
the new run on the stack before that function is called is a waste of
time now, and just complicated the function with now-silly code to pop
the run off at the start and push it back at the end. Now the caller
pushes it on the stack only after that function returns.
except in an assert. So get rid of it. Assert the run is adjacent
in its caller instead.
@tim-one tim-one merged commit 5cb4c67 into python:main Sep 6, 2021
@bedevere-bot
Copy link

@tim-one: Please replace # with GH- in the commit message next time. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
4 participants