Christopher Swenson
ConFoo.ca, Montréal 2017
Sorting!
We'll talk about lots of algorithms, including quick sort, merge sort, and tim sort.
Curious people who love sorting.
Christopher Swenson, Ph.D.
Currently at Twilio, previously Google, US Government, Simple, Capital One.
I love sorting.
I wrote a moderately popular C sorting library.
In 2010, I wanted to learn how tim sort worked
So I implemented a bunch of sorts, including tim sort
One of
Not as simple as we'd initially think
Sidenote: a lot of things aren't (e.g., strlen)
Many have probably "invented" this:
for i in range(len(data)):
    for j in range(len(data) - i - 1):
        if data[j + 1] < data[j]:
            data[j], data[j + 1] = data[j + 1], data[j]
Promise me, if you learn one thing from this talk, it’s that you should never use bubble sort.
We mostly compare sorting algorithms by the number of comparisons and the amount of memory movement.
Compares are bad: they can involve expensive function calls (.__cmp__(), .compareTo())
Uses more compares than any other (reasonable) algorithm
Exactly $\frac{N(N - 1)}{2}$ comparisons
$O(N^2)$ comparisons and memory movement
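As a quick check of that formula, here's a small sketch (an illustrative `bubble_sort_count` helper, not part of the talk) that counts the comparisons a corrected bubble sort makes:

```python
def bubble_sort_count(data):
    # Sorts in place and returns the number of comparisons.
    # Bubble sort always does exactly N(N-1)/2 compares,
    # whether or not the input is already sorted.
    compares = 0
    n = len(data)
    for i in range(n):
        for j in range(n - i - 1):
            compares += 1
            if data[j + 1] < data[j]:
                data[j], data[j + 1] = data[j + 1], data[j]
    return compares
```

For N = 6 this performs 15 comparisons (6·5/2), regardless of input order.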
selection sort (127,176 µs) | heap sort (592 µs) |
insertion sort (13,443 µs) | smooth sort |
quick sort (579 µs) | sample sort |
merge sort (903 µs) | bucket sort |
Tim sort (1005 µs) | bogo sort |
Timings for sorting 10,000 random 64-bit ints in C on my laptop.
Assume elements $0, \dots, i-1$ are sorted
Example: insert $3$ into $[0, 4, 7, 9, 10]$:
Still $O(N^2)$ worst case, but not as terrible on average
Okay for short lists
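The idea above fits in a few lines — a minimal Python sketch (the name `insertion_sort` is just illustrative):

```python
def insertion_sort(data):
    # Elements data[0..i-1] are already sorted; shift data[i]
    # left until it lands in its correct position.
    for i in range(1, len(data)):
        value = data[i]
        j = i - 1
        while j >= 0 and data[j] > value:
            data[j + 1] = data[j]
            j -= 1
        data[j + 1] = value
    return data
```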
Previously, we wanted to put $3$ in $[0, 4, 7, 9, 10]$. Instead of right-to-left, let's:
$O(\log N)$ comparisons instead of $O(N)$, but still $O(N^2)$ memory movement.
This is called Binary Insertion Sort
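A sketch of binary insertion sort in Python, using the standard library's `bisect_right` for the binary search (the function name is illustrative):

```python
from bisect import bisect_right

def binary_insertion_sort(data):
    # Binary search finds the insertion point in O(log N) compares,
    # but the slice assignment still shifts O(N) elements —
    # hence O(N^2) memory movement overall.
    for i in range(1, len(data)):
        value = data[i]
        pos = bisect_right(data, value, 0, i)
        # Shift data[pos..i-1] right by one and drop value in.
        data[pos + 1:i + 1] = data[pos:i]
        data[pos] = value
    return data
```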
Divide and conquer
[9, 10, 4, 0, 7, 3]
[0], [3], [9, 10, 4, 7]
[0], [3], [4], [7], [9, 10]
[0], [3], [4], [7], [9], [10]
[0, 3, 4, 7, 9, 10]
For many, quick sort is among the best.
Often the default or first thing people think of.
For special cases, it can perform poorly.
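A minimal (not in-place) Python sketch of the idea — real quick sorts partition in place and pick pivots carefully to dodge the bad special cases (e.g., already-sorted input with a first-element pivot):

```python
def quick_sort(data):
    # Pick a pivot, partition into smaller/equal/greater, recurse.
    if len(data) <= 1:
        return data
    pivot = data[len(data) // 2]
    less = [x for x in data if x < pivot]
    equal = [x for x in data if x == pivot]
    greater = [x for x in data if x > pivot]
    return quick_sort(less) + equal + quick_sort(greater)
```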
Recurse:
[9, 10, 4, 0, 7, 3]
# split
[9, 10, 4], [0, 7, 3]
# split
[9, 10], [4], [0, 7, 3]
# split
[9], [10], [4], [0, 7, 3]
# trivial
[9], [10], [4], [0, 7, 3]
# merge
[9, 10], [4], [0, 7, 3]
# merge
[4, 9, 10], [0, 7, 3]
# split
[4, 9, 10], [0, 7], [3]
# split
[4, 9, 10], [0], [7], [3]
# trivial
[4, 9, 10], [0], [7], [3]
# merge
[4, 9, 10], [0, 7], [3]
# merge
[4, 9, 10], [0, 3, 7]
# merge
[0, 3, 4, 7, 9, 10]
# done
Though simple, merge sort is world-class
Simple, fast, but rigid
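The split/merge steps above can be sketched directly in Python (names are illustrative; a production version would merge into a preallocated buffer instead of building new lists):

```python
def merge(left, right):
    # Merge two sorted lists into one sorted list.
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if right[j] < left[i]:
            out.append(right[j])
            j += 1
        else:
            out.append(left[i])
            i += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out

def merge_sort(data):
    # Split in half, sort each half, merge the results.
    if len(data) <= 1:
        return data
    mid = len(data) // 2
    return merge(merge_sort(data[:mid]), merge_sort(data[mid:]))
```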
Key: be careful about creating unbalanced run sizes
Ensure that the stack of run sizes $[\dots, A, B, C]$ maintains:
Do left- and right-merges to maintain this
$A > B + C$ and $B > C$
If we don’t maintain this, the run stack can grow larger than the bound it was sized for
Turns out, that was wrong
After 13 years, researchers found that the invariant checking was wrong: checking only the top three runs doesn't keep the invariant holding deeper in the stack
With stack of run sizes $[\dots, A, B, C, D]$, need: $A > B + C$, $B > C + D$, and $C > D$
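A sketch of the corrected merge-collapse loop, tracking only run lengths (this follows the fixed check used in CPython/OpenJDK, which looks at the top four runs; `merge_at` here just sums lengths, where a real implementation merges the actual array ranges):

```python
def merge_at(runs, i):
    # Merge runs i and i+1 (we only track lengths in this sketch).
    runs[i:i + 2] = [runs[i] + runs[i + 1]]

def merge_collapse(runs):
    # Restore the run-stack invariants after pushing a new run.
    # Top of the stack is the end of the list.
    while len(runs) > 1:
        n = len(runs) - 2
        if (n > 0 and runs[n - 1] <= runs[n] + runs[n + 1]) or \
           (n > 1 and runs[n - 2] <= runs[n - 1] + runs[n]):
            # Merge the smaller neighbor of runs[n].
            if runs[n - 1] < runs[n + 1]:
                n -= 1
            merge_at(runs, n)
        elif runs[n] <= runs[n + 1]:
            merge_at(runs, n)
        else:
            break
```

Pushing run lengths 10, 5, 3 leaves the stack untouched (invariants hold); pushing 4 then triggers a cascade of merges down to a single run.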