Sorting!

Christopher Swenson

ConFoo.ca, Montréal 2017

What is this talk?

Sorting!

We'll talk about lots of algorithms, including quick sort, merge sort, and Tim sort.


Who is this talk for?

Curious people who love sorting.

Who am I?

Christopher Swenson, Ph.D.

Currently at Twilio, previously Google, US Government, Simple, Capital One.

I love sorting.

I wrote a moderately popular C sorting library.

Motivation

In 2010, I wanted to learn how Tim sort worked

So I implemented a bunch of sorts, including Tim sort

Sorting

One of the classic problems of computer science

Not as simple as we'd initially think

Sidenote: a lot of things aren't (e.g., strlen)

Bubble sort

Many have probably "invented" this:

  • Bubble the largest element to the top
  • Bubble the next largest, etc.

for i in range(len(data)):
  for j in range(len(data) - i - 1):
    # Swap adjacent out-of-order pairs; the largest remaining
    # element "bubbles" to the end on each pass.
    if data[j + 1] < data[j]:
      data[j], data[j + 1] = data[j + 1], data[j]

Don't. Ever. Use. Bubble. Sort.

Promise me, if you learn one thing from this talk, it’s that you should never use bubble sort.

But why?

We mostly compare sorting algorithms by

  • Number of compares
  • Number of swaps (less important)

Compares can be expensive: they may involve function calls (.__cmp__(), .compareTo())

Bad bad bubble sort

Uses more compares than any other (reasonable) algorithm

Exactly $\frac{N(N - 1)}{2}$ comparisons

$O(N^2)$ comparisons and memory movement

What are some others?

  • selection sort (127,176 µs)
  • insertion sort (13,443 µs)
  • quick sort (579 µs)
  • merge sort (903 µs)
  • Tim sort (1,005 µs)
  • heap sort (592 µs)
  • smooth sort
  • sample sort
  • bucket sort
  • bogo sort

Timings for sorting 10,000 random 64-bit ints in C on my laptop.

Insertion sort

Assume elements $0, \dots, i-1$ are sorted

Example: insert $3$ into $[0, 4, 7, 9, 10]$:

[0, 4, 7, 9, 10, 3]
[0, 4, 7, 9, 3, 10]
[0, 4, 7, 3, 9, 10]
[0, 4, 3, 7, 9, 10]
[0, 3, 4, 7, 9, 10]

Still $O(N^2)$ worst case, but not as terrible on average

Okay for short lists
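
A minimal Python sketch of the idea, in the spirit of the bubble sort snippet above (illustrative, not tuned):

def insertion_sort(data):
  # Elements 0 .. i-1 are already sorted; insert data[i] among them.
  for i in range(1, len(data)):
    value = data[i]
    j = i - 1
    # Shift larger elements right until value's slot opens up.
    while j >= 0 and data[j] > value:
      data[j + 1] = data[j]
      j -= 1
    data[j + 1] = value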

Detour: binary search

Previously, we wanted to put $3$ in $[0, 4, 7, 9, 10]$. Instead of right-to-left, let's:

  • Start in the middle: $7$
  • $3 < 7$ so throw away right half
  • Recurse

$O(\text{log}\ N)$ comparisons per insertion instead of $O(N)$, but still $O(N^2)$ memory movement overall.

This is called Binary Insertion Sort

[0, 4, 7, 9, 10, 3]
[0, 4, 3, 7, 9, 10]
[0, 3, 4, 7, 9, 10]
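
A sketch using Python's bisect module for the search step; compares drop to $O(\text{log}\ N)$ per insert, but the shift is still linear:

import bisect

def binary_insertion_sort(data):
  for i in range(1, len(data)):
    value = data[i]
    # O(log i) compares to find the slot in the sorted prefix data[0:i]...
    pos = bisect.bisect_right(data, value, 0, i)
    # ...but still O(i) memory movement to open it up.
    data[pos + 1:i + 1] = data[pos:i]
    data[pos] = value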

Quick sort

Divide and conquer

  • Pick an element likely to be near the median, called the pivot (common choices: the last element, or the median of the first, middle, and last elements)
  • Separate into two lists: everything less than pivot, everything greater than pivot
  • Recursively quick sort each half

[9, 10, 4, 0, 7, 3]

[0], [3], [9, 10, 4, 7]

[0], [3], [4], [7], [9, 10]

[0], [3], [4], [7], [9], [10]

[0, 3, 4, 7, 9, 10]
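
A compact sketch of the above; it allocates new lists for clarity, where a real implementation partitions in place:

def quick_sort(data):
  if len(data) <= 1:
    return data
  # Pivot: median of beginning, middle, and end.
  pivot = sorted([data[0], data[len(data) // 2], data[-1]])[1]
  less = [x for x in data if x < pivot]
  equal = [x for x in data if x == pivot]
  greater = [x for x in data if x > pivot]
  return quick_sort(less) + equal + quick_sort(greater)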

Quick sort is good

For many, quick sort is among the best.

Often the default or first thing people think of.

In special cases (e.g., already-sorted input with a naive pivot choice), it degrades to $O(N^2)$.

Merge sort

Recurse:

  • Merge sort left half
  • Merge sort right half
  • Merge

[9, 10, 4, 0, 7, 3]

# split

[9, 10, 4], [0, 7, 3]

# split

[9, 10], [4], [0, 7, 3]

# split

[9], [10], [4], [0, 7, 3]

# trivial

[9], [10], [4], [0, 7, 3]

# merge

[9, 10], [4], [0, 7, 3]

# merge

[4, 9, 10], [0, 7, 3]

# split

[4, 9, 10], [0, 7], [3]

# split

[4, 9, 10], [0], [7], [3]

# trivial

[4, 9, 10], [0], [7], [3]

# merge

[4, 9, 10], [0, 7], [3]

# merge

[4, 9, 10], [0, 3, 7]

# merge

[0, 3, 4, 7, 9, 10]

# done
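
The whole algorithm fits in a short sketch (again allocating new lists for clarity):

def merge(left, right):
  # Repeatedly take the smaller front element of the two sorted halves.
  result, i, j = [], 0, 0
  while i < len(left) and j < len(right):
    if right[j] < left[i]:
      result.append(right[j])
      j += 1
    else:
      result.append(left[i])
      i += 1
  result.extend(left[i:])   # one side is exhausted;
  result.extend(right[j:])  # append whatever remains
  return result

def merge_sort(data):
  if len(data) <= 1:
    return data  # trivial
  mid = len(data) // 2
  return merge(merge_sort(data[:mid]), merge_sort(data[mid:]))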

Merge sort is good

Though simple, merge sort is world-class

  • $O(N\,\text{log}\ N)$ compares
  • $O(N\,\text{log}\ N)$ memory moves
  • $O(\text{log}\ N)$ or $O(N)$ storage

Simple, fast, but rigid

Tim sort

  • Smart balancing merge sort
  • Take advantage of natural sortedness
  • But deal well with randomness
  • Above all: minimize comparisons
  • Iterate through array
  • Build a stack of run sizes
    (a run is a non-decreasing sequence; if decreasing, reverse it in place)
  • Runs must have minimum size 32–64
  • If there is no natural run, make one
  • Use binary insertion sort at low levels (see the sketch below)
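
A simplified sketch of finding one natural run (the names and details here are mine, not CPython's exact listsort code):

def find_run(data, start):
  # Return the index one past the end of the natural run starting at `start`.
  end = start + 1
  if end == len(data):
    return end
  if data[end] < data[start]:
    # Strictly decreasing run: extend it, then reverse it in place.
    while end + 1 < len(data) and data[end + 1] < data[end]:
      end += 1
    data[start:end + 1] = data[start:end + 1][::-1]
  else:
    # Non-decreasing run: just extend it.
    while end + 1 < len(data) and data[end + 1] >= data[end]:
      end += 1
  return end + 1

If the run that comes back is shorter than the minimum run size, it gets extended with binary insertion sort.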

Key: be careful about creating unbalanced run sizes

Ensure that the stack of run sizes $[\dots, A, B, C]$ maintains:

  1. $A > B + C$
  2. $B > C$

Do left- and right-merges to maintain this

$A > B + C:$

  • We eventually will merge $A$ with $B, C, \dots$
  • Want those merges balanced:
    • Reduce memory usage
    • Reduce copying
  • Poor balancing degenerates into insertion-sort-like behavior

If we don’t maintain this:

  • Then $B, C, \dots$ could start to grow
  • We'd later be left with a tiny $A$ and a large $B$

So, that was slightly a lie

Turns out, that was wrong

After 13 years, researchers found that the invariants were wrong

With a stack of run sizes $[\dots, A, B, C, D]$, we need:

  1. $B > C + D$
  2. $A > B + C$
  3. $C > D$
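
A sketch of checking the corrected invariants on the run-size stack (top of the stack at the end of the list; the real merge loop keeps merging until all three hold):

def invariants_hold(runs):
  # runs[-4:] = [A, B, C, D], with D on top of the stack.
  n = len(runs)
  if n >= 2 and runs[-2] <= runs[-1]:              # need C > D
    return False
  if n >= 3 and runs[-3] <= runs[-2] + runs[-1]:   # need B > C + D
    return False
  if n >= 4 and runs[-4] <= runs[-3] + runs[-2]:   # need A > B + C
    return False
  return True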

Tim sort galloping

  • Like doing binary search during merge, but sideways
  • Check position $1, 2, 4, 8, \dots$
  • Good when there are a lot of duplicates
  • Can make performance worse with a small-to-medium number of duplicates
  • I'm not as sold on this
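
A rough sketch of the galloping search itself (simplified; the real routine also tracks a starting hint):

import bisect

def gallop_right(key, data):
  # Count how many leading elements of sorted `data` are <= key.
  # Probe positions 1, 2, 4, 8, ... until we overshoot,
  hi = 1
  while hi < len(data) and data[hi - 1] <= key:
    hi *= 2
  # then binary search the last gap, between hi // 2 and hi.
  return bisect.bisect_right(data, key, hi // 2, min(hi, len(data)))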

Tim sort example: partially sorted numbers

Tim sort example: random numbers

When to use what?

  • Slow comparisons (Python, Ruby, Java): Tim sort
  • Fast comparisons (C/C++ primitive types): quick sort, merge sort, heap sort
  • Huge data: merge sort variants
  • Limited domain: bucket/radix sort