Sorting!

Christopher Swenson

ConFoo.ca, Montréal 2017

What is this talk?

Sorting!

We'll talk about lots of algorithms, including quick sort, merge sort, and Tim sort.


Who is this talk for?

Curious people who love sorting.

Who am I?

Christopher Swenson, Ph.D.

Currently at Twilio, previously Google, US Government, Simple, Capital One.

I love sorting.

I wrote a moderately popular C sorting library.

Motivation

In 2010, I wanted to learn how Tim sort worked

So I implemented a bunch of sorts, including Tim sort

Sorting

One of the classic problems of computer science

Not as simple as we'd initially think

Sidenote: a lot of things aren't (e.g., strlen)

Bubble sort

Many have probably "invented" this:

  • Bubble the largest element to the top
  • Bubble the next largest, etc.

for i in range(len(data)):
  for j in range(len(data) - i - 1):
    # Swap adjacent out-of-order pairs; the largest remaining
    # element "bubbles" to the end on each pass.
    if data[j + 1] < data[j]:
      data[j], data[j + 1] = data[j + 1], data[j]

Don't. Ever. Use. Bubble. Sort.

Promise me, if you learn one thing from this talk, it’s that you should never use bubble sort.

But why?

We mostly compare sorting algorithms by

  • Number of compares
  • Number of swaps (less important)

Compares can be expensive: they may involve function calls (.__cmp__(), .compareTo())

Bad bad bubble sort

Uses more compares than any other (reasonable) algorithm

Exactly $\frac{N(N - 1)}{2}$ comparisons

$O(N^2)$ comparisons and memory movement

What are some others?

  • selection sort (127,176 µs)
  • insertion sort (13,443 µs)
  • quick sort (579 µs)
  • merge sort (903 µs)
  • Tim sort (1,005 µs)
  • heap sort (592 µs)
  • smooth sort
  • sample sort
  • bucket sort
  • bogo sort

Timings for sorting 10,000 random 64-bit ints in C on my laptop.

Insertion sort

Assume elements $0, \dots, i-1$ are sorted

Example: insert $3$ into $[0, 4, 7, 9, 10]$:

[0, 4, 7, 9, 10, 3]
[0, 4, 7, 9, 3, 10]
[0, 4, 7, 3, 9, 10]
[0, 4, 3, 7, 9, 10]
[0, 3, 4, 7, 9, 10]

Still $O(N^2)$ worst case, but not as terrible on average

Okay for short lists
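
A minimal Python sketch of the idea, in the spirit of the bubble sort snippet above (illustrative, not tuned):

def insertion_sort(data):
  # Elements 0 .. i-1 are already sorted; insert data[i] among them.
  for i in range(1, len(data)):
    value = data[i]
    j = i - 1
    # Shift larger elements right until value's slot opens up.
    while j >= 0 and data[j] > value:
      data[j + 1] = data[j]
      j -= 1
    data[j + 1] = value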

Detour: binary search

Previously, we wanted to put $3$ in $[0, 4, 7, 9, 10]$. Instead of right-to-left, let's:

  • Start in the middle: $7$
  • $3 < 7$ so throw away right half
  • Recurse

$O(\text{log}\ N)$ comparisons per insertion instead of $O(N)$, but still $O(N^2)$ memory movement overall.

This is called Binary Insertion Sort

[0, 4, 7, 9, 10, 3]
[0, 4, 3, 7, 9, 10]
[0, 3, 4, 7, 9, 10]
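
A sketch using Python's bisect module for the search step; compares drop to $O(\text{log}\ N)$ per insert, but the shift is still linear:

import bisect

def binary_insertion_sort(data):
  for i in range(1, len(data)):
    value = data[i]
    # O(log i) compares to find the slot in the sorted prefix data[0:i]...
    pos = bisect.bisect_right(data, value, 0, i)
    # ...but still O(i) memory movement to open it up.
    data[pos + 1:i + 1] = data[pos:i]
    data[pos] = value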

Quick sort

Divide and conquer

  • Pick an element likely to be near the median, called the pivot (common choices: the last element, or the median of the first, middle, and last elements)
  • Separate into two lists: everything less than pivot, everything greater than pivot
  • Recursively quick sort each half

[9, 10, 4, 0, 7, 3]

[0], [3], [9, 10, 4, 7]

[0], [3], [4], [7], [9, 10]

[0], [3], [4], [7], [9], [10]

[0, 3, 4, 7, 9, 10]
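
A compact sketch of the above; it allocates new lists for clarity, where a real implementation partitions in place:

def quick_sort(data):
  if len(data) <= 1:
    return data
  # Pivot: median of beginning, middle, and end.
  pivot = sorted([data[0], data[len(data) // 2], data[-1]])[1]
  less = [x for x in data if x < pivot]
  equal = [x for x in data if x == pivot]
  greater = [x for x in data if x > pivot]
  return quick_sort(less) + equal + quick_sort(greater)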

Quick sort is good

For many, quick sort is among the best.

Often the default or first thing people think of.

In special cases (e.g., already-sorted input with a naive pivot choice), it degrades to $O(N^2)$.

Merge sort

Recurse:

  • Merge sort left half
  • Merge sort right half
  • Merge

[9, 10, 4, 0, 7, 3]

# split

[9, 10, 4], [0, 7, 3]

# split

[9, 10], [4], [0, 7, 3]

# split

[9], [10], [4], [0, 7, 3]

# trivial

[9], [10], [4], [0, 7, 3]

# merge

[9, 10], [4], [0, 7, 3]

# merge

[4, 9, 10], [0, 7, 3]

# split

[4, 9, 10], [0, 7], [3]

# split

[4, 9, 10], [0], [7], [3]

# trivial

[4, 9, 10], [0], [7], [3]

# merge

[4, 9, 10], [0, 7], [3]

# merge

[4, 9, 10], [0, 3, 7]

# merge

[0, 3, 4, 7, 9, 10]

# done
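
The whole algorithm fits in a short sketch (again allocating new lists for clarity):

def merge(left, right):
  # Repeatedly take the smaller front element of the two sorted halves.
  result, i, j = [], 0, 0
  while i < len(left) and j < len(right):
    if right[j] < left[i]:
      result.append(right[j])
      j += 1
    else:
      result.append(left[i])
      i += 1
  result.extend(left[i:])   # one side is exhausted;
  result.extend(right[j:])  # append whatever remains
  return result

def merge_sort(data):
  if len(data) <= 1:
    return data  # trivial
  mid = len(data) // 2
  return merge(merge_sort(data[:mid]), merge_sort(data[mid:]))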

Merge sort is good

Though simple, merge sort is world-class

  • $O(N\,\text{log}\ N)$ compares
  • $O(N\,\text{log}\ N)$ memory moves
  • $O(\text{log}\ N)$ or $O(N)$ storage

Simple, fast, but rigid

Tim sort

  • Smart balancing merge sort
  • Take advantage of natural sortedness
  • But deal well with randomness
  • Above all: minimize comparisons
  • Iterate through array
  • Build a stack of run sizes
    (a run is a non-decreasing sequence; if decreasing, reverse it in place)
  • Runs must have minimum size 32–64
  • If there is no natural run, make one
  • Use binary insertion sort at low levels (see the sketch below)
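
A simplified sketch of finding one natural run (the names and details here are mine, not CPython's exact listsort code):

def find_run(data, start):
  # Return the index one past the end of the natural run starting at `start`.
  end = start + 1
  if end == len(data):
    return end
  if data[end] < data[start]:
    # Strictly decreasing run: extend it, then reverse it in place.
    while end + 1 < len(data) and data[end + 1] < data[end]:
      end += 1
    data[start:end + 1] = data[start:end + 1][::-1]
  else:
    # Non-decreasing run: just extend it.
    while end + 1 < len(data) and data[end + 1] >= data[end]:
      end += 1
  return end + 1

If the run that comes back is shorter than the minimum run size, it gets extended with binary insertion sort.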

Key: be careful about creating unbalanced run sizes

Ensure that the stack of run sizes $[\dots, A, B, C]$ maintains:

  1. $A > B + C$
  2. $B > C$

Do left- and right-merges to maintain this

$A > B + C:$

  • We eventually will merge $A$ with $B, C, \dots$
  • Want those merges balanced:
    • Reduce memory usage
    • Reduce copying
  • Poor balancing degenerates into insertion-sort-like behavior

If we don’t maintain this:

  • Then $B, C, \dots$ could start to grow
  • We'd later be left with a tiny $A$ and a large $B$

So, that was slightly a lie

Turns out, that was wrong

After 13 years, researchers found that the invariants were wrong

With a stack of run sizes $[\dots, A, B, C, D]$, we need:

  1. $B > C + D$
  2. $A > B + C$
  3. $C > D$
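
A sketch of checking the corrected invariants on the run-size stack (top of the stack at the end of the list; the real merge loop keeps merging until all three hold):

def invariants_hold(runs):
  # runs[-4:] = [A, B, C, D], with D on top of the stack.
  n = len(runs)
  if n >= 2 and runs[-2] <= runs[-1]:              # need C > D
    return False
  if n >= 3 and runs[-3] <= runs[-2] + runs[-1]:   # need B > C + D
    return False
  if n >= 4 and runs[-4] <= runs[-3] + runs[-2]:   # need A > B + C
    return False
  return True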

Tim sort galloping

  • Like doing binary search during merge, but sideways
  • Check position $1, 2, 4, 8, \dots$
  • Good when there are a lot of duplicates
  • Can make performance worse with a small-to-medium number of duplicates
  • I'm not as sold on this
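
A rough sketch of the galloping search itself (simplified; the real routine also tracks a starting hint):

import bisect

def gallop_right(key, data):
  # Count how many leading elements of sorted `data` are <= key.
  # Probe positions 1, 2, 4, 8, ... until we overshoot,
  hi = 1
  while hi < len(data) and data[hi - 1] <= key:
    hi *= 2
  # then binary search the last gap, between hi // 2 and hi.
  return bisect.bisect_right(data, key, hi // 2, min(hi, len(data)))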

Tim sort example: partially sorted numbers

Tim sort example: random numbers

When to use what?

  • Slow comparisons (Python, Ruby, Java): Tim sort
  • Fast comparisons (C/C++ primitive types): quick sort, merge sort, heap sort
  • Huge data: merge sort variants
  • Limited domain: bucket/radix sort