## An Empirical Analysis of QuickSelect and QuickMedian

The QuickSelect algorithm is an efficient way to find the N-th largest element in a list. The QuickMedian algorithm uses QuickSelect to efficiently find the middle element(s) of the array.

This blog post sketches how QuickSelect and QuickMedian work in general and in my Java implementation. The bulk of the post discusses the plot of QuickMedian vs. SortMedian’s runtime shown below (click the image for more details). ## QuickSelect Efficiency

QuickSelect uses the partitioning step of QuickSort to divide an array into two sections around a pivot.  All elements smaller than the pivot move before the pivot and all elements larger than the pivot move after the pivot.  A new pivot is chosen on the side of the partition that will contain the N-th element, and the partitioning step is run again.  It turns out that the number of elements visited on successive iterations of partitioning decreases so quickly in the average case that the overall runtime is expected $O(n)$, although worst case is $O(n^2)$.

## QuickMedian Implementation

QuickMedian was originally intended to be a demonstration of the QuickSelect algorithm.  It turns out to be surprisingly tricky to write a robust median function!  Much of the trouble comes from handling even-length arrays in the case of large or infinite values.  When an array is of even length and the middle two values have large magnitudes, we need an averaging formula that avoids overflow.  When the middle two elements are infinity, in some cases we must return `NaN`.

Presenting arrays containing `NaN` to QuickSelect produces undefined behavior, which seems like the correct behavior.  What seems like incorrect behavior is present in Numpy, where `NaN` sorts before other values.  Here, the median value of a list can be decreased by inserting things which are not even numbers! Incorrect behavior may also result from a median calculation that uses Python’s native list `sort()` method.  Here, `NaN` is not ordered with respect to other numbers, so the median of an array can change from call to call!

The final problem faced by QuickMedian was one of inefficiency. Even-length arrays require two calls to QuickSelect to find the middle two elements. This is inefficient if implemented naively, because the two calls operate independently over the full array.  It should not be the case that runtimes vary greatly between arrays of length 1,000,000 and 1,000,001. To improve performance, QuickSelect was refactored to return the array indexes defining the smallest partition found that contains the N-th element. This partition is usually very small, and so the second QuickSelect call over this sub-array has negligible cost.

## QuickMedian versus SortMedian

I wanted to examine the performance of QuickMedian, prove that it worked better than the sorting median method, and analyze the effects of different pivot picking methods.  The wall clock time was accumulated over the course of several randomized trials, divided and sorted carefully, then plotted.

### Method

The evaluation consisted of 100 repetitions of trials in which several conditions were exhaustively varied.  In each trial, the median methods processed a total of 1e6 elements, which were divided up into arrays of constant length.  This number of elements was used because processing just one array of length 1000 was too quick to measure using the wall clock system time on my machine.  The conditions which were varied across trials were:

• Length of arrays into which the 1e6 elements were divided
• Even or odd-length arrays
• Sorting of the arrays: random, sorted, reverse-sorted, mostly-sorted, mostly-reverse-sorted
• QuickSelect pivot-picking method: middle, random, median-of-three, median-of-randomized-three
• Whether the arrays contained duplicates
Array element values were consecutive integers; each integer was repeated under the “duplicate” condition.  The mostly-sorted arrays were sorted but for the very middle 10%.  Trials were completely randomized in a paranoid effort to eliminate JVM runtime optimizations.

### Analysis

The runtimes were divided into series and sorted in such a way as to show interesting systemic effects.

Sorting by array length addressed the main question of computational complexity and runtime speed.  The QuickMedian runtimes remain roughly constant, while the SortMedian runtimes increase.  Remember that the $N=1e6$ elements in each trial are divided into $k$ arrays each of length $n$ such that $n = N/k$.  QuickMedian has an expected runtime of $k * O(n)$, which is $O(N)$ and thus expected to be constant across trials.  SortMedian has an expected runtime of $k * O(n \log n)$ which is expected $O(N \log n)$, which is expected to increase as the array length increases!  The array lengths grow exponentially (something like 1e2, 1e3, 1e4, and 1e5), so the runtimes increase linearly.  This, and the fact that QuickMedian’s runtimes lie well-below most of the SortMedian runtimes)  is perhaps the central finding of the evaluation.

Then, splitting by the most interesting intra-algorithm condition showed further systemic effects.  The runtimes from the QuickMedian algorithm were divided into series based upon the pivot picking method used, while the SortMedian runtimes were divided by how the array was initially sorted.

The SortMedian results show that my platform’s Java sorting algorithm works fastest on reverse-sorted arrays.  The ripples in these series are interestingly due to the presence or absence of duplicates.

The QuickMedian results confirm that the fastest runtime is given by the deterministic pivot picking methods on sorted arrays, where the middle element is almost immediately found.  I was surprised to see that the randomized median-of-three pivot picking method never outperforms the deterministic median-of-three pivot picking method.  However, it makes sense because median-of-three killer sequences are probably unlikely.  I haven’t looked into this behavior yet, but I am assuming that the randomized median-of-three pivot picking method is more resistant to naturally-occurring killer sequences, so it will remain the default pivot picking method in QuickMedian.

## Conclusion

QuickMedian is clearly the fastest way to find the median, and should be used particularly when the length of input arrays is at least 100 or greater.  If throughput performance is important, you are operating on safe and mostly-sorted data, and the possibility of a worst-case O(n^2) runtime won’t dissuade you: consider switching to the non-randomized median-of-three pivot picking method when constructing the QuickMedian object.

Tags: ,