# Sorted

In which I show you around the sorting algorithm used in Java 6

## Respect due

Although it’s a pretty standard interview question, almost nobody writes their own sorting routines these days. Java has a variety of “sorted” things (Sets, Maps, etc), so that you don’t have to worry about it yourself.

If you can say which order any pair of *things* goes in (ie they are mutually-comparable, either by literally being Comparable, or if you have a Comparator for them), then you can do a comparison sort of a collection of *things*. This is great of course, because we’ve usually got better things to do than rewriting a sort. The algorithm which underpins all the sorting features in Java’s collections API is in java.util.Arrays. Let’s have a closer look.

## What they say

This sort is guaranteed to be stable: equal elements will not be reordered as a result of the sort.

The sorting algorithm is a modified mergesort (in which the merge is omitted if the highest element in the low sublist is less than the lowest element in the high sublist). This algorithm offers guaranteed n*log(n) performance.

That’s from the JavaDoc. Most languages have “stable” sort implementations (*rolls eyes at JavaScript*), and the mergesort is a good choice, with `n*log(n)` performance it’s as fast as any comparison sort can be, and it has better worst-case performance than quicksort (used in previous versions of Java) and heapsort. It’s also pretty simple.

## Mergesort

Mergesort works like this:

- Split the input array in half
- Sort the first half
- Sort the second half
*Merge*the two sorted halves together by iterating through both halves simultaneously.- Sorted.

The two halves are sorted by recursion, i.e. by passing the half-arrays to mergesort itself. So, it’s mergesorts all the way down? Not quite. Once the array is small enough, an insertion sort is used instead. This is sensible – insertion sort is very fast on small arrays, it operates in-place and is stable but is *O(n ^{2})* so we don’t use it on bigger arrays. In pseudocode:

mergesort( array ){ if ( array.length < INSERTIONSORT_THRESHOLD ) then insertionsort(array) otherwise first = mergesort( array.slice(0, length/2) ) second = mergesort( array.slice(length/2, length) ) // this is the optimisation referred to in the javadoc if ( first.last_element < second.first_element ) then concatenate(first, second) and return otherwise for ( i=0 to array.length ) if ( first.first_element < second.first_element ) then remove first.first_element and put it into array[i] otherwise remove second.first_element and put it into array[i] }

I hope you can see how the merge sort works. The implementation is about 35 lines long, it’s simple and it works well.

Let’s just look at what `INSERTIONSORT_THRESHOLD`

is. In Java6, this is hard-coded to *7*. This was presumably derived empirically, so lets try to validate it with different values. Using arrays of different lengths filled with random `longs`

and different values for `INSERTIONSORT_THRESHOLD`

(source), here’s the performance of sorting the arrays:

So, what to notice? 7 seems to be a pretty much optimal value, although anything under 20 looks about as good. The odd patterns in the lines for the larger values of T are caused by the dominance of the insertion-sort times in those cases. Insertion sort is actually pretty fast on nearly-sorted data, so the peaks and troughs are caused by differences in the random arrays that I generated, and appear different when the test is re-run.

## Just one more thing…

Now we’ve looked at the merge-sort in Java 6, we’ve seen how it works, and played a little with it. The scene is set. In Sorted (Part II), I’ll look at how some of the features of the upcoming Java 7 release can be used to improve it.

## UPDATE:

All my test data was originally just arrays of `long`

s. A long-long comparison is very cheap, but this sort works with all Comparable objects, so perhaps the `INSERTIONSORT_THRESHOLD`

was chosen to benefit objects which have an expensive `compareTo()`

method? Here’s the same graph as before, but showing the *number of pairwise comparisons* needed rather than the time taken:

Again the variation in how nearly-sorted the input is explains the wobbliness of the higher lines. But the result is interesting, that far fewer comparisons are needed if we cut `INSERTIONSORT_THRESHOLD`

down to 2 or even 1 (in which case it *is* mergesorts all the way down).

So, why 7?

I don’t know – really! It saves some stack depth, as we can avoid the last 3 levels of recursion, which perhaps does count for something – especially as the time saving isn’t much.

**Aside**: QuickSort is another algorithm which can be defined recursively, and like mergesort it is a common optimisation to treat small-enough arrays differently. With quicksort, it is possible to say `if (array.length < SOME_THRESHOLD ) then return `

– then after the quicksort is all said and done we can run an insertion-sort over the whole array in one pass. This approach is not possible with merge-sort though. Why not?*without* sorting

## Leave a Reply