Skip to content
July 10, 2011 / marrowboy

Sorted

In which I show you around the sorting algorithm used in Java 6

Respect due

Although it’s a pretty standard interview question, almost nobody writes their own sorting routines these days. Java has a variety of “sorted” things (Sets, Maps, etc), so that you don’t have to worry about it yourself.

If you can say which order any pair of things goes in (ie they are mutually-comparable, either by literally being Comparable, or if you have a Comparator for them), then you can do a comparison sort of a collection of things. This is great of course, because we’ve usually got better things to do than rewriting a sort. The algorithm which underpins all the sorting features in Java’s collections API is in java.util.Arrays. Let’s have a closer look.

What they say

This sort is guaranteed to be stable: equal elements will not be reordered as a result of the sort.

The sorting algorithm is a modified mergesort (in which the merge is omitted if the highest element in the low sublist is less than the lowest element in the high sublist). This algorithm offers guaranteed n*log(n) performance.

That’s from the JavaDoc. Most languages have “stable” sort implementations (*rolls eyes at JavaScript*), and the mergesort is a good choice, with n*log(n) performance it’s as fast as any comparison sort can be, and it has better worst-case performance than quicksort (used in previous versions of Java) and heapsort. It’s also pretty simple.

Mergesort

Mergesort works like this:

  • Split the input array in half
  • Sort the first half
  • Sort the second half
  • Merge the two sorted halves together by iterating through both halves simultaneously.
  • Sorted.

The two halves are sorted by recursion, i.e. by passing the half-arrays to mergesort itself. So, it’s mergesorts all the way down? Not quite. Once the array is small enough, an insertion sort is used instead. This is sensible – insertion sort is very fast on small arrays, it operates in-place and is stable but is O(n2) so we don’t use it on bigger arrays. In pseudocode:

mergesort( array ){
  if ( array.length < INSERTIONSORT_THRESHOLD ) then insertionsort(array)
  otherwise

  first  = mergesort( array.slice(0, length/2) )
  second = mergesort( array.slice(length/2, length) )

  // this is the optimisation referred to in the javadoc
  if ( first.last_element < second.first_element ) then
     concatenate(first, second) and return
  otherwise

  for ( i=0 to array.length )
    if ( first.first_element < second.first_element ) then
      remove first.first_element and put it into array[i]
    otherwise
      remove second.first_element and put it into array[i]
}

I hope you can see how the merge sort works. The implementation is about 35 lines long, it’s simple and it works well.

Let’s just look at what INSERTIONSORT_THRESHOLD is. In Java6, this is hard-coded to 7. This was presumably derived empirically, so lets try to validate it with different values. Using arrays of different lengths filled with random longs and different values for INSERTIONSORT_THRESHOLD (source), here’s the performance of sorting the arrays:

So, what to notice? 7 seems to be a pretty much optimal value, although anything under 20 looks about as good. The odd patterns in the lines for the larger values of T are caused by the dominance of the insertion-sort times in those cases. Insertion sort is actually pretty fast on nearly-sorted data, so the peaks and troughs are caused by differences in the random arrays that I generated, and appear different when the test is re-run.

Just one more thing…

Now we’ve looked at the merge-sort in Java 6, we’ve seen how it works, and played a little with it. The scene is set. In Sorted (Part II), I’ll look at how some of the features of the upcoming Java 7 release can be used to improve it.

UPDATE:

All my test data was originally just arrays of longs. A long-long comparison is very cheap, but this sort works with all Comparable objects, so perhaps the INSERTIONSORT_THRESHOLD was chosen to benefit objects which have an expensive compareTo() method? Here’s the same graph as before, but showing the number of pairwise comparisons needed rather than the time taken:

Again the variation in how nearly-sorted the input is explains the wobbliness of the higher lines. But the result is interesting, that far fewer comparisons are needed if we cut INSERTIONSORT_THRESHOLD down to 2 or even 1 (in which case it is mergesorts all the way down).

So, why 7?

I don’t know – really! It saves some stack depth, as we can avoid the last 3 levels of recursion, which perhaps does count for something – especially as the time saving isn’t much.

Aside: QuickSort is another algorithm which can be defined recursively, and like mergesort it is a common optimisation to treat small-enough arrays differently. With quicksort, it is possible to say if (array.length < SOME_THRESHOLD ) then return without sorting – then after the quicksort is all said and done we can run an insertion-sort over the whole array in one pass. This approach is not possible with merge-sort though. Why not?

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: