return 42;

On randomized algorithms

wheel-of-fortune-Q-18_l.jpgI wrote this article consecutively to the previous one - The other side of the table: Simon Peyton-Jones. After I finished reading Simon's reply, I thought that it would be interesting to write a bit about randomized algorithms. This post should be straightforward. Some later parts require some familiarity with probability and discrete mathematics, but the rest should make sense even if you skip the more math-intensive parts.

Randomization has become a standard approach in algorithm design and it can be applied in a wide range of problems. Often it leads to more efficient, simpler and shorter solutions than their deterministic counterparts. As opposed to a deterministic algorithm, a randomized algorithm takes a source of random numbers and makes random choices during execution. Hence, the behavior of the algorithm can change even on a fixed input and is likely to be good on every input.

The first class of randomized algorithms are Las Vegas algorithms, which always give correct results, the only variation from one run to another being its running time. The running time is a random variable whose expectation is bounded, for example by a polynomial. The second class are Monte Carlo algorithms, whose running time is deterministic, but whose results are only correct with a probability of geq1div3.gif. A widely known example of a Monte Carlo algorithm is the Miller-Rabin primality test.

One of the most widely used randomized algorithms is the randomized Quicksort, which I will talk about. I picked it because it is very practical, easily comprehendible and the implementation is trivial. It is an example of a Las Vegas algorithm. Also, many of you are probably already familiar with either the normal (deterministic) or the randomized version, or both.

Quicksort
hoare.jpg
Quicksort is a divide and conquer sorting algorithm developed by Tony Hoare, who published the first version in the Communications of the ACM, Volume 4, Issue 7, in July 1961. (Oh, just in case any Slashdotters happen to read this: No, the fact he is a senior researcher at Microsoft Research does not make him an evil person.)

The algorithm works by first partitioning (dividing/rearranging) an input array containing the elements to be sorted into 2 sub-arrays around an element called the pivot x.gif, such that all the elements in the lower (left) sub-array are leq.gif the pivot x.gif leq.gif all the elements in the upper (right) sub-array. Following the division is the conquer step, which recursively sorts both sub-arrays. Eventually, there is no work required to combine the sub-arrays since they are sorted in place. A lot of people actually see the recursiveness of Quicksort as one of its weaknesses. I appreciate the compactness and elegance it introduces.

Anyhow, we all know that code says more than a thousand words, so here we go. The following code is one of the simpler implementations of Quicksort, without any optimizations, as it appeared in Steven Skiena's The Algorithm Design Manual.

void quicksort(int s[], int l, int h)
{
    int p; /* index of partition */
    if ((h - l) > 0) {
        p = partition(s, l, h);
        quicksort(s, l, p - 1);
        quicksort(s, p + 1, h);
    }
}

int partition(int s[], int l, int h)
{
    int i;
    int p; /* pivot element index */
    int firsthigh; /* divider position for pivot element */

    p = h;
    firsthigh = l;
    for (i = l; i < h; i++)
        if(s[i] < s[p]) {
            swap(&s[i], &s[firsthigh]);
            firsthigh++;
        }
    swap(&s[p], &s[firsthigh]);

    return(firsthigh);

}

void swap(int *a, int *b)
{
        int x;

        x = *a;
        *a = *b;
        *b = x;
}


I will skip an elaborate analysis of the deterministic Quicksort, as we want to emphasize on the randomized version. Most of you are likely to know that Quicksort has an expected running time of theta_nlogn.gif on an input array of n elements, which is remarkably efficient. If we are unlucky though, it has worst case running time of theta_nsquared.gif - which is worse than Heapsort and Mergesort. This behavior occurs when the partitioning routine produces one sub-array with n-1.gif elements and another one with 0 elements. That happens when the input array n.gif is already sorted, no matter if forward or backwards. In this situation, even insertion sort would run in the shorter time of o_n.gif. So what can we do about that? What is an easy way to improve the performance of Quicksort? There are actually two ways. One would be to randomize the input array. Another one, the one we will talk about now, is to pick the pivot randomly.

Randomized Quicksort

The above implementation automatically selects the last element in each sub-array as the pivot. We will now look at a technique called random sampling. So instead of always taking the last element in each sub-array, we will select a random element as the pivot during the partitioning. One possible implementation based on our previous code looks like this:

int partition(int s[], int l, int h)
{
    int i;
    int p; /* pivot element index */
    int firsthigh; /* divider position for pivot element */

    p = l + (random() % (h - l + 1);
    swap(&s[p], &s[h]);
    firsthigh = l;
    for (i = l; i < h; i++)
        if(s[i] < s[h]) {
            swap(&s[i], &s[firsthigh]);
            firsthigh++;
        }
    swap(&s[h], &s[firsthigh]);

    return(firsthigh);
}

Well, so what are the advantages of the randomized version?

  • The running time is independent of input ordering.
  • The algorithms does not makes any assumptions about the distribution of the elements in the input array. Already sorted arrays will work as well as random ones.
  • No specific input that can produce the worst case behavior, which is determined only by the random number generator.
As we can see, by introducing some randomness the worst case running time of theta_nsquared.gif becomes increasingly unlikely and leaves us with the property that the algorithm has a worst-case expected running time of theta_nlogn.gif. So how to we proof the correctness of this assumption? This is where some probabilistic analysis comes in handy. The following proof is taken from Manuel Blum's lecture notes from his class CMU 15-451 on algorithms.

Analysis of randomized Quicksort

First of all, let's assume that no two elements in the input array are equal. We will be writing will be
the quantity we care about (the total number of comparisons) as a sum of simpler random variables, and then just analyze the simpler ones. Define a random variable xij.gif to be 1 if the algorithm does compare the ith smallest and jth smallest elements during the sort and 0 if it does not. This is called an expectation variable. x_.gif denotes the total number of comparisons made by the algorithm. Since we never compare the same pair of elements twice, we have

proof_ltx_0.gif

and therefore

proof_ltx_1.gif

Let's consider one of these xij.gif for i_smaller_j.gif. Denote the ith smallest element by in the array by ei.gif and the jth smallest element by ej.gif , and conceptually imagine lining up the elements in sorted order. If the pivot we choose happens to be either ei.gif or ej.gif then we compare them. If our pivot is less than ei.gif or greater than ej.gif then both end up in the same bucket and we have to choose another pivot. So, we can think of this like a dart game: we throw a dart at random into the array: if we hit ei.gif or ej.gif then xij.gif becomes 1, if we hit between them xij.gif becomes 0, and otherwise we throw another dart. At each step, the probability that xij.gif equals 1 conditioned on the event that the game ends in that step is exactly 2divjminiplus1.gif. So now we can say that the overall probability that xij.gif equals 1 is 2divjminiplus1.gif.

In other words, for a given element i, it is compared to element iplus1.gif with probability 1, to element iplus2.gif with probability 2div3.gif, to element iplus3.gif with probability 2div4.gif, to element ..., yes, and so on. That gives us

proof_ltx_2.gif

The quantity proof_ltx_3.gif, denoted hn.gif, is called the nth harmonic number and is in the range proof_ltx_4.gif (this can be seen by considering the integral of proof_ltx_5.gif ). Therefore,

proof_ltx_6.gif

Summoning the evil

mcilroy.jpgIn 1999 Doug McIlroy from Dartmouth College published a paper called A Killer Adversary for Quicksort which describes a simple way to find the inputs that even force the randomized Quicksort into the worst case running time of theta_nsquared.gif. It also provides a simple C program illustrating the technique.

Quicksort can be made to go quadratic by constructing input on the fly in response to the sequence of items compared. The technique is illustrated by a specific adversary for the standard C qsort function. The general method works against any implementation of quicksort–even a randomizing one–that satisfies certain very mild and realistic assumptions.
Russ Cox wrote a post about this paper, too.

Conclusion and one more thing

This is just a very trivial example of randomized algorithms and many of you are likely to be already familiar with it. Though, if this post happens to attract some interest in this topic, I might write up some more posts about randomized algorithms. If you want to learn more about Quicksort and the analysis of both, the randomized and the deterministic version, there is a good video lecture available at MIT OpenCourseWare here.

I chose to post source code in C, as I think that it is one of the languages most of you know, that it is more practical than pseudocode and that at the same time it is rather easy to read. If you happen to not be happy with my decision, feel free to post an implementation in your favorite programming language as a comment.

Also, as this is my first rather deep post, if you happen to find an error in this post, please leave a comment and I will correct the mistake. Unfortunately, though, I won't be able to reward you with a hexadecimal dollar as Donald Knuth does.

Update: A few more links to resources related to this topic are posted over here. Also I will write a real follow-up on this topic when I find some time for it.

Labels: , ,

Published Wednesday, 24 June 2009 at 5:46 AM by Tobias Svensson

Comments

  •   Anonymous Anonymous wrote a comment at 24 June 2009 21:42

    There obviously exists input with worst-case behavior since the computer itself is deterministic. Usually you won't know what it is, but there is a chance that you'll get the worst case every time (as small as it may be).
  •   Blogger Gareth McCaughan wrote a comment at 25 June 2009 03:32

    "a probability of >= n/3"? I take it you meant something like "a probability that tends to 1 as n tends to infinity"...
  •   Blogger tobiassvn wrote a comment at 25 June 2009 03:52

    @Gareth Thanks for pointing me at that. It is actually supposed to be >=1/3. This error has now been fixed.

    The probability of the correctness of the answer is only depended of the choices the algorithm makes and repetition decreases the error probability exponentially. >=n/3 would imply that the input plays a part as well, which is not true.
  •   Anonymous Anonymous wrote a comment at 26 June 2009 22:28

    You're using a lomuto partition : they go quadratic just by feeding them an array with the same value in each cell. You can of course fix it by using more randomness : for elements equal to the pivot (s[h]), flip a coin to decide whether to place it in the top or bottom half.
  •   Blogger tobiassvn wrote a comment at 27 June 2009 03:48

    I've chosen the implementation I've used because it is much easier to read, as opposed to Hoare's partitioning algorithm for example, and my goal was to show the advantages of introducing randomness.

Post a Comment

Some HTML tags for basic formatting are working. Unfortunately Blogger won't allow 'pre' and 'code' tags. If you want to paste code, please do it plain. Comments are displayed in a fixed-width font.