If you want to draw m random samples out of a set of n elements, Knuth’s algorithm ‘S’ (ref. The Art of Computer Programming, Vol 2, 3.4.2) comes in handy. It starts out by attempting to select the first element with a probability of m/n; subsequent elements are selected with a probability that depends on whether a previous element was drawn or not.

Let’s do an example. To select 2 random samples out of a set of 5 (say, the digits ‘0’, ‘1’, ‘2’, ‘3’, and ‘4’), we select ‘0’ with a probability of 2/5. If ‘0’ is not chosen, we attempt to select ‘1’ with a probability of 2/4 (since there are only 4 candidates left); otherwise (if ‘0’ has been selected) with a probability of 1/4, since one attempt to draw has already been used up. It’s easy to show the corresponding element selection probabilities in a binary tree:

1 2 3 4 5 6 |
_____ 2/5 _____ '0' / \ __ 2/4 __ __ 1/4 __ '1' / \ / \ 2/3 1/3 0/3 1/3 '2' : : : : : |

In this tree, the nodes represent the likelihood that the element in the column on the right will be picked. Transitions from a node to the right mean that the element was picked, while transistions to the left denote that the corresponding element was not picked. In the ‘selected’ case, the numerator is decreased by one; the denominator is decreased in either case.

Here is an implementation of Knuth’s algorithm in C:

1 2 3 4 5 6 7 8 9 10 11 |
// genknuth -- print m random samples out of the set of numbers [0 .. n) (exclusive). void genknuth(int m, int n) { int i; for (i = 0; i < n; ++i) { if ((bigrand() % (n - i)) < m) { printf("%d ", i); // Selected, so decrement probability numerator. --m; } } } |

For the sake of demonstration, assume that bigrand() is a function that returns a random number that is “big enough”, meaning a value that is greater or equal to 0 and has an upper bound of at least n – 1 (a larger upper bound does not harm). Spend some time to understand how the ‘if’ statement implements the selection with the correct probablity.

So far, so good. But how would this clever algorithm look in another language? Let’s port it to Ruby!

Not being a Ruby programmer (but being an experienced Online Programmer), I had to ask Auntie Google many questions in order to cobble together this code:

1 2 3 4 5 6 7 8 9 10 |
# genknuth -- print m random samples out of the set of numbers [0 .. n) (exclusive). def genknuth(m, n) for i in 0 ... n if (bigrand() % (n - i)) < m print i, " " # Selected, so decrement probability numerator. --m end end end |

This Ruby code looks remarkably similar to the C version. It doesn’t contain any syntax errors, but nevertheless doesn’t output what it is supposed to. Can you see why?

For those folks who don’t want to waste their time on Ruby, here is the Python version, sporting the same behavior and thus the same blunder:

1 2 3 4 5 6 7 |
# genknuth -- print m random samples out of the set of numbers [0 .. n) (exclusive). def genknuth(m, n): for i in range(0, n): if (bigrand() % (n - i)) < m: print i, # Selected, so decrement probability numerator. --m |