Equal probability always
Puzzle: Given a stream of infinite numbers, write an algorithm such that
the probability of choosing a random number at any given time is 1/N, where
N is the count of numbers received so far.
Constraint: Since the stream of numbers is infinite, it is not possible to
store all the numbers either in memory or in secondary storage.
Example: If 1 million numbers have been received by 1 pm, then every number (chosen randomly) should have a probability of being chosen as 1/million
If at 2 pm, 10 million numbers have been received, then every number should have a probability of being chosen as 1/(10 million).
Solution: Since it is not possible to store all the numbers, we should think of storing just one number.
If we save just the last number always, then probability of choosing any other number is 0.
If we save just the first number (or number at N/2 position), then also the situation is no different.
Probability of choosing any number is 1/N if choose a random number between 0 and N-1 and
pick the stream number corresponding to that index. But that would require to store all the numbers.
Another try: We choose a random number between 0 and N and save the current
number (i.e. the Nth number) if that random number equals N.
With this algorithm, the probability of Nth number being choses in clearly 1/N.
Let us check the probability of being chosen for other numbers.
Probability that (N-1)th number was chosen is 1/(N-1).
For (N-1)th number to remain being-chosen when Nth number arrives, the
random number should not equal N and the probability for that is (N-1)/N.
Thus total probability for (N-1)th number to remain chosen when Nth number arrives is (1 / (N-1)) * ((N-1) / N) = 1/N
This can be extended to N-2, N-3 and so on for all the numbers to make sure that every
number has a probability of being chosen as 1/N
Extension to selection of 'k' numbers with k/N probability
If the requirement is to select 'k' numbers instead of just 1 number, then also
the approach remains mostly the same. Analysis of the probabilities changes a little bit though.
If 'k' numbers are to be selected, then required probability is k/N.
To do this, initially we select the first k numbers and put them in an array 'chosenSet'.
For every number that is received in the stream after these 'k' numbers, a
random number is generated between 0 and N.
If this random number is between 0 and k, then chosenSet[k] is replaced
with the current number (i.e. Nth number).
Analysis of probability: For Nth number, the probability of being placed into 'chosenSet' is clearly k/N.
For the numbers chosen in the first slot of 'k' numbers, the probability of remaining in the final selection
(i.e. when the Nth number arrives) is the probability of not being selected when (k+1)th number
arrived, and then not being selected when (k+2)th number arrived and so on.
This probability is:
(k / (k+1)) * ((k+1) / (k+2)) * ... ((N-1) / (N))
which is equal to (k / N)
So, with the above algorithm, 'k' elements can be selected from a stream of infinite numbers with a probability of k/N
Note: This algorithm for selection of k random elements from a set of N is also called Reservoir Sampling
It can be useful for situations like the following:
Given an iterator that can only move forwards, implement a function that will return 'k' random elements
|Email:||(Your email is not shared with anybody)|