Equal probability alwaysPuzzle: Given a stream of infinite numbers, write an algorithm such that the probability of choosing a random number at any given time is 1/N, where N is the count of numbers received so far. Constraint: Since the stream of numbers is infinite, it is not possible to store all the numbers either in memory or in secondary storage. Example: If 1 million numbers have been received by 1 pm, then every number (chosen randomly) should have a probability of being chosen as 1/million If at 2 pm, 10 million numbers have been received, then every number should have a probability of being chosen as 1/(10 million). Solution: Since it is not possible to store all the numbers, we should think of storing just one number. If we save just the last number always, then probability of choosing any other number is 0. If we save just the first number (or number at N/2 position), then also the situation is no different. Probability of choosing any number is 1/N if choose a random number between 0 and N-1 and pick the stream number corresponding to that index. But that would require to store all the numbers. Another try: We choose a random number between 0 and N and save the current number (i.e. the N^{th} number) if that random number equals N. With this algorithm, the probability of N^{th} number being choses in clearly 1/N. Let us check the probability of being chosen for other numbers. Probability that (N-1)^{th} number was chosen is 1/(N-1). For (N-1)^{th} number to remain being-chosen when N^{th} number arrives, the random number should not equal N and the probability for that is (N-1)/N. Thus total probability for (N-1)^{th} number to remain chosen when N^{th} number arrives is (1 / (N-1)) * ((N-1) / N) = 1/N This can be extended to N-2, N-3 and so on for all the numbers to make sure that every number has a probability of being chosen as 1/N Extension to selection of 'k' numbers with k/N probabilityIf the requirement is to select 'k' numbers instead of just 1 number, then also the approach remains mostly the same. Analysis of the probabilities changes a little bit though. If 'k' numbers are to be selected, then required probability is k/N. To do this, initially we select the first k numbers and put them in an array 'chosenSet'. For every number that is received in the stream after these 'k' numbers, a random number is generated between 0 and N. If this random number is between 0 and k, then chosenSet[k] is replaced with the current number (i.e. N^{th} number). Analysis of probability: For N^{th} number, the probability of being placed into 'chosenSet' is clearly k/N. For the numbers chosen in the first slot of 'k' numbers, the probability of remaining in the final selection (i.e. when the N^{th} number arrives) is the probability of not being selected when (k+1)^{th} number arrived, and then not being selected when (k+2)^{th} number arrived and so on. This probability is: (k / (k+1)) * ((k+1) / (k+2)) * ... ((N-1) / (N)) which is equal to (k / N) So, with the above algorithm, 'k' elements can be selected from a stream of infinite numbers with a probability of k/N Note: This algorithm for selection of k random elements from a set of N is also called Reservoir Sampling It can be useful for situations like the following: Given an iterator that can only move forwards, implement a function that will return 'k' random elements |
Got a thought to share or found a
bug in the code?
We'd love to hear from you:
Name: | |
Email: | (Your email is not shared with anybody) |
Comment: |
Facebook comments: