Problem: Using 1GB of memory only, find the missing numbers in 4 billion integers stored in a file.

Solution: One integer takes 4 bytes, so 4 billion integers would take:
4 bytes * (4 * 10^{9}) = 16 GB

Since we have only 1 GB of memory, we cannot get all the integers into memory at once.

Even if we could get all numbers into memory, we would need some data-structure like an array where
each integer from the 4 billion lot would be kept and then we will scan the array to find holes.

The array to store the 4 billion integers need not be of type int []

It can be of type boolean [] also since all we need is a true/false to know whether an int is present or not.

Memory required to store 4 billion booleans = 1 bit * (4 * 10^{9}) = 4Gb = 0.5 GB (small b means bit and capital B means bytes).

So this seems doable with 1 GB of memory.
We have a boolean array as boolean numberPresent[4*1000*1000*1000], load a few numbers from the
file in a loop and fill the boolean array.
Once all the integers have been processed, one pass through the array would give us all the missing numbers.

Only problem with this solution is that array indices can only be an integer and size of integer is:
2^{31} = 2147483648 (~2 billion) and
2^{32} = 4294967296 (~4 billion)

But 32^{nd} bit is for the sign bit.
And so any index beyond 2 billion would be treated as a negative index.
To solve this, we will need two boolean arrays each with a capacity of 2 billion integers.
One array will store the presence of first billion integers while the next array will store the remaining 2 billion integers' presence.

What if the memory was limited to 10 MB instead of 1 GB ?

10MB = 10^{7} booleans while space required is 4*10^{9} booleans. Around 400 times more than we have.

To solve this, we create buckets of integers, each bucket covering 10^{4} integers.

So first bucket will cover 0-10,000 integers, second bucket will cover 10,001-20,000 integers and so on.

What information do we store in each bucket?
Clearly it cannot be all the integers in each bucket because then it will be beyond 10MB.

We store count of numbers in each bucket.
If there are no missing numbers in a bucket, then the number of integers in that bucket would be 10,000.
So we make a second pass on the buckets after processing all 4 billion integers. If any bucket has less than 10,000
count, then it has missing numbers and the 4 billion integers can be scanned only in that range to find the missing numbers.

Order of execution would be 3*O(n) and memory required would be = 4 bytes * (4*10^{9}/10^{4}) = 1.6 MB

What is the minimum memory required to do the above?

We have seen that even 1.6 MB memory is sufficient to do the above processing.
Can we find the minimum memory required to achieve the same result?

Note that once a bucket is found to be missing numbers, the 4 billion integers need to be processed again.

In this pass, the missing bucket would need to store all the integers in its range as a boolean array.
So it would need k bits of memory = k/8 bytes of memory

Lets say bucket size is 'k' and number of buckets is 'n'

For minimum memory, size required by buckets array should be same as size required by one faulty bucket.
(Ideally, space required by numbers of all faulty buckets, but we do not how many buckets are faulty in advance.)

Additionally, n*k = 4*10^{9} also holds.

So, for minimum memory, k/8 = 4*n which gives 32*n^{2} = 4*10^{9}
This gives n = 2*10^{4} = 20k approx.
And so minimum memory required = 4 bytes * 20k = 80 KB

Got a thought to share or found a bug in the code? We'd love to hear from you: