Make delicious recipes!

Huffman Encoding (Finding bit-encoding for each character)


Huffman encoding is a form of compression which assigns least number of bits for encoding the most frequent character.
Compression is achieved due to this because the overall text requires lesser number of bits to encode.

Problem: Given an array of characters sorted by frequency, find the Huffman encoding for the same.

Solution: Let us take an example.

If input array was [A, B, C, D] with frequencies as [10, 30, 50, 90]
Then the Huffman encoding for the same would be [A:110, B:111, C:10, D:0]

Note that while decoding something like 0011010, the above encoding presents no ambiguity.
When begun from the starting, there is only one valid decoding = 0 0 110 10 = DDAC
This is so because none of the characters is encoded in a way that any prefix of its encoding forms a valid encoding for some other character.

Of course, this requires the decoding to start from the beginning always.
If you start arbitrarily from any point in the encoded output, decoding becomes impossible.

Continuing to the problem of encoding, we can get an encoding for each character as follows:
  1. Since the given array is sorted, it can act like a priority queue.
  2. Create a second priority queue to hold intermediate nodes.
  3. Get two characters from first queue and:
    1. Assign '0' to the least-frequency char.
    2. Assign '1' to the other char.
    3. Create a TreeNode with these two nodes as children.
      The frequency of this TreeNode is the sum of frequencies of its children.
    4. Add this TreeNode to the second queue.
  4. Get two nodes from first or second queues, such that their frequencies are the lowest.
    Repeat 3a-3d for those.
  5. Repeat #4 till there is only TreeNode in either of the queues.

The tree construction would look like this:

At the end of tree construction, traverse the tree assigning '0' to left nodes and '1' to right nodes.
Whenever a leaf is encountered, the encoding is print along with the character.


Thus, for the above, the encoding would be:
D: 0
C: 10
A: 110
B: 111






Like us on Facebook to remain in touch
with the latest in technology and tutorials!


Got a thought to share or found a
bug in the code?
We'd love to hear from you:

Name:
Email: (Your email is not shared with anybody)
Comment:

Facebook comments:

Site Owner: Sachin Goyal