Introduction to Huffman Coding - A Greedy Algorithm
Huffman Coding is a widely used algorithm for lossless data compression. It assigns variable-length codes to input characters, with shorter codes assigned to more frequent characters. The goal is to minimize the total number of bits used to represent the input data.
Steps Involved
Frequency Calculation:
- Calculate the frequency of each character in the input data.
Priority Queue:
- Create a priority queue (min-heap) where each node is a character and its frequency.
Building the Huffman Tree:
- While there is more than one node in the queue:
- Extract the two nodes with the lowest frequency.
- Create a new internal node with these two nodes as children and a frequency equal to the sum of their frequencies.
- Insert the new node back into the priority queue.
- The remaining node in the queue is the root of the Huffman Tree.
Code Assignment:
- Traverse the Huffman Tree from the root to the leaves, assigning a binary code to each character. Assign '0' for a left edge and '1' for a right edge.
Time and Space Complexity
Time Complexity:
- Building the frequency table: O(n)O(n)O(n)
- Building the priority queue: O(nlogn)O(n \log n)O(nlogn)
- Building the Huffman Tree: O(nlogn)O(n \log n)O(nlogn)
- Total: O(nlogn)O(n \log n)O(nlogn)
Space Complexity:
- Storing the frequency table: O(n)O(n)O(n)
- Storing the Huffman Tree: O(n)O(n)O(n)
- Storing the codes: O(n)O(n)O(n)
- Total: O(n)O(n)O(n)
Limitations of Huffman Coding Algorithm
-
Not Adaptive:
Huffman Coding is not adaptive and requires a complete pass through the data to build the frequency table. -
Not Optimal for Small Data Sets:
For very small data sets, the overhead of storing the Huffman Tree may outweigh the compression benefits. -
Fixed-Length Encoding:
In cases where all characters are equally frequent, Huffman Coding does not provide any compression advantage over fixed-length encoding.
Detailed Explanation with Example
Let's consider the text "abc" with frequencies:
- a: 1
- b: 1
- c: 1
Building the Huffman Tree:
- Initial heap: [(1, 'a'), (1, 'b'), (1, 'c')]
- Pop a and b, create new node with frequency 2.
- Heap after first merge: [(1, 'c'), (2, None)]
- Pop c and new node, create root node with frequency 3.
- Final tree:
Assigning Codes:
- c: 0
- a: 10
- b: 11
Encoding:
- Text "abc" becomes "10110"
This is a simplified example, but it shows how the algorithm works in practice. The actual frequencies and codes will depend on the input text.
*
/ \
c *
/ \
a b