Huffman Coding Algorithm Tutorial
Dr Arun Kumar
PhD (Computer Science)
Table of Contents
- Introduction to Huffman Coding - A Greedy Algorithm
- Steps Involved
- Frequency Calculation
- Priority Queue
- Building the Huffman Tree
- Code Assignment
- Time and Space Complexity
- Time Complexity
- Space Complexity
- Limitations of Huffman Coding Algorithm
- Not Adaptive
- Not Optimal for Small Data Sets
- Fixed-Length Encoding
- Detailed Explanation with Example
- Step by Step Example
- 1. Importing Libraries and Defining the Huffman Node Class
- 2. Building the Huffman Tree
- 3. Assigning Codes to Characters
- 4. Encoding the Text
- Example Usage
- Frequently Asked Questions
- What is the Huffman coding algorithm?
- What are the two types of Huffman coding?
- Is Huffman coding a Greedy algorithm?
- What is the disadvantage of Huffman algorithm?
- What is better than Huffman coding algorithm?
Introduction to Huffman Coding - A Greedy Algorithm
Huffman Coding is a widely used algorithm for lossless data compression. It assigns variable-length codes to input characters, with shorter codes assigned to more frequent characters. The goal is to minimize the total number of bits used to represent the input data.
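For a quick sense of the payoff, consider a hypothetical string "aaabbc" (used here purely as an illustration). With a fixed-length code every character costs the same number of bits; with hand-picked variable-length codes that favour the frequent character 'a', the same string needs fewer bits:

# Illustration only: the codes below are chosen by hand to mimic what
# Huffman coding produces for this string; the algorithm that derives
# them automatically is described in the rest of this tutorial.
text = "aaabbc"
codes = {'a': '0', 'b': '11', 'c': '10'}   # variable-length, prefix-free codes
encoded = ''.join(codes[ch] for ch in text)
print(len(encoded))        # 9 bits with variable-length codes
print(2 * len(text))       # 12 bits with a fixed 2-bit code per character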
Steps Involved
Frequency Calculation:
- Calculate the frequency of each character in the input data.
Priority Queue:
- Create a priority queue (min-heap) where each node is a character and its frequency.
Building the Huffman Tree:
- While there is more than one node in the queue:
  - Extract the two nodes with the lowest frequency.
  - Create a new internal node with these two nodes as children and a frequency equal to the sum of their frequencies.
  - Insert the new node back into the priority queue.
- The remaining node in the queue is the root of the Huffman Tree.
Code Assignment:
- Traverse the Huffman Tree from the root to the leaves, assigning a binary code to each character. Assign '0' for a left edge and '1' for a right edge.
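To see how these steps produce the codes used in the illustration above, here is a short hand trace for the same hypothetical string "aaabbc":
- Frequencies: a: 3, b: 2, c: 1.
- Merge the two lowest-frequency nodes, c (1) and b (2), into an internal node with frequency 3.
- Merge that node (3) with a (3) to form the root with frequency 6; the queue now holds a single node, so the tree is complete.
- Reading the edges from the root gives one valid assignment: a = 0, c = 10, b = 11, so "aaabbc" encodes to 9 bits instead of the 12 bits a fixed 2-bit code would need.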
Time and Space Complexity
Time Complexity:
- Building the frequency table: O(n)
- Building the priority queue: O(n log n)
- Building the Huffman Tree: O(n log n)
- Total: O(n log n)
Here n is the length of the input text; the heap never holds more than the number of distinct characters, which is at most n, so O(n log n) is a safe upper bound.
Space Complexity:
- Storing the frequency table: O(n)
- Storing the Huffman Tree: O(n)
- Storing the codes: O(n)
- Total: O(n)
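One way to make the O(n) tree bound concrete: a Huffman tree built over k distinct characters always has exactly 2k - 1 nodes (k leaves plus k - 1 internal merge nodes, since every merge removes two nodes and adds one). The small sketch below counts them, assuming the build_huffman_tree function defined later in this tutorial:

def count_nodes(node):
    # Visit every node once; internal nodes have two children, leaves have none.
    if node is None:
        return 0
    return 1 + count_nodes(node.left) + count_nodes(node.right)

root = build_huffman_tree("huffman coding example")   # this string has 16 distinct characters
print(count_nodes(root))                              # prints 31 (= 2 * 16 - 1)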
Limitations of Huffman Coding Algorithm
Not Adaptive:
- Huffman Coding is not adaptive; it requires a complete pass through the data to build the frequency table before encoding can begin.
Not Optimal for Small Data Sets:
- For very small data sets, the overhead of storing the Huffman Tree may outweigh the compression benefits.
Fixed-Length Encoding:
- When all characters are equally frequent, Huffman Coding provides no compression advantage over fixed-length encoding. For example, with four equally frequent symbols, every Huffman code is 2 bits long, exactly as in a fixed 2-bit encoding.
Detailed Explanation with Example
Let's consider the text "abc" with frequencies:
- a: 1
- b: 1
- c: 1
Building the Huffman Tree:
- Initial heap: [(1, 'a'), (1, 'b'), (1, 'c')]
- Pop a and b, create new node with frequency 2.
- Heap after first merge: [(1, 'c'), (2, None)]
- Pop c and new node, create root node with frequency 3.
- Final tree:

        *
       / \
      c   *
         / \
        a   b
Assigning Codes:
- c: 0
- a: 10
- b: 11
Encoding:
- Text "abc" becomes "10110"
This is a simplified example, but it shows how the algorithm works in practice. The actual frequencies and codes will depend on the input text.
Step By Step Example
1. Importing Libraries and Defining the Huffman Node Class
First, we import the necessary libraries and define the HuffmanNode class.
import heapq
from collections import Counter

class HuffmanNode:
    def __init__(self, char, freq):
        self.char = char
        self.freq = freq
        self.left = None
        self.right = None

    def __lt__(self, other):
        return self.freq < other.freq
- heapq: This is used to implement a priority queue.
- Counter: This is used to count the frequency of characters in the text.
- HuffmanNode Class: This class represents a node in the Huffman Tree.
- __init__: Initializes a node with a character and its frequency. left and right are set to None initially.
- __lt__: Defines the less-than operator to compare nodes based on frequency. This is necessary for the priority queue to work correctly.
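As a quick check that __lt__ does what the priority queue needs, the small snippet below (not part of the original walkthrough) pushes a few nodes onto a heap and pops the smallest one back by frequency:

nodes = [HuffmanNode('a', 5), HuffmanNode('b', 2), HuffmanNode('c', 9)]
heapq.heapify(nodes)                 # heapq relies on HuffmanNode.__lt__ for ordering
smallest = heapq.heappop(nodes)
print(smallest.char, smallest.freq)  # prints: b 2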
2. Building the Huffman Tree
Next, we define the function to build the Huffman Tree.
def build_huffman_tree(text):
    frequency = Counter(text)
    priority_queue = [HuffmanNode(char, freq) for char, freq in frequency.items()]
    heapq.heapify(priority_queue)

    while len(priority_queue) > 1:
        left = heapq.heappop(priority_queue)
        right = heapq.heappop(priority_queue)
        merged = HuffmanNode(None, left.freq + right.freq)
        merged.left = left
        merged.right = right
        heapq.heappush(priority_queue, merged)

    return priority_queue[0]
- Frequency Calculation: Counter(text) creates a dictionary where the keys are characters and the values are their frequencies.
- Priority Queue Initialization: We create a list of HuffmanNode objects for each character-frequency pair and convert it into a heap using heapq.heapify().
- Building the Tree:
  - We repeatedly pop the two nodes with the smallest frequencies from the heap.
  - We create a new HuffmanNode with these two nodes as children and a frequency equal to their sum.
  - We push the new node back into the heap.
  - This process continues until there is only one node left in the heap, which is the root of the Huffman Tree.
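As a small sanity check (using the same hypothetical string "aaabbc" as earlier), the root returned by this function is an internal node whose frequency equals the total number of characters:

root = build_huffman_tree("aaabbc")
print(root.freq)    # 6 -- the length of the input
print(root.char)    # None -- the root is an internal node, not a leaf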
3. Assigning Codes to Characters
Now, we define the function to assign Huffman codes to each character by traversing the tree.
def assign_codes(node, current_code, codes):
    if node is None:
        return
    if node.char is not None:
        codes[node.char] = current_code
    assign_codes(node.left, current_code + '0', codes)
    assign_codes(node.right, current_code + '1', codes)
- Base Case: If the current node is None, we return.
- Leaf Node: If the current node has a character (node.char is not None), we assign the current code to this character.
- Recursive Case: We recursively call the function for the left and right children, appending '0' for the left edge and '1' for the right edge.
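Continuing the "aaabbc" example, the traversal yields one code per character. The exact 0/1 labels depend on how the heap breaks ties while building the tree, but the most frequent character always receives the shortest code:

codes = {}
assign_codes(build_huffman_tree("aaabbc"), '', codes)
print(codes)    # e.g. {'a': '0', 'c': '10', 'b': '11'} -- 'a' gets the shortest code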
4. Encoding the Text
Finally, we define the function to encode the input text using the Huffman Tree.
def huffman_encoding(text):
    root = build_huffman_tree(text)
    codes = {}
    assign_codes(root, '', codes)
    encoded_text = ''.join(codes[char] for char in text)
    return encoded_text, codes
- Build the Huffman Tree: We call build_huffman_tree to get the root of the tree.
- Assign Codes: We call assign_codes to get the Huffman codes for each character.
- Encode the Text: We create the encoded text by replacing each character in the input text with its corresponding Huffman code.
- Return: We return the encoded text and the dictionary of codes.
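One edge case worth noting (not covered in the code above): if the input contains only one distinct character, the tree is a single leaf, so assign_codes gives it the empty code and the encoded text comes out empty. A practical implementation would special-case this, for example by assigning that lone character the code '0':

encoded, codes = huffman_encoding("aaaa")
print(repr(encoded), codes)   # '' {'a': ''} -- the lone character gets an empty code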
Example Usage
Here's how you can use the above functions in a complete program:
if __name__ == "__main__":
    text = "huffman coding example"
    encoded_text, codes = huffman_encoding(text)
    print("Encoded Text:", encoded_text)
    print("Huffman Codes:", codes)
- Input Text: "huffman coding example"
- Output:
- Encoded Text: The binary representation of the input string.
- Huffman Codes: A dictionary with characters as keys and their corresponding Huffman codes as values.
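The tutorial stops at encoding. For completeness, here is a minimal decoding sketch that is not part of the original code: it inverts the codes dictionary returned by huffman_encoding and reads the bit string greedily, which works because Huffman codes are prefix-free (it assumes the input has at least two distinct characters, so no code is empty):

def huffman_decoding(encoded_text, codes):
    # Invert the character -> code mapping and scan the bits left to right.
    reverse = {code: char for char, code in codes.items()}
    decoded, current = [], ''
    for bit in encoded_text:
        current += bit
        if current in reverse:          # prefix-free codes: the first match is correct
            decoded.append(reverse[current])
            current = ''
    return ''.join(decoded)

text = "huffman coding example"
encoded_text, codes = huffman_encoding(text)
print(huffman_decoding(encoded_text, codes) == text)   # expected: True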
Frequently Asked Questions
What is the Huffman coding algorithm?
Huffman coding is a technique used for data compression. It assigns variable-length codes to input characters, with shorter codes assigned to more frequent characters, to reduce the overall size of the data.
What are the two types of Huffman coding?
There are two main types of Huffman coding. In Static Huffman Coding, the code assignments are fixed and do not change during encoding. In Dynamic Huffman Coding (also known as Adaptive Huffman Coding), the code assignments adapt to the input data's frequency distribution as it is processed, which can yield more efficient encoding for varying data.
Is Huffman coding a Greedy algorithm?
Yes, Huffman coding is a Greedy algorithm. It makes locally optimal choices at each step with the hope of finding a global optimum solution.
What is the disadvantage of Huffman algorithm?
One disadvantage of Huffman coding is that it requires the sender and receiver to have the same static Huffman tree for encoding and decoding. If the trees differ, the decoding will produce incorrect results.
What is better than Huffman coding algorithm?
There isn't a single algorithm that is universally better than Huffman coding in all scenarios. Different data characteristics and requirements may lead to different compression algorithms being more suitable. Some alternatives to Huffman coding include Arithmetic coding, Lempel-Ziv algorithms (like LZ77 and LZ78), and Run-Length Encoding (RLE). Each of these has its strengths and weaknesses depending on the specific use case.