Huffman Coding (also known as Huffman Encoding) is an algorithm for lossless data compression, and it forms the basic idea behind file compression. This post covers fixed-length and variable-length encoding, uniquely decodable codes, the prefix rule, and the construction of a Huffman tree.





We already know that every character is stored as a sequence of 0's and 1's using 8 bits. This is called fixed-length encoding, as each character uses the same fixed number of bits of storage.

Given a text, how to reduce the amount of space required to store a character?

The idea is to use variable-length encoding. We can exploit the fact that some characters occur more frequently than others in a text to design an algorithm that represents the same piece of text using fewer bits. In variable-length encoding, we assign a variable number of bits to each character, depending on its frequency in the given text. So, some characters might end up taking a single bit, some might take two bits, some might be encoded using three bits, and so on. The problem with variable-length encoding lies in its decoding.

Given a sequence of bits, how to decode it uniquely?

Let's consider the string “aabacdab”. It has 8 characters and takes 64 bits of storage with fixed-length encoding. Note that the frequencies of characters ‘a’, ‘b’, ‘c’, and ‘d’ are 4, 2, 1, and 1, respectively. Let's try to represent “aabacdab” using fewer bits by exploiting the fact that ‘a’ occurs more frequently than ‘b’, and ‘b’ occurs more frequently than ‘c’ and ‘d’. We start by arbitrarily assigning the single-bit code 0 to ‘a’, the 2-bit code 11 to ‘b’, and the 3-bit codes 100 and 011 to characters ‘c’ and ‘d’, respectively.

a 0

b 11

c 100

d 011





So the string aabacdab will be encoded to 00110100011011 (0|0|11|0|100|011|0|11) using the above codes. But the real problem lies in decoding. If we try to decode the string 00110100011011, it leads to ambiguity, as it can also be decoded to,

0|011|0|100|011|0|11 adacdab

0|0|11|0|100|0|11|011 aabacabd

0|011|0|100|0|11|0|11 adacabab

..

and so on





To prevent such ambiguities in decoding, we will ensure that our encoding satisfies what's called the prefix rule, which results in uniquely decodable codes. The prefix rule states that no code is a prefix of another code. By code, we mean the bits used to represent a particular character. In the above example, 0 is a prefix of 011, which violates the prefix rule. So if our codes satisfy the prefix rule, the decoding will be unambiguous (and vice versa).

Let's consider the above example again. This time we assign codes that satisfy the prefix rule to characters ‘a’, ‘b’, ‘c’, and ‘d’.

a 0

b 10

c 110

d 111





Using the above codes, the string aabacdab will be encoded to 00100110111010 (0|0|10|0|110|111|0|10). Now we can uniquely decode 00100110111010 back to our original string aabacdab.

Huffman Coding –

Now that we are clear on variable-length encoding and the prefix rule, let's talk about Huffman coding.

The technique works by creating a binary tree of nodes. A node can be either a leaf node or an internal node. Initially, all nodes are leaf nodes, each containing a character and its weight (frequency of appearance). Internal nodes contain a weight and links to two child nodes. As a common convention, bit ‘0’ represents following the left child and bit ‘1’ represents following the right child. A finished tree for n characters has n leaf nodes and n-1 internal nodes. The Huffman tree should be built only over the characters that actually occur in the text, so that unused characters do not inflate the code lengths.



We will use a priority queue to build the Huffman tree, where the node with the lowest frequency has the highest priority. Below are the complete steps –



1. Create a leaf node for each character and add all of them to the priority queue.

2. While there is more than one node in the queue:

   - Remove the two nodes of highest priority (lowest frequency) from the queue.

   - Create a new internal node with these two nodes as children and with frequency equal to the sum of the two nodes' frequencies.

   - Add the new node to the priority queue.

3. The remaining node is the root node, and the tree is complete.





Consider some text consisting of only the characters 'A', 'B', 'C', 'D', and 'E', with frequencies 15, 7, 6, 6, and 5, respectively. The figures below illustrate the steps followed by the algorithm –



The path from the root to any leaf node stores the optimal prefix code (also called a Huffman code) corresponding to the character associated with that leaf node.



Below are the C++ and Java implementations of the Huffman coding compression algorithm:

C++

#include <iostream>
#include <string>
#include <queue>
#include <unordered_map>
using namespace std;

// A Tree node
struct Node
{
    char ch;
    int freq;
    Node *left, *right;
};

// Function to allocate a new tree node
Node* getNode(char ch, int freq, Node* left, Node* right)
{
    Node* node = new Node();
    node->ch = ch;
    node->freq = freq;
    node->left = left;
    node->right = right;
    return node;
}

// Comparison object to be used to order the heap
struct comp
{
    bool operator()(Node* l, Node* r)
    {
        // the highest priority item has the lowest frequency
        return l->freq > r->freq;
    }
};

// traverse the Huffman Tree and store Huffman Codes in a map
void encode(Node* root, string str, unordered_map<char, string> &huffmanCode)
{
    if (root == nullptr)
        return;

    // found a leaf node
    if (!root->left && !root->right) {
        huffmanCode[root->ch] = str;
    }

    encode(root->left, str + "0", huffmanCode);
    encode(root->right, str + "1", huffmanCode);
}

// traverse the Huffman Tree and decode the encoded string
void decode(Node* root, int &index, string str)
{
    if (root == nullptr) {
        return;
    }

    // found a leaf node
    if (!root->left && !root->right)
    {
        cout << root->ch;
        return;
    }

    index++;

    if (str[index] == '0')
        decode(root->left, index, str);
    else
        decode(root->right, index, str);
}

// Builds Huffman Tree and decodes the given input text
void buildHuffmanTree(string text)
{
    // count the frequency of appearance of each character
    // and store it in a map
    unordered_map<char, int> freq;
    for (char ch: text) {
        freq[ch]++;
    }

    // Create a priority queue to store live nodes of the Huffman tree
    priority_queue<Node*, vector<Node*>, comp> pq;

    // Create a leaf node for each character and add it
    // to the priority queue.
    for (auto pair: freq) {
        pq.push(getNode(pair.first, pair.second, nullptr, nullptr));
    }

    // do till there is more than one node in the queue
    while (pq.size() != 1)
    {
        // Remove the two nodes of highest priority
        // (lowest frequency) from the queue
        Node *left = pq.top(); pq.pop();
        Node *right = pq.top(); pq.pop();

        // Create a new internal node with these two nodes as children
        // and with frequency equal to the sum of the two nodes'
        // frequencies. Add the new node to the priority queue.
        int sum = left->freq + right->freq;
        pq.push(getNode('\0', sum, left, right));
    }

    // root stores pointer to the root of the Huffman Tree
    Node* root = pq.top();

    // traverse the Huffman Tree and store Huffman Codes
    // in a map. Also print them
    unordered_map<char, string> huffmanCode;
    encode(root, "", huffmanCode);

    cout << "Huffman Codes are :\n" << '\n';
    for (auto pair: huffmanCode) {
        cout << pair.first << " " << pair.second << '\n';
    }

    cout << "\nOriginal string was :\n" << text << '\n';

    // print the encoded string
    string str = "";
    for (char ch: text) {
        str += huffmanCode[ch];
    }

    cout << "\nEncoded string is :\n" << str << '\n';

    // traverse the Huffman Tree again and this time
    // decode the encoded string
    int index = -1;
    cout << "\nDecoded string is:\n";
    while (index < (int)str.size() - 1) {
        decode(root, index, str);
    }
}

// Huffman coding algorithm
int main()
{
    string text = "Huffman coding is a data compression algorithm.";
    buildHuffmanTree(text);
    return 0;
}

Output:



Huffman Codes are :



h 111110

f 11110

i 1110

t 11011

l 110100

o 1100

n 1011

r 10101

d 0010

g 0001

H 00001

u 00000

c 0011

a 010

e 110101

011

m 1000

. 111111

s 1001

p 10100



Original string was :

Huffman coding is a data compression algorithm.



Encoded string is :

00001000001111011110100001010110110011110000101110101100010111110100101101001100100101101101001100111100100010100101011101011001100111101100101101101011010000011100101011110110111111101000111111



Decoded string is:

Huffman coding is a data compression algorithm.

Java

import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// A Tree node
class Node
{
    char ch;
    int freq;
    Node left = null, right = null;

    Node(char ch, int freq)
    {
        this.ch = ch;
        this.freq = freq;
    }

    public Node(char ch, int freq, Node left, Node right)
    {
        this.ch = ch;
        this.freq = freq;
        this.left = left;
        this.right = right;
    }
}

class Main
{
    // traverse the Huffman Tree and store Huffman Codes in a map
    public static void encode(Node root, String str,
                              Map<Character, String> huffmanCode)
    {
        if (root == null) {
            return;
        }

        // found a leaf node
        if (root.left == null && root.right == null) {
            huffmanCode.put(root.ch, str);
        }

        encode(root.left, str + '0', huffmanCode);
        encode(root.right, str + '1', huffmanCode);
    }

    // traverse the Huffman Tree and decode the encoded string
    public static int decode(Node root, int index, StringBuilder sb)
    {
        if (root == null) {
            return index;
        }

        // found a leaf node
        if (root.left == null && root.right == null)
        {
            System.out.print(root.ch);
            return index;
        }

        index++;

        if (sb.charAt(index) == '0')
            index = decode(root.left, index, sb);
        else
            index = decode(root.right, index, sb);

        return index;
    }

    // Builds Huffman Tree and huffmanCode and decodes the given input text
    public static void buildHuffmanTree(String text)
    {
        // count the frequency of appearance of each character
        // and store it in a map
        Map<Character, Integer> freq = new HashMap<>();
        for (char c: text.toCharArray()) {
            freq.put(c, freq.getOrDefault(c, 0) + 1);
        }

        // Create a priority queue to store live nodes of the Huffman tree.
        // Notice that the highest priority item has the lowest frequency.
        PriorityQueue<Node> pq;
        pq = new PriorityQueue<>(Comparator.comparingInt(l -> l.freq));

        // Create a leaf node for each character and add it
        // to the priority queue.
        for (var entry: freq.entrySet()) {
            pq.add(new Node(entry.getKey(), entry.getValue()));
        }

        // do till there is more than one node in the queue
        while (pq.size() != 1)
        {
            // Remove the two nodes of highest priority
            // (lowest frequency) from the queue
            Node left = pq.poll();
            Node right = pq.poll();

            // Create a new internal node with these two nodes as children
            // and with frequency equal to the sum of the two nodes'
            // frequencies. Add the new node to the priority queue.
            int sum = left.freq + right.freq;
            pq.add(new Node('\0', sum, left, right));
        }

        // root stores pointer to the root of the Huffman Tree
        Node root = pq.peek();

        // traverse the Huffman tree and store the Huffman codes in a map
        Map<Character, String> huffmanCode = new HashMap<>();
        encode(root, "", huffmanCode);

        // print the Huffman codes
        System.out.println("Huffman Codes are : " + huffmanCode);
        System.out.println("Original string was : " + text);

        // print the encoded string
        StringBuilder sb = new StringBuilder();
        for (char c: text.toCharArray()) {
            sb.append(huffmanCode.get(c));
        }

        System.out.println("Encoded string is : " + sb);

        // traverse the Huffman Tree again and this time
        // decode the encoded string
        int index = -1;
        System.out.print("Decoded string is: ");
        while (index < sb.length() - 1) {
            index = decode(root, index, sb);
        }
    }

    public static void main(String[] args)
    {
        String text = "Huffman coding is a data compression algorithm.";
        buildHuffmanTree(text);
    }
}

Output:



Huffman Codes are : { =100, a=010, c=0011, d=11001, e=110000, f=0000, g=0001, H=110001, h=110100, i=1111, l=101010, m=0110, n=0111, .=10100, o=1110, p=110101, r=0010, s=1011, t=11011, u=101011}



Original string was : Huffman coding is a data compression algorithm.



Encoded string is : 11000110101100000000011001001111000011111011001111101110001100111110111000101001100101011011010100001111100110110101001011000010111011111111100111100010101010000111100010111111011110100011010100



Decoded string is: Huffman coding is a data compression algorithm.







Note that the input string takes 47 × 8 = 376 bits of storage, but our encoded string takes only 194 bits, a saving of about 48%. To keep the programs readable, we have used the string class to store the encoded string in the above programs.

Since an efficient priority queue requires O(log n) time per insertion, and a full binary tree with n leaves has 2n-1 nodes, this algorithm operates in O(n log n) time, where n is the number of distinct characters.



References:

https://en.wikipedia.org/wiki/Huffman_coding

https://en.wikipedia.org/wiki/Variable-length_code

Dr. Naveen Garg, IIT-D (Lecture – 19 Data Compression)










Thanks for reading.

