Self-similar Huffman Trees with Extreme Guessing Properties

John Pliam
1998

1. Huffman Trees and Entropy

Any binary tree can be thought of as a prefix code, whose codewords are taken to be the bit-strings corresponding to paths from the root to the leaves of the tree. Since the proper prefixes of codewords correspond to vertices lying strictly within the tree, any concatenation of codewords can be uniquely converted back into a sequence of leaves. A Huffman tree is one whose codewords are of nondecreasing length when arranged lexicographically (with 0 < 1), and whose two longest codewords are of equal length. Below is an example of a Huffman tree.
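To make unique decodability concrete, here is a minimal sketch in Python. The codewords {0, 10, 110, 111} are an assumed example (the tree in the figure may differ); prefix-freeness is what makes the greedy match below unambiguous.

    codewords = {"0": "a", "10": "b", "110": "c", "111": "d"}  # assumed example tree

    def decode(bits):
        """Greedily match codewords; since no codeword is a proper
        prefix of another, each match is forced."""
        leaves, buf = [], ""
        for bit in bits:
            buf += bit
            if buf in codewords:
                leaves.append(codewords[buf])
                buf = ""
        if buf:
            raise ValueError("trailing bits do not form a codeword")
        return leaves

    print(decode("0101100"))  # ['a', 'b', 'c', 'a']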

The entropy of a Huffman tree can be defined as the average codeword length weighted by the coefficients p_w = 2^{-|w|}, one for each codeword w; that is, H = sum_w p_w |w|. It turns out [1] that the entropy of a Huffman tree measures the optimal compression of a sequence of random objects drawn according to the probabilities p_w: the average number of bits per leaf is smallest when each leaf is represented by its codeword.
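As a sketch of this definition (again on the assumed example tree), the weights p_w automatically sum to 1 on any Huffman tree, and the weighted average length agrees with the usual Shannon entropy -sum p_w log2 p_w, since |w| = -log2 p_w:

    from math import log2

    codewords = ["0", "10", "110", "111"]        # assumed example tree
    weights = [2.0 ** -len(w) for w in codewords]
    assert abs(sum(weights) - 1.0) < 1e-12       # Kraft equality

    H = sum(p * len(w) for p, w in zip(weights, codewords))
    assert abs(H + sum(p * log2(p) for p in weights)) < 1e-12
    print(H)  # 1.75 bits per leaf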

2. A Leaf Guessing Game

The following situation comes up in cryptography. Suppose you are playing a kind of shell game where an object is hidden, with probability p_w, at one of the leaves of a Huffman tree. The carny agrees to lift any shell that you point to, but only gives you 2^H guesses, where H is the tree's entropy (rounded up to the nearest integer). It is clearly in your best interest to start at the closest leaf to the root and work your way outward. What is the chance of winning according to this optimal guessing strategy?
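A minimal sketch of the strategy, under the same assumed example tree: sort the leaves by probability (equivalently, by depth) and spend the 2^H guesses, with H rounded up, on the most probable ones.

    from math import ceil

    codewords = ["0", "10", "110", "111"]        # assumed example tree
    probs = [2.0 ** -len(w) for w in codewords]
    H = sum(p * len(w) for p, w in zip(probs, codewords))

    guesses = 2 ** ceil(H)                       # the carny's allowance
    best_first = sorted(probs, reverse=True)     # closest to the root first
    p_win = sum(best_first[:guesses])
    print(guesses, p_win)  # 4 1.0 -- this small tree is swept entirely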

The entropy H of our tree is the average distance from root to leaf, while a complete binary tree with the same average depth has 2^H leaves. Thus it is intuitively appealing to assume that the number of leaves swept out by our optimal guessing strategy is an appreciable fraction of the total, and since we are picking the 2^H most probable leaves, it is tempting to assume that we would have at least even odds of winning the game.

This document identifies Huffman trees in which the chance of beating the carny not only falls below even odds, but in fact goes to zero in the limiting case.

3. Self-similar Huffman Trees

Consider a family of Huffman trees t_{j,k} defined as follows. The maximum depth of t_{j,k} is jk, and there is one codeword of each length in {j, 2j, ..., (k-1)j}. All other codewords have the constant length jk. This uniquely defines a Huffman tree. Below is a diagram of t_{2,3}.
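Here is a sketch of the resulting length profile, with the number of depth-jk codewords forced by the Kraft equality sum_w 2^{-|w|} = 1 (the function name tjk_lengths is mine, not the paper's):

    def tjk_lengths(j, k):
        """Codeword lengths of t_{j,k}: one codeword of each length
        j, 2j, ..., (k-1)j, plus enough codewords of length jk to
        bring the Kraft sum up to exactly 1."""
        short = [i * j for i in range(1, k)]
        deep_mass = 1.0 - sum(2.0 ** -l for l in short)
        n_deep = round(deep_mass * 2 ** (j * k))
        return short + [j * k] * n_deep

    lengths = tjk_lengths(2, 3)
    assert abs(sum(2.0 ** -l for l in lengths) - 1.0) < 1e-9
    print(lengths)  # t_{2,3}: [2, 4] followed by 44 codewords of length 6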

Notice that there are subtrees of t_{2,3} which are isomorphic to t_{2,2} and t_{2,1}. Furthermore, notice the zig-zag pattern formed in the left half-tree by starting from the root and alternating between the proper prefixes of the short codewords and the roots of the subtrees isomorphic to t_{j,i}, for i < k. These observations characterize the self-similarity of these trees. In fact, for a fixed j, the family of trees t_{j,k} can be thought of as truncations of an infinite tree T_j to a depth jk. More of T_2 is shown below.

Part of T_3 is shown below.

4. The Futility of Leaf Guessing on Self-similar Huffman Trees

As it happens, the chance of winning the leaf guessing game is completely up to the carny. This notion is formalized in the following theorem taken from [2] or [3].

Theorem  As k approaches infinity, the probability of winning the leaf guessing game on t_{j,k} approaches

    1/(2^j - 1) = 2^{-j} + 2^{-2j} + 2^{-3j} + ...
Notice that this probability is slightly more than the chance 2^{-j} of guessing correctly on the very first try. Clearly, as both j and k become large, your chance of winning this shell game approaches zero.
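Below is a numerical sanity check of the theorem (my own sketch; the proof is in [2, 3]). For fixed j, the success chance of the optimal strategy on t_{j,k} drifts down toward 1/(2^j - 1) as k grows:

    from math import ceil

    def win_prob(j, k):
        """Success chance of the best 2^ceil(H) guesses on t_{j,k}."""
        short = [2.0 ** -(i * j) for i in range(1, k)]   # one leaf per depth ij
        deep_mass = 1.0 - sum(short)                     # mass left at depth jk
        n_deep = round(deep_mass * 2 ** (j * k))         # count of deepest leaves
        H = sum(i * j * p for i, p in enumerate(short, 1)) + deep_mass * j * k
        guesses = 2 ** ceil(H)
        deep_hits = min(max(guesses - len(short), 0), n_deep)
        return sum(short) + deep_hits * 2.0 ** -(j * k)

    for k in (2, 6, 10, 14, 18):
        print(k, win_prob(3, k))
    print("limit:", 1 / (2 ** 3 - 1))  # 0.142857...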

If we trust our instincts that the total probability of the 2^H most likely events is about 50%, the carny can trick us into playing a game in which we literally have one chance in a billion of winning.
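For a sense of scale, take j = 30 (a value of my choosing, consistent with the figure of one in a billion): the limiting success probability is 1/(2^30 - 1) = 1/1073741823, or about 9.3 x 10^{-10}.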

References

[1] Thomas M. Cover and Joy A. Thomas, Elements of Information Theory, New York: John Wiley and Sons, 1991.

[2] John O. Pliam, Ciphers and their Products: Group Theory in Private Key Cryptography, PhD thesis, University of Minnesota, to appear 1999. See also [3].

[3] John O. Pliam, "The Disparity between Work and Entropy in Cryptology," 1998.