I started this project wanting to understand how LLMs process text. The first step is tokenization: splitting text into chunks. Interestingly, there are 2^(n−1) ways to split a string of length n into contiguous chunks (each of the n−1 gaps between characters is either a split point or not). That means there are 2^99 ways to split a 100-character string, which is roughly 6.3 × 10^29, or written out: 633,825,300,114,114,700,748,351,602,688. Obviously, you can’t try them all. So BPE (byte pair encoding) uses a greedy algorithm: it repeatedly merges the most frequent adjacent pair. From what I read, BPE is basically compression: frequent patterns get short representations, and rare stuff gets broken into pieces.
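To make the greedy idea concrete, here is a minimal sketch of BPE-style merging in Python (standard library only; the function names are mine, not from any particular tokenizer):

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent pairs and return the most common one."""
    return Counter(zip(tokens, tokens[1:])).most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("banana banana")            # start from individual characters
for _ in range(3):                        # three greedy merge rounds
    tokens = merge_pair(tokens, most_frequent_pair(tokens))
print(tokens)                             # ['banan', 'a', ' ', 'banan', 'a']
```

After a few rounds, frequent character sequences have fused into single tokens, which is exactly the compression intuition above.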
Then… I went down a rabbit hole researching compression. Compression has the same structure: there are exponentially many ways to encode something. How do we find the best one? This is where Huffman coding comes in.
An Overview
The goal of compression is simple: you have data, and you make it smaller. There are two categories. First, we have lossless compression: the original data can be perfectly recovered, and nothing is lost in the process. The alternative is lossy compression: some information is discarded to achieve smaller sizes, which is useful for videos and images where small imperfections don’t really matter. In this post, we will focus on the former.
There are several methods to do lossless compression.
We can find repeated patterns. Let our string be: “the exit sign was blinking red on the way out”. The word “the” appears twice. Instead of writing it out both times, you can store it once and point back to it. The compressed version would roughly be: “the exit sign was blinking red on ” + [go back 34 characters, copy 4] + “way out”. This is dictionary compression. In other words:
“[dictionary coder] operate[s] by searching for matches between the text to be compressed and a set of strings contained in a data structure (called the ‘dictionary’) maintained by the encoder. When the encoder finds such a match, it substitutes a reference to the string’s position in the data structure.”1
This works well with code, text, and any data with a lot of repetition. It struggles when there is a lot of random data.
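Here is a rough sketch of the idea: an intentionally naive LZ77-style matcher in Python. Real dictionary coders use sliding windows and hash tables, but this produces the same “go back, copy” token described above. The helper name and token format are mine.

```python
def lz_encode(text, min_match=3):
    """Naive LZ77-style encoder: emit literal characters, or
    ("copy", offset, length) back-references into text we've already seen."""
    out, i = [], 0
    while i < len(text):
        best_len, best_off = 0, 0
        for start in range(i):                  # search everything already seen
            length = 0
            while (start + length < i           # keep the copy inside seen text
                   and i + length < len(text)
                   and text[start + length] == text[i + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, i - start
        if best_len >= min_match:
            out.append(("copy", best_off, best_len))   # go back best_off, copy best_len
            i += best_len
        else:
            out.append(("lit", text[i]))
            i += 1
    return out

tokens = lz_encode("the exit sign was blinking red on the way out")
print([t for t in tokens if t[0] == "copy"])    # [('copy', 34, 4)]  i.e. "the "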
We could also use probability. In English, different symbols appear with different frequencies; the letter “e”, for example, shows up roughly 170 times more often than “z” in typical text. Entropy coding gives common symbols shorter codes and rare symbols longer codes. There are several entropy coding methods. For prefix-free, symbol-by-symbol codes with known frequencies, Huffman coding is mathematically optimal. We will prove this using induction later.
Computers Need Bits
Computers can only understand bits. So, we need a system to convert letters to a sequence of 0s and 1s.
We can do the fixed-length approach. Say we have A, B, C, and D, and we need to figure out how many bits are needed to distinguish these 4 letters. The most intuitive approach is to use 2 bits for all letters (since 1 bit would only allow 2 options: 0 and 1). So, A = 00, B = 01, C = 10, and D = 11. This, unfortunately, is wasteful.
The alternative, as briefly mentioned above, is to give frequent symbols short codes and rare symbols long codes. For example, A = 0, B = 10, C = 110, D = 111. With a suitably skewed frequency distribution, this gives roughly a 26% reduction in size compared to the fixed-length approach. The difficulty is that you can’t pick arbitrary variable-length codes. For instance, if A were assigned 0 and B were assigned 01, decoding 001 becomes ambiguous: is it A, A, then something else? Or A, B? Here, the code for A is a prefix of the code for B.
And that is why Huffman coding is prefix-free. In other words, no code can be the start of another code. If A is assigned 0, then the first bit of B cannot be 0.
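A quick sketch of why prefix-freeness makes decoding trivial, using the A/B/C/D code above (the helper names are mine):

```python
CODE = {"A": "0", "B": "10", "C": "110", "D": "111"}   # the prefix-free code above
DECODE = {bits: sym for sym, bits in CODE.items()}

def decode(bitstring):
    """Read bits left to right; emit a symbol as soon as the buffer
    matches a codeword. Prefix-freeness makes this unambiguous."""
    out, buf = [], ""
    for bit in bitstring:
        buf += bit
        if buf in DECODE:
            out.append(DECODE[buf])
            buf = ""
    assert buf == "", "bitstring did not end on a codeword boundary"
    return "".join(out)

print(decode("0101100111"))   # ABCAD
```

The decoder never has to look ahead or backtrack, because no codeword can be mistaken for the start of another.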
This is where it gets pretty! Prefix-free codes are the same thing as binary trees. Excuse my laziness, but I found it to be extremely difficult to insert beautiful tree graphics into this post, so I will do my best to describe them verbally.
A binary tree is a structure where each node has at most two children (a left child and a right child). We place the symbols (A, B, C, D) at the leaves (nodes with no children). The path from the top node (root) to a leaf defines the code: Left Edge = 0, Right Edge = 1.
The key property of the tree is that it guarantees the prefix-free condition. Because symbols live only at leaves, the path from the root to a symbol ends there, so no symbol’s code can continue on and become the start of another symbol’s code. Our goal, then, is to build the binary tree that minimizes the total size of the compressed message.
Optimization
It may be difficult to arrive at this conclusion through my subpar description of binary trees, but we can actually express the total number of bits in terms of the tree itself. In particular:

Total bits = Σ over all symbols of (frequency of the symbol × depth of its leaf)

Frequency, by the way, is the number of times the symbol appears; depth is the distance of its leaf from the root (top node) of the tree. The equation basically means: deeper leaves = longer codes = more bits. Now, recall the rule: give common symbols shorter codes and rare symbols longer codes. Applied to trees, that means placing frequent symbols shallower and rare symbols deeper. This becomes the problem. We need to find the optimal tree.
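For example, with made-up counts of A: 5, B: 2, C: 1, D: 1 and the code above (depths 1, 2, 3, 3), the total is 5×1 + 2×2 + 1×3 + 1×3 = 15 bits, versus 9 × 2 = 18 bits with the fixed-length code.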
Too Many Trees
Unfortunately, there are so many trees.
Let T(n) be the number of trees with n leaves:
- for n = 1, there is 1 possible tree since there is only 1 node.
- for n = 2, you must have 3 nodes, and again, there is only 1 possible tree.
- for n = 3, you have 5 nodes, and you can choose to continue the tree from the 2nd or 3rd node. Thus, you have 2 possible trees.
- for n = 4, you have 7 nodes, and you have the option, like n = 3, to continue from the 2nd or 3rd node, AND the option to continue from the 4th or 5th node. Additionally, you can continue from both the 2nd and 3rd node. Together, you have 5 trees.
- for n = 5, by drawing the same trees, you will discover that there are 14 trees.
This is a recursion! A tree with n leaves has a root (top node), a left subtree with i leaves, and thus a right subtree with n−i leaves. This means the total number of trees with n leaves can be split into cases based on how many leaves go left.
- for i = 1, you have 1 leaf on the left and n−1 on the right. Since there are T(1) ways to build the left subtree and T(n−1) ways to build the right subtree, by the multiplication principle the total number of trees in this case is T(1) × T(n−1).
- for i = 2, you have 2 leaves on the left and n−2 on the right. Similarly, the total is T(2) × T(n−2).
- for i = 3, you have 3 leaves on the left and n−3 on the right. Similarly, the total is T(3) × T(n−3).
- for i = n−1, you have n−1 leaves on the left and 1 on the right. That’s T(n−1) × T(1) trees.
To know T(n), we must add all the cases up:

T(n) = T(1)·T(n−1) + T(2)·T(n−2) + … + T(n−1)·T(1)
I can get you a closed-form equation, but it’s getting late, so…
We literally cannot try every single tree. In fact, by the time we get to n = 20 leaves, the number of trees is approximately 1.8 billion. When n gets large, it is almost impossible to find the optimal tree by drawing each one.
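If you would rather not draw 14 trees by hand, here is a tiny sketch of the recursion above in Python (T(n) is the same function as in the list):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def T(n):
    """Number of tree shapes with n leaves, using the recursion above."""
    if n == 1:
        return 1
    # split the n leaves between the left subtree (i) and the right (n - i)
    return sum(T(i) * T(n - i) for i in range(1, n))

print([T(n) for n in range(1, 6)])   # [1, 1, 2, 5, 14]
print(T(20))                         # 1767263190, roughly 1.8 billion
```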
Huffman’s Algorithm
In 1952, David Huffman showed how to build the optimal tree: take the two nodes with the smallest weights (the lowest frequencies) and merge them into a new node whose weight is their sum. Repeat this process until a single node, the root, remains.
Let us follow his proof by induction:
Base case (n = 2): there are two leaves and only one possible tree, so it is trivially optimal. Inductive hypothesis: assume Huffman’s algorithm is optimal for any set of fewer than n symbols.
Inductive step: we have n symbols with frequencies f1 ≤ f2 ≤ … ≤ fn. It can be shown (by a standard exchange argument) that in some optimal tree, the two lowest-frequency symbols are siblings at maximum depth. Huffman’s algorithm strictly follows this rule: it merges the two smallest symbols, with frequencies f1 and f2, into a new “super-node” z with weight f1 + f2. This leaves a reduced problem with n−1 symbols (we removed the two smallest symbols and added the single node z). By the inductive hypothesis, Huffman’s algorithm produces an optimal tree for these n−1 symbols. Expanding z back into its two children adds exactly f1 + f2 bits to the total, which is the minimum any prefix-free code must pay for those two symbols, so the resulting tree is optimal for the original n symbols.
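For completeness, here is a minimal sketch of the merge procedure in Python, using the standard library’s heapq and Counter. The tie-breaking integer is only there so that equal weights compare cleanly; this is an illustration, not a production encoder.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code by repeatedly merging the two lowest-weight
    nodes, exactly as in the proof above. Returns {symbol: bitstring}."""
    heap = [(freq, i, sym) for i, (sym, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    if len(heap) == 1:                           # degenerate case: one distinct symbol
        return {heap[0][2]: "0"}
    tie = len(heap)                              # tie-breaker so tuples always compare
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)        # the two smallest weights...
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, tie, (left, right)))   # ...become siblings
        tie += 1
    _, _, tree = heap[0]
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):              # internal node: left = 0, right = 1
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                    # leaf: record the finished code
            codes[node] = prefix
    walk(tree, "")
    return codes

print(huffman_codes("aaaabbc"))   # e.g. {'c': '00', 'b': '01', 'a': '1'}
```

The most frequent symbol ends up closest to the root, exactly as the frequency × depth argument says it should.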
Beyond Huffman
Huffman is optimal under the restriction that each symbol gets a whole number of bits. What if, instead of encoding symbols one at a time, we encode the entire message as a single number between 0 and 1? This is arithmetic coding, where the length of the encoded message is (up to a couple of bits of overhead):

bits ≈ Σ over symbols of −log2 P(next symbol | everything seen so far)
Without diving too deep into the math, that formula reveals a lot: the number of bits depends entirely on prediction. The higher the probability you assign to the next symbol, the fewer bits you need. In other words: good prediction = good compression.
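To see the point in numbers, here is a toy illustration in Python. The “model” is just a made-up table of probabilities standing in for a predictor, and ideal_bits is the formula above.

```python
import math

def ideal_bits(message, model):
    """Bits an arithmetic coder needs, ignoring a few bits of overhead:
    the sum of -log2(probability the model assigned to each symbol)."""
    return sum(-math.log2(model[ch]) for ch in message)

message = "aaaaaaab"
uniform = {"a": 0.5, "b": 0.5}        # a model that predicts nothing
skewed  = {"a": 7 / 8, "b": 1 / 8}    # a model that matches the data

print(ideal_bits(message, uniform))   # 8.0 bits
print(ideal_bits(message, skewed))    # about 4.35 bits: better prediction, fewer bits
```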
Large Language Models, interestingly, have become state-of-the-art text compressors. Delétang et al. (2024) demonstrated that LLMs used as predictors can compress text better than traditional compressors like gzip and bzip2. Bellard’s NNCP uses neural networks to compress enwik9 better than any classical algorithm. The LLM weights encode patterns about language: grammar, facts, style, reasoning. Better patterns = better predictions = fewer bits.
Compression as an Intelligence Test
In 1999, Matthew Mahoney proposed that we use compression as a test for artificial intelligence. Consider the Turing test: a machine wins if the judge cannot distinguish it from a human. Mahoney showed mathematically that
“The machine can win with a suboptimal solution (M ≠ L) by deterministically favoring the most likely responses over the true distribution.”
In other words, a machine does not need to match human intelligence to win; it just needs to match the judge’s expectations of human intelligence. Compression, Mahoney concludes, cannot be fooled in this way.
“Compression ratio on a standard benchmark corpus could be used as an objective and quantitative alternative test for AI.”
In 2006, Marcus Hutter launched the Hutter Prize, now worth 500,000 euros, for the algorithm that best compresses a snapshot of English Wikipedia. Interestingly, the top-performing entries combine learned prediction models with lossless entropy coders such as arithmetic coding and ANS. In terms of text compression, Claude Shannon estimated the entropy of English text (the minimum bits needed) at around 1 bit per character, with a range of 0.6 to 1.3 bpc for human prediction. Today, the best compressors achieve around 1.0 bits per character. Thus, there is a convincing argument that machines are approaching human-level compression of text.
In 2024, Huang et al. tested this. They measured compression performance across 31 LLMs and compared it to benchmark performances. They found that compression and intelligence correlate linearly. Better compression = better benchmark scores. Across all models. Across all benchmarks. Compression predicted benchmark performance better than model size, training data, or architecture.
If you are skeptical, we are in the same boat. If machines can compress text at near (or equal to) human level, why are they still widely considered “dumber” than humans? Where is Artificial General Intelligence, then?
I believe the answer is that we have been too narrow in scope. Mahoney used text; Hutter uses Wikipedia (text); Huang et al. used ‘external text corpora’.
Multimodal Compression
Consider the phenomenological ‘being-in-the-world’ (excuse my subpar understanding of philosophy if this is a stretch). Humans don’t just process words. A child spends years compressing experience: reaching, grasping, falling, bumping, crying, before language even begins. Words come later, as labels for concepts already grounded in sensation. Text is humanity’s compression of experience. An LLM that compresses text well has learned to compress the compression. It models how humans describe reality. But it has never touched an object, heard a sound, or moved through space. Thus, the compression test, as currently implemented, measures language intelligence, not general intelligence.
Intelligence, then, should be measured with some aggregate of C_A(M) over all modalities M,
where C_A(M) is A’s compression performance on modality M, and M ranges over the set of all modalities. By modalities, I mean text, image, video, audio, 3D, etc.
Humans are not optimal at any signal modality. But we’re General and Integrated. No current AI has this. I argue that a true test of intelligence for AI must be multimodal. We need a standardized compression test for images, video, audio, and 3D. The next step, or perhaps the next blog post, should provide entropy estimates for the above modalities and establish a unified multimodal benchmark. I believe that building this would be a true and objective test of general intelligence.