Fundamentals of Compression

Friday, December 26th, 2014

Compression is the process of reducing the storage space requirement for a set of data by converting it into a more compact form than its original format. This concept of data compression is fairly old, dating at least as far back as the mid 19th century, with the invention of Morse Code.

Morse Code was created to enable operators to transmit textual messages across an electrical telegraph system using a sequence of audible pulses to convey characters of the alphabet. The inventors of the code recognized that some letters of the alphabet are used more frequently than others (e.g. E is much more common than X), and therefore decided to use shorter pulse sequences for more common characters and longer sequences for less common ones. This basic compression scheme provided a dramatic improvement to the system's overall efficiency because it enabled operators to transmit a greater number of messages in a much shorter span of time.

Although modern compression processes are significantly more complicated than Morse Code, they still rely on the same basic set of concepts, which we will review in this article. These concepts are essential to the efficient operation of our modern computerized world — everything from local and cloud storage to data streaming over the Internet relies heavily on compression and would likely be cost ineffective without it.

Compression Pipeline

The following figure depicts the general paradigm of a compression scheme. Raw input data consists of a sequence of symbols that we wish to compress, or reduce in size. These symbols are encoded by the compressor, and the result is coded data. Note that although the coded data is usually smaller than the raw input data, this is not always the case (as we'll see later).

Coded data may then be fed into a decompressor, usually at a later point in time, where it is decoded and reconstructed back into raw data in the form of an output sequence of symbols. Note that throughout this article we'll use the terms sequence and string interchangably to refer to a sequential set of symbols.

If the output data is always identical to the input, then the compression scheme is said to be lossless, and utilizes a lossless encoder. Otherwise, it is a lossy compression scheme.

Lossless compression schemes are usually employed for compression of text, executable programs, or any other type of content where identical reconstruction is required. Lossy compression schemes are useful for images, audio, video, or any content where some degree of information loss is acceptable in order to gain improved compression efficiency.

Data Modelling

Information is defined as a quantity that measures the complexity of a piece of data. The more information that a data set has, the harder it is to compress it. The notion of rarity is correlated with this concept of information because the occurrence of a rare symbol provides more information than the occurrence of a common one.

For example, the occurrence of an earthquake in Japan would provide less information than an earthquake on the moon, as earthquakes are far less common on the moon. We can expect most compression algorithms to carefully regard the frequency or probability of a symbol when deciding how to compress it.

We describe the efficiency of a compression algorithm as its effectiveness at reducing information load. A more efficient compression algorithm will reduce the size of a particular data set more than a less efficient algorithm.

Probability Models

One of the most important steps in designing a compression scheme is to create a probability model for the data. This model allows us to examine the characteristics of the data in order to efficiently fit a compression algorithm to it. Let's walk through part of the modelling process to make this a bit more clear.

Let's assume we have an alphabet, G, which consists of all possible symbols in a data set. In our example, G contains precisely four symbols: A through D.

We also have a probability density function, P, which defines the probabilities for each symbol in G actually occurring in an input string. Symbols with higher probabilities are more likely to occur in an input string than symbols with lower probabilities.

For this example we will assume that our symbols are independent and identically distributed. The presence of one symbol in a source string is expected to have no correlation to the presence of any other.

Minimum Coding Rate

B is our most common symbol, with a probability of 40%, while C is our least likely symbol, which occurs only 10% of the time. Our goal will be to devise a compression scheme that minimizes the required storage space for our common symbols, while supporting an increase in the necessary storage space for more rare symbols. This trade off is a fundamental principle of compression, and inherint in virtually all compression algorithms.

Armed with the alphabet and pdf, we can take an initial shot at defining a basic compression scheme. If we simply encoded symbols as raw 8 bit ASCII values, then our compression efficiency, known as the coding rate, would be 8 bits per symbol. Suppose we improve this scheme by recognizing that our alphabet contains only 4 symbols. If we dedicate 2 bits for each symbol, we will still be able to fully reconstruct a coded string, but require only 1/4th the amount of space.

At this point we have significantly improved our coding rate (from 8 to 2 bits per symbol), but have completely ignored our probability model. As previously suggested, we can likely improve our coding efficiency by incorporating our model, and devising a strategy that dedicates fewer bits to the more common symbols (B and D), and more bits to the less common symbols (A and C).

This brings up an important insight first described in Shannon's seminal paper — we can define the theoretical minimum required storage space for a symbol (or event) based simply on its probability. We define the minimum coding rate for a symbol as follows:

For example, if a symbol occurs 50% of the time, then it would require, at an absolute minimum, 1 bit to store it.

Entropy and Redundancy

Going further, if we compute a weighted average of the minimum coding rates for the symbols in our alphabet, we arrive at a value that is known as the Shannon entropy, or simply the entropy of the model. Entropy is defined as the minimum coding rate at which a given model may be coded. It is based on an alphabet and its probability model, as described below.

As one might expect, models with many rare symbols will have a higher entropy than models with fewer and more common symbols. Furthermore, models with higher entropy values will be harder to compress than those with lower entropy values.

In our current example, our model has an entropy of 1.85 bits per symbol. The difference between our current coding rate (2) and entropy (1.85) is known as the redundancy of our compression scheme.

Entropy is an extremely useful topic with applicability to many different subfields including encryption and artificial intelligence. A full discussion of entropy is outside the scope of this article, but interested readers can learn much more here.

Coding the Model

We've taken a slight liberty thus far, and have automatically given ourselves the probabilities of our symbols. In reality, the model may not always be readily available and we will need to derive these values either through analysis of source strings (e.g. by tallying symbol probabilities from exemplar data), or by adaptively learning the model in tandem with a compression process. In either case, the probabilities of the actual source data may not perfectly match the model, and we will lose efficiency proportional to this divergence. For this reason, it is important to derive (or constantly maintain) as accurate a model as possible.

Common Algorithms

Once the probability model is defined for a set of data, we are able to devise a compression scheme that properly leverages the model. While the process of developing a new compression algorithm is beyond the scope of this article, we can leverage existing algorithms to fit our needs. We review some of the most popular algorithms below.

Each of the following algorithms is a sequential processor, which means that in order to reconstruct the nth element from a coded sequence, the previous 0...(n-1) symbols must first be decoded. Seeking operations are not possible because of the variable length nature of the encoded data — the decoder has no way of jumping to the correct offset for symbol n without first decoding all previous symbols. Additionally, some coding schemes rely on internal historical state that is maintained only by processing each symbol within a sequence in order.

A Lossy Note

Although lossy compression is outside the scope of this article, it is important to note that lossy compression often utilizes a lossless compressor as a part of its pipeline. Lossy compression may be achieved as a two step process where data is first carefully decimated (to discard unwanted or unnecessary information), and then compressed using a lossless algorithm. Popular image and video codecs such as JPEG and H.264 do precisely this, and rely upon lossless algorithms such as Huffman encoding or arithmetic encoding to help achieve their great efficiencies.


This article has focused on lossless compression techniques and provided a brief introduction to some of the most popular techniques. Hopefully this has piqued your interest in the important field of data compression, and provided pointers for further reading on the subject.