How File Compression Works

Get an insight into how file compression works.
Date created: April 12, 2024

What is a compressed file?

Let’s say you have a file named compressed.zip. Notice that it ends with the .zip extension, which denotes that it’s a zip file. When you decompress (extract) its contents, you might see several files: file1.txt, file2.txt, and file3.txt. These files were compressed into the single file compressed.zip.
When you check the size of compressed.zip, it should be smaller than the combined size of the files inside it.

Here is an example of what I mean. I’m on Linux, so I’m using tar as my archiving utility. If you’re on Windows, you can right-click the files and create a zip from the context menu.

Figure: 1
In the image above, compressed.tar.gz contains three files whose sizes add up to more than the size of compressed.tar.gz itself, so compressed.tar.gz is a compressed archive of those three files.

If you’re using gzip as your compression utility, you can see the compression ratio it achieved with the -l option:

gzip -l file.txt.gz
compressed        uncompressed  ratio uncompressed_name
    33                  65      86.2%     file.txt
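The same kind of size comparison can be sketched in Python with the standard gzip module. This is only an illustration: the sample data and the compression_stats helper are made up for this example, and the percentage here is simply the space saved over the whole compressed output, so it may not match what gzip -l prints for the same data.

```python
import gzip

def compression_stats(data: bytes) -> tuple[int, int, float]:
    """Return (original size, compressed size, fraction of space saved)."""
    compressed = gzip.compress(data)
    saved = 1 - len(compressed) / len(data)
    return len(data), len(compressed), saved

# Hypothetical sample: a short line repeated many times compresses well.
sample = b"the same line of text, over and over\n" * 200
original_size, compressed_size, saved = compression_stats(sample)
print(f"{original_size} bytes -> {compressed_size} bytes ({saved:.1%} saved)")
```

Highly repetitive input like this shrinks a lot. Very small or random-looking input can even grow slightly, because the gzip format adds its own header and trailer.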

How does compression work?

To understand how compression works, we have to understand the types of compression:

LOSSLESS: no data loss

Lossless compression doesn’t lose any data. It is used for text and document files, where we cannot afford to lose any text. Let’s take the example of a file my_file.txt that contains:

1111111111b1

Here, the character 1 is repeated ten times in a row.
A repeated run like this can be replaced by a shorter symbol or abbreviation, which saves some bits. No data is lost; only the repeating words or patterns are replaced. The content above can be abbreviated as:

10 1 and b1

which means the file contains ten 1s followed by b1.
You need a decompressor to decompress (unzip) the compressed file, since that utility recognizes the abbreviations used and replaces them with their actual values.
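The abbreviation idea above is close to what is usually called run-length encoding. Here is a minimal sketch of it in Python; the function names rle_encode and rle_decode are made up for this example, and real tools like gzip use more sophisticated schemes than this.

```python
from itertools import groupby

def rle_encode(text: str) -> list[tuple[int, str]]:
    """Collapse each run of identical characters into a (count, char) pair."""
    return [(len(list(group)), char) for char, group in groupby(text)]

def rle_decode(runs: list[tuple[int, str]]) -> str:
    """Expand (count, char) pairs back into the original text."""
    return "".join(char * count for count, char in runs)

original = "1" * 10 + "b1"   # the contents of my_file.txt from the example
runs = rle_encode(original)
print(runs)                  # [(10, '1'), (1, 'b'), (1, '1')]
print(rle_decode(runs) == original)  # True: nothing was lost
```

Unlike the hand-written abbreviation, this sketch stores b and the final 1 as two separate runs of length one, but the round trip still gives back exactly the original text, which is what makes it lossless.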

LOSSY: loses data

On the other hand, lossy compression loses data.
Lossy compression cannot be used for text files, since data loss is not acceptable for such files.
But it is widely used for images, audio, and video, since our senses aren’t that sharp and we usually don’t notice the missing bits.

Figure: 2

The highlighted portion of the image above appears to be a single color to our eyes, even though the pixels are slightly different. Lossy compression works on this assumption: it generalizes similar-looking bits of data, and even skips or deletes some of them, which reduces the overall quality of the compressed image.
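A toy way to picture this "generalizing" is to snap nearby color values to a common level. The sketch below is only an illustration of the idea, not how any real image format does it; the pixel values, the step size of 16, and the quantize helper are all made up for this example.

```python
def quantize(value: int, step: int = 16) -> int:
    """Snap a 0-255 color value down to the nearest multiple of step."""
    return (value // step) * step

# Hypothetical brightness values for six neighboring pixels.
pixels = [200, 201, 203, 202, 199, 204]
quantized = [quantize(p) for p in pixels]
print(quantized)  # [192, 192, 192, 192, 192, 192]
```

After this step all six pixels hold the same value, so they can be stored as one run, but the original small differences are gone for good. That is the trade lossy compression makes.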
An example of lossy compression is the JPEG image format. When an image you download from the internet looks like it’s from the 80s, that is often due to lossy compression.
JPEG relies on the discrete cosine transform (DCT) for its compression.
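To get a feel for the transform, here is a rough one-dimensional sketch in plain Python. It is heavily simplified: JPEG applies the transform to 8x8 blocks of pixels and uses quantization tables rather than simply dropping coefficients, and the sample values and function names below are made up for this example.

```python
import math

def dct(signal: list[float]) -> list[float]:
    """Unnormalized DCT-II: express the signal as a sum of cosine waves."""
    n = len(signal)
    return [
        sum(x * math.cos(math.pi / n * (i + 0.5) * k) for i, x in enumerate(signal))
        for k in range(n)
    ]

def idct(coeffs: list[float]) -> list[float]:
    """Inverse of dct() above (a scaled DCT-III)."""
    n = len(coeffs)
    return [
        coeffs[0] / n
        + 2 / n * sum(coeffs[k] * math.cos(math.pi / n * (i + 0.5) * k)
                      for k in range(1, n))
        for i in range(n)
    ]

# Hypothetical brightness values for one row of eight pixels.
samples = [52, 55, 61, 66, 70, 61, 64, 73]
coeffs = dct(samples)

# The lossy step: keep the four lowest-frequency coefficients, zero the rest.
kept = [c if k < 4 else 0.0 for k, c in enumerate(coeffs)]
approx = [round(v) for v in idct(kept)]

print(samples)
print(approx)  # similar to the samples, but fine detail is smoothed away
```

With all eight coefficients the inverse gives back the original row, up to floating-point rounding. The loss comes only from throwing coefficients away.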
To learn more about how this works check these out: