How File Compression Works
What is a compressed file?
Let’s say you have a file compressed.zip
, notice that this file ends with .zip
extension (it denotes that it’s a zip file) . When you decompress/extract it’s content you might see
several files file1.txt
, file2.txt
, file3.txt
now these files
you see, were compressed into a single file named compressed.zip
.
When you check the size of the compressed.zip
it should be lesser than
the combined size of each file inside it.
Here is an example of what I mean by that. I’m on Linux so I’m using tar
as my archiving utility. If you’re on windows you can right click or something.In the image above, the file compressed.tar.gz
contains, three files
whose size sums up to be greater than that of compressed.tar.gz
, so
compressed.tar.gz
is a compressed/archived file containing those three files.
If you’re using gzip as your archiving utility you can see the compression_ratio of the performed compression.
gzip -l file.txt.gz
compressed uncompressed ratio uncompressed_name
33 65 86.2% file.txt
How does compression work?
To understand how compression works, we have to understand the types of compression:
- Lossless
- Lossy
LOSSLESS: no data loss
Lossless compression doesn’t lose any data.
This type of compression is used when compressing text/document files where
we cannot afford losing text.
Let’s take an example of a file my_file.txt
that contains
1111111111b1
Here, the part 1
is being repeated in every line.
This can be replaced by a symbol/abbreviation this can help save some bits of data,
and no data is lost but only the repeating words/patterns are replaced.
This above can be abbreviated as:
10 1 and b1
which means there are ten 1 and b1
in the file.
You would need a decompressor to decompress/unzip the compressed file.
Since, the utility will recognize any abbreviations used and replace them
with their actual values.
LOSSY: loses data
On the other hand, Lossy compression loses data.
Lossy compression cannot be used for text files, since data loss is not
acceptable for such files.
But it is widely used for image, audio and video. Since, our senses aren’t that great
, and we usually don’t notice the missing bits.
The highlighted portion in the above image appears to have the same color to
our eyes, even when they’re slightly different.
Lossy compression works on this assumption, it will generalize similar looking
bits of data and even skipping and deleting some
of them, which results in the loss of overall quality of the image compressed.
An example of Lossy compression is the image JPEG
format. When you download something
from the internet it sometimes looks like it’s from the 80s, that is due to the Lossy compression.
JPEG relies on the Discrete cosine transform formula for its compression.
To learn more about how this works check these out: