File compression is a way to consolidate multiple input files into an output archive, removing data redundancies, so the output is both smaller (to save disk space and upload/download bandwidth) and easier to handle than separate input files.
A common concern about compressing data - either for backup or file distribution - is balancing a worthwhile compression ratio with reasonably fast operation, so that, for example, end users will be able to unpack the data in a timely fashion, or a backup process will complete within a fixed maximum amount of time.
As goals and constraints vary between scenarios, the factors affecting file compression efficiency must be carefully weighed, keeping in mind the intended use of the data in the first place.
The following are the factors that most influence compression efficiency and deserve the closest attention during evaluation, along with the options available to obtain the best results.
Lossless compression algorithms use statistical models to map the input to a smaller output, eliminating redundancy in the data.
In this way the output carries exactly all the information featured by the input in fewer bytes, and can be expanded when needed into a 1:1 copy of the original data, which is a fundamental property for storing some types of data - e.g. software, or a database.
For this reason lossless compression algorithms are used by general-purpose archive manager utilities for formats like 7Z, RAR, and ZIP, where an exact and reversible image of the original data must be saved.
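As a minimal illustration of this property, the following Python sketch (not part of any archiver, just the standard zlib module) compresses a redundant byte string and verifies that decompression restores a 1:1 copy of the input:

import zlib

# Highly redundant input: repetition is exactly what a lossless
# statistical model can exploit.
original = b"the quick brown fox jumps over the lazy dog\n" * 100

compressed = zlib.compress(original, level=9)
restored = zlib.decompress(compressed)

print(len(original), "->", len(compressed), "bytes")
assert restored == original  # expands back to an exact copy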
Lossy compression, instead, works by identifying less relevant information (not just redundant data) and removing it.
In this way data compression is improved, but at the cost of making lossy compression a non-reversible process, since part of the original information is permanently discarded.
Lossy compression is consequently not suitable for general-purpose file archiving (for example, losing a single byte of an executable file would make it unusable), but it works very well when the loss of less relevant information is acceptable, as in multimedia file compression - for example, MP3 discards audio information below the audibility threshold, JPEG discards details not visible to the eye, and compressed video formats do both.
So, information loss destroys the ability to reverse the algorithm 1:1 (the information is permanently lost), but it does not prevent end users from receiving meaningful information - intelligible audio, a clear picture or video.
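A toy sketch of this trade-off, using plain quantization as a stand-in for the perceptual models of real lossy codecs (this is not how MP3 or JPEG actually work), shows that the rounded values cannot be mapped back to the originals, yet still approximate the signal:

def quantize(samples, step=10):
    # Discard fine detail by rounding to the nearest multiple of 'step';
    # the remainder is permanently lost, so the mapping is not reversible.
    return [round(s / step) * step for s in samples]

original = [12, 27, 33, 48, 51, 66, 79]
lossy = quantize(original)

print("original:", original)
print("lossy:   ", lossy)  # close to the original, but not identical
print("max error:", max(abs(a - b) for a, b in zip(original, lossy)))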
Most common lossy compression algorithms are consequently fine-tuned for the specific patterns of a particular multimedia data type.
Due to the lossy nature of those compression schemes, however, professional editing work is usually performed on uncompressed data (e.g. WAV audio, or TIFF images) or on data compressed in a lossless way (e.g. FLAC audio, or PNG images) whenever feasible, so that saving the work in progress multiple times does not discard bits of information each time, with progressive degradation of quality - reserving lossy compression for the final step of creating a reasonably sized output to distribute for media consumption.
General-purpose good practices
for improving data compression efficiency
You usually don't need to archive duplicate files. Deduplicate in order to avoid archiving redundant data: identifying and removing duplicate files before archiving decreases the input size, improving both operation time and final size, and at the same time makes it easier for the end user to navigate and search a tidier archive. Do not remove duplicate files if they are strictly required in their original paths, e.g. by a software application or an automated procedure.
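One simple way to spot duplicates before archiving is to group files by a hash of their content; the Python sketch below (the folder name and the choice of SHA-256 are just examples) only reports duplicate sets, leaving the decision about which copies to drop to the user:

import hashlib
from collections import defaultdict
from pathlib import Path

def find_duplicates(root):
    # Group files by the SHA-256 hash of their content; files sharing
    # a digest are byte-for-byte identical (reads whole files, fine for a sketch).
    groups = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]

for duplicates in find_duplicates("data_to_archive"):
    print("duplicate set:", ", ".join(str(p) for p in duplicates))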
Identify poorly compressible files and evaluate whether it is worth spending time compressing them or better to simply store them "as is". Multimedia files (MP3, JPG, MPEG, AVI, DIVX...) tend to be poorly compressible, as those formats already feature lossy compression, and, especially videos, are usually very large compared to other file types (documents, applications), so it should be carefully evaluated whether they should be compressed at all - using the "Store" compression level provided by most file archivers, meaning compression is disabled - or even copied "as is".
To reduce disk usage of graphic files (JPEG, PNG, TIFF, BMP) see pictures
compression and optimization tips.
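One way to decide between compressing and storing is a quick test on a sample of each file: if a fast lossless pass barely shrinks it, the "Store" level is probably the better choice. A minimal sketch of the idea (the 1 MB sample, the 10% threshold, and the file names are arbitrary assumptions):

import zlib

def worth_compressing(path, sample_size=1024 * 1024, min_gain=0.10):
    # Compress the first chunk of the file with a fast deflate setting;
    # if the saving is below 'min_gain', suggest storing it uncompressed.
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    if not sample:
        return False
    ratio = len(zlib.compress(sample, level=1)) / len(sample)
    return (1.0 - ratio) >= min_gain

for name in ["report.txt", "movie.avi", "song.mp3"]:
    print(name, "->", "compress" if worth_compressing(name) else "store")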
Some document formats (PDF, OpenOffice, and Office 2007 and later file formats), and some databases, are already compressed (usually with fast deflate-based lossless compression), so they generally do not compress much further. Encrypted data is not compressible at all: being pseudo-random, there is no "shorter way" to represent the information carried in encrypted files.
Separating poorly compressible data from the rest is a good starting point for defining a compression policy and deciding the best strategy for each type of data.
Better compression ratios are usually attained with slower and more computing-intensive algorithms: e.g. RAR is a slower and more powerful compressor than ZIP, and 7Z is slower and more powerful than RAR; see the file format comparison and benchmarks.
Different data types may give different results with different compression algorithms; for example, the weaker RAR and ZIPX compression can close the gap with the stronger 7Z compression when multimedia files are involved, thanks to efficiently optimized multimedia compression routines employed by RAR and ZIPX when suitable data structures are detected - even so, lossy-compressed multimedia files remain poorly compressible data.
Switching to a more powerful algorithm is usually more effective at improving the compression ratio than using the highest compression settings of a weaker algorithm.
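Python's standard library makes it easy to compare a deflate-based compressor at different levels against an LZMA-based one (the algorithm family used by 7Z); exact figures depend on the data, but on redundant input the algorithm switch typically gains far more than raising the level of the weaker compressor:

import lzma
import zlib

data = b"The quick brown fox jumps over the lazy dog. " * 2000

results = {
    "deflate level 6": len(zlib.compress(data, level=6)),
    "deflate level 9": len(zlib.compress(data, level=9)),
    "lzma preset 9": len(lzma.compress(data, preset=9)),
}
for name, size in results.items():
    print(f"{name}: {size} bytes ({size / len(data):.1%} of original)")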
It should be evaluated carefully whether better compression is really needed (after deduplication and evaluation of poorly compressible files), or whether the archive is mainly made for reasons other than saving file size, e.g. applying encryption, handling the content as a single file, etc.
Solid compression is an option meant to improve the compression ratio by providing a wider context to the compression algorithm while compressing multiple files, letting it reduce data redundancy across files and represent it in a more convenient way to spare output size.
Solid compression is used in compressed TAR files (TAR.GZ, TAR.BZ2, TGZ, TXZ...), and it is available as an option for some archival formats, like 7Z and RAR.
The ideas behind solid compression are simple and effective:
- when multiple files are processed as a single block (especially similar files, e.g. files of the same type, or even revisions of the same file), it is possible to find redundant data across the files of the group, making the compressed representation of the data more efficient than treating each file separately
- when many small files are processed as a single block, overhead content (file begin/end markers, checksums, table of contents) is written only once rather than once per file, saving extra bytes for each input object.
Main drawbacks of solid compression are:
- the context information is also needed during compression and extraction to preserve the advantage of solid compression, so partial extraction (of a single file or group of files rather than the whole archive) from a solid archive, or adding to or deleting from an already existing solid archive, takes more time because all the relevant context data (usually called the "solid block") must be parsed, making the process significantly slower than adding or extracting data from a non-solid archive
- for the very same reason, damage in any part of a solid block may make all the data after that point unusable, for lack of the context information needed for extraction, while data corruption in a non-solid archive usually harms only the data of a single file.
To mitigate those disadvantages, the 7Z format allows choosing the block size used for solid mode operation (the "window" of data context parsed by the compression/extraction algorithm), minimizing overhead during extraction and the possible impact of data corruption - but for the very same reason, reducing the solid block size also reduces the potential compression ratio gains.
Solid blocks can be defined by size, by number of files per block, and by whether blocks are separated by file extension.
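The benefit of the wider context can be simulated with the standard library by compressing several similar files one by one versus concatenating them into a single stream first - a rough stand-in for a solid block, not the actual 7Z or RAR implementation:

import zlib

# Three "revisions" of the same document: very similar to one another.
files = [
    b"configuration file revision %d\n" % i + b"common boilerplate text\n" * 500
    for i in range(1, 4)
]

per_file = sum(len(zlib.compress(f, level=9)) for f in files)
solid = len(zlib.compress(b"".join(files), level=9))

print("compressed separately:", per_file, "bytes")
print("compressed as one solid block:", solid, "bytes")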
Choose carefully whether the intended use of the compressed data calls for high compression and/or solid compression: the more often the data will need to be extracted, the more times the computational overhead will apply for each end user.
For example, software distribution greatly benefits from maximum compression, as saving bandwidth is critical and the end user usually extracts the data only once, while the overhead may not be acceptable if the data needs to be accessed often and fast extraction becomes a decisive factor.
Fitting within size constraints (e.g. a mail attachment limit, or the size of physical media) is usually feasible with most archival utilities by splitting the output file into volumes of the desired size (volume spanning, or file split), progressively numbered, e.g. .001, .002, ... .nnn, so the receiver can extract the whole archive, usually by saving all files in the same path and starting extraction from the .001 file.
This is the simplest and most reliable way to fit within a mandatory output size, rather than trying to improve the compression ratio with slower/heavier algorithms or settings in the hope of reaching the desired size.
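Conceptually, volume spanning just cuts the finished archive into fixed-size, progressively numbered pieces that can be concatenated back together before (or while) extracting. A minimal sketch (the file name and the 10 MB volume size are arbitrary, and real archivers stream the data instead of loading it all in memory):

from pathlib import Path

def split_into_volumes(archive, volume_size=10 * 1024 * 1024):
    # Write archive.001, archive.002, ... each at most 'volume_size' bytes;
    # concatenating the volumes in order restores the original archive.
    data = Path(archive).read_bytes()
    for offset in range(0, len(data), volume_size):
        part = Path(f"{archive}.{offset // volume_size + 1:03d}")
        part.write_bytes(data[offset:offset + volume_size])

split_into_volumes("backup.7z")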
Quite obviously, the best data compression practices mean nothing if the file cannot be delivered to the intended end user. If the archive needs to be shared, the first concern is which archive file types the end user is able to read - which archive formats are supported, or can be supported, on the end user's computing platform (Microsoft Windows, Google Android/ChromeOS, iOS, Apple OSX, Linux, BSD...) - and whether the user is willing and authorized to install the needed software.
So, most of the time the better choice in this case is staying with the most widely supported format (ZIP), while RAR is quite popular on MS Windows platforms, TAR is ubiquitously supported on Unix-derived systems, and 7Z is becoming increasingly popular on all systems.
Some file sharing platforms, cloud services, and e-mail providers may block some file types with the explanation that they are commonly abused (spam, viruses, illicit content), preventing them from reaching the intended end user(s), so it is critical to read the terms of service to avoid this issue.
Usually changing the file extension is not a solution, as each archive file has a well defined internal structure (which is required for the file to function properly, so it can hardly be cloaked) and file format recognition is seldom based on simply parsing the file extension.
In some other cases, all encrypted files, or all files of unknown/unsupported formats that the service provider is not able to inspect or scan for viruses, are blocked.
Self-extracting archives are useful to provide the end user with the appropriate extraction routines without the need to install any software, but since the extraction module is embedded in the archive it adds an overhead of some tens or hundreds of KB, which makes it a noticeable disadvantage only in the case of very small archives (e.g. approximately less than 1 MB) - however, that is well within the size range of a typical archive of a few textual documents. Moreover, since a self-extracting archive is an executable file, some file sharing platforms, cloud providers, and e-mail servers may block the file, preventing it from reaching the intended end user(s).
PeaZip's zero deletion function (File tools submenu) is intended for overwriting file data or free partition space with an all-zero stream, in order to fill the corresponding physical disk area with homogeneous, highly compressible data.
This allows saving space when compressing disk images, both low-level physical disk snapshots taken for backup purposes and Virtual Machine guest virtual disks, as the 1:1 exact copy of the disk content is not burdened by leftover data in the free space area - some disk imaging utilities and Virtual Machine players/managers have built-in compression routines, and zeroing free space beforehand is strongly recommended to improve the compression ratio.
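In essence, zeroing free space means writing a file full of zero bytes until the volume is full and then deleting it, so that unused disk areas end up containing uniform, highly compressible data. The rough Python sketch below illustrates the idea only (the path is an example and PeaZip's own File tools implementation may differ; note that it will temporarily fill the target volume):

import os

def zero_free_space(mount_point, chunk_size=64 * 1024 * 1024):
    # Fill the volume's free space with zero bytes, then delete the
    # filler file so the space is free again but contains only zeros.
    filler = os.path.join(mount_point, "zero_fill.tmp")
    zeros = b"\0" * chunk_size
    try:
        with open(filler, "wb") as f:
            while True:
                f.write(zeros)
    except OSError:
        pass  # disk full: the remaining free blocks have been overwritten
    finally:
        if os.path.exists(filler):
            os.remove(filler)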
Zeroing deletion also offers a basic grade of security improvement over PeaZip's "Quick delete", which simply removes the file from the filesystem, making it not recoverable from the system's recycle bin but still susceptible to being recovered with undelete utilities. Zero deletion, however, is not meant for advanced security, and PeaZip's Secure delete should be used instead when it is needed to securely and permanently erase a file or sanitize free space on a volume for privacy reasons.
Topics: maximum compression how to, fit mail attachment
limit, compress under mandatory size, highest compression ratio, solid
compression, self extracting archives, fastest compression, fast
extraction, smaller archive, reduce file size, fit data in cd/dvd size,
spanned volumes, sharing files, e-mail files, e-mail filtering, improve
compression ratio, best compression ratio, compression efficiency,
hints for data compression, file compression tips and tricks,
suggestions for improving compression efficiency, optimize compression,
optimal data compression, lossy, lossless, virtual machines, disk
Related articles: Add content to already
existing archive, Convert
archive files, Create 7Z
files, RAR files,
Create ZIP files, Encrypted files, Find
duplicate files, Comparison
of archive file formats