The MVCOMP compression format
                                 Mateusz Viste

                            Last update: 2024-10-07


MVCOMP is a minimalist data compression format. It is very easy to implement,
a depacker can be written within less than 20 lines of C. It's easy, fast and
light. MVCOMP is meant to be used as a (de)compression method in constrained
environments (think 8086 and 16K of RAM).

Technically speaking, it is a primitive compression method that is limited to
back references and literal strings. Contrary to most compression algorithms,
MVCOMP does not operate on bit streams, but on whole words. This makes it very
simple and allows for fast implementations.

MVCOMP's compression ratio is not very strong, but it is not ridiculously weak
either: it is typically worse than ZIP -9 by less than 10 percentage points.

The compressed output is a stream of 16-bit words (tokens). There are three
types of token, but you only need to recognize two of them.


=== TOKEN TYPE 1: "BACK REFERENCE" ============================================

Token format is LLLL OOOO OOOO OOOO (16 bits), where:

OOOO OOOO OOOO  is the back reference offset (number of bytes-1 to rewind).
          LLLL  is the number of bytes (-1) to copy from the offset position.

In other words, when the decoder encounters such token, it needs to copy
LLLL+1 bytes from past output that occured OOOO+1 bytes ago.

LLLL must be non-zero (meaning that a back reference to a 1 byte string is not
legal). If LLLL is zero, then the token is of TYPE 2.

NOTE ABOUT SELF-DUPLICATING REFERENCES:
The LLLL (length) value can be larger than the back reference offset. In such
case, the back referenced string will naturally replicate itself. For example
the string "beefbeefbeef" can be encoded as "beef" with a back reference of -4
and a length of 8 bytes. This also allows RLE: for example the string "aaaaa"
may be encoded as "a" with a back reference of -1 and a length of 4 bytes.

This self-duplicating mechanism does not require any special handling from the
depacker code, as long as the depacker copies back-referenced strings from left
to right, one byte at a time.


=== TOKEN TYPE 2: "LITERAL STRING START" ======================================

Token format is 0000 UUUU BBBB BBBB

This token is used to encapsulate uncompressible data.

BBBB BBBB  is the literal value of the byte to be copied.
     UUUU  is the number of uncompressible words (type 3 tokens) that follow.


=== TOKEN TYPE 3: "LITERAL STRING CONTINUATION" ===============================

Token format is AAAA AAAA BBBB BBBB

It occurs after a TYPE 2 token (literal string start) that has a non-zero UUUU.
This token contains two raw bytes (A and B) to be copied to output as-is.


===============================================================================