The MVCOMP compression format Mateusz Viste Last update: 2024-10-07 MVCOMP is a minimalist data compression format. It is very easy to implement, a depacker can be written within less than 20 lines of C. It's easy, fast and light. MVCOMP is meant to be used as a (de)compression method in constrained environments (think 8086 and 16K of RAM). Technically speaking, it is a primitive compression method that is limited to back references and literal strings. Contrary to most compression algorithms, MVCOMP does not operate on bit streams, but on whole words. This makes it very simple and allows for fast implementations. MVCOMP's compression ratio is not very strong, but it is not ridiculously weak either: it is typically worse than ZIP -9 by less than 10 percentage points. The compressed output is a stream of 16-bit words (tokens). There are three types of token, but you only need to recognize two of them. === TOKEN TYPE 1: "BACK REFERENCE" ============================================ Token format is LLLL OOOO OOOO OOOO (16 bits), where: OOOO OOOO OOOO is the back reference offset (number of bytes-1 to rewind). LLLL is the number of bytes (-1) to copy from the offset position. In other words, when the decoder encounters such token, it needs to copy LLLL+1 bytes from past output that occured OOOO+1 bytes ago. LLLL must be non-zero (meaning that a back reference to a 1 byte string is not legal). If LLLL is zero, then the token is of TYPE 2. NOTE ABOUT SELF-DUPLICATING REFERENCES: The LLLL (length) value can be larger than the back reference offset. In such case, the back referenced string will naturally replicate itself. For example the string "beefbeefbeef" can be encoded as "beef" with a back reference of -4 and a length of 8 bytes. This also allows RLE: for example the string "aaaaa" may be encoded as "a" with a back reference of -1 and a length of 4 bytes. This self-duplicating mechanism does not require any special handling from the depacker code, as long as the depacker copies back-referenced strings from left to right, one byte at a time. === TOKEN TYPE 2: "LITERAL STRING START" ====================================== Token format is 0000 UUUU BBBB BBBB This token is used to encapsulate uncompressible data. BBBB BBBB is the literal value of the byte to be copied. UUUU is the number of uncompressible words (type 3 tokens) that follow. === TOKEN TYPE 3: "LITERAL STRING CONTINUATION" =============================== Token format is AAAA AAAA BBBB BBBB It occurs after a TYPE 2 token (literal string start) that has a non-zero UUUU. This token contains two raw bytes (A and B) to be copied to output as-is. ===============================================================================