Description of the NTFS (de)compression algorithm (based on a modified LZ77 algorithm)

Copyright (c) 2001 Anton Altaparmakov

This document is published under the GNU General Public License.

Credits: This is based on notes taken from various places (most notably from Regis Duchesne's NTFS documentation and from various LZ77 descriptions) and further refined by looking at a few compressed streams to figure out some uncertainties.

Note: You should also read the runlist description with regard to compression in linux-ntfs/include/layout.h. Just search for "Attribute compression". FIXME: Should merge the info from there into this document some time.

Compressed data is organized in logical "compression" blocks (cb). Each cb has a size (cb_size) of 2^compression_unit clusters. In all versions of Windows NTFS (NT/2k/XP, NTFS 1.2-3.1), the only valid compression_unit is 4, IOW, each cb is 2^4 = 16 clusters in size. We detect and warn about a compression_unit != 4 but we try to decompress the data anyway. Compression is only supported for cluster sizes between 512 and 4096 bytes, so a cb can be between 8kiB and 64kiB in size.

Each cb is independent of the other cbs and is thus the minimal unit we have to parse, even if we wanted to decompress only one byte. Also, a cb can be stored totally uncompressed, or it can be completely sparse (all zeroes); either case is apparent from the runlist.

Thus, we need to look at the runlist of the compressed data stream, starting at the beginning of the first cb overlapping @page. So we convert the page offset into units of clusters (vcn), and round the vcn down to a multiple of cb_size clusters. We then scan the runlist for the appropriate position and, based on what we find there, we decide how to proceed.

If the cb is not compressed at all, and covers the whole of @page, we pretend to be accessing an uncompressed file, so we fall back to what we do in aops.c::ntfs_file_readpage(), i.e. we do:

	return block_read_full_page(page, ntfs_file_get_block);

If the cb is completely sparse, and covers the whole of @page, we can just zero out @page and complete the io (set @page up-to-date, unlock it, and finally return 0).

In all other cases we initiate the decompression engine, but first some more on the compression algorithm.

Before compression, the data of each cb is further divided into 4kiB blocks, which we call "sub compression" blocks (sb), each including a header specifying its compressed length. So we could just scan the cb for the first sb overlapping @page and skip the sbs before it, or we could decompress the whole cb, injecting the superfluous decompressed pages into the page cache as a form of read ahead (this is what zisofs does, for example). In either case, we then need to read and decompress all sbs overlapping @page, potentially having to decompress one or more other cbs, too. As soon as @page is completed we could either stop or continue until we finish the current cb, injecting pages as we go along (again following the zisofs example).

Because the sbs follow each other directly, we need to read in the whole cb anyway in order to scan through it and find the first sb overlapping @page, so it does make sense to follow the zisofs approach of decompressing the whole cb and injecting pages as we go along. All discussion from now on will assume that we are going to do that. It might make sense, though, not to decompress any sbs located before @page, because this would be a kind of "read-behind", which is probably silly unless someone is reading the file backwards. Performing read-ahead by decompressing all sbs following @page, OTOH, is very likely to be a good idea.
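To make the vcn arithmetic above concrete, here is a minimal sketch of rounding a page's byte offset down to the start of its cb; the helper name and parameters are made up for illustration and are not taken from the driver:

	#include <stdint.h>

	typedef int64_t VCN;

	/*
	 * Return the vcn at which the cb containing @page_offset starts.
	 * @cluster_size is in bytes, @cb_size in clusters (2^compression_unit,
	 * i.e. 16 in practice) and must be a power of two.
	 */
	static VCN first_cb_vcn(uint64_t page_offset, uint32_t cluster_size,
				uint32_t cb_size)
	{
		VCN vcn = page_offset / cluster_size;	/* page offset in clusters */

		return vcn & ~(VCN)(cb_size - 1);	/* round down to a cb boundary */
	}

For example, a page at byte offset 0x19000 on a volume with 1kiB clusters lies in cluster (vcn) 100, which rounds down to vcn 96, i.e. the start of the seventh cb of the stream.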
So, we read the whole cb from disk and start at the first sb. As mentioned above, each sb begins with a header. The header is 16 bits, of which the lower twelve bits (i.e. bits 0 to 11) hold the length (L) of the sb minus 3, where L includes the two bytes of the header itself (or, not counting the header bytes, the length of the compressed data minus 1). The higher four bits are set to 1011 (0xb) by the compressor for a compressed block, or to 0000 for an uncompressed block, but the decompressor only checks the most significant bit, taking a 1 to signify a compressed block and a 0 an uncompressed block. So from the header we know how many compressed bytes we need to decompress to obtain the next 4kiB of uncompressed data, and if we didn't want to decompress this sb we could just seek to the next one using the length read from the header. We could then continue seeking until we reach the first sb overlapping @page. In either case, we will reach a sb which we want to decompress.

Having dealt with the 16-bit header of the sb, we now have length bytes of compressed data to decompress. This compressed stream is further split into tokens which are organized into groups of eight tokens. Each token group (tg) starts with a tag byte, which is an eight-bit bitmap, the bits specifying the type of each of the following eight tokens. The least significant bit (LSB) corresponds to the first token and the most significant bit (MSB) corresponds to the last token.

The two types of tokens are symbol tokens, specified by a zero bit, and phrase tokens, specified by a set bit. A symbol token (st) is a single byte and is to be taken literally and copied into the sliding window (the decompressed data). A phrase token (pt) is a pointer back into the sliding window (in bytes), together with a length (again in bytes), starting at the byte the back pointer is pointing to. Thus a phrase token defines a sequence of bytes in the sliding window which need to be copied at the current position into the sliding window (the decompressed data stream).

Each pt consists of 2 bytes split into the back pointer (p) and the length (l), each of variable bit width (but the sum of the widths of p and l is fixed at 16 bits). p is at least 4 bits and l is at most 12 bits. The most significant bits contain the back pointer (p), while the least significant bits contain the length (l). l is actually stored as the number of bytes minus 3 (unsigned), as anything shorter than that would be at least as long as the 2 bytes needed for the actual pt, so no compression would be achieved. p is stored as the positive number of bytes minus 1 (unsigned), as going zero bytes back is meaningless.

Note that decompression has to occur byte by byte, as it is possible that some of the bytes pointed to by the pt will only be generated in the sliding window as the byte sequence pointed to by the pt is being copied into it!

To give a concrete example: a block full of the letter A would be compressed by storing the byte A once as a symbol token, followed by a single phrase token with back pointer -1 (p = 0, therefore go back by 0 + 1 = 1 byte) and length 4095 (l = 0xffc, therefore length 0xffc + 3 bytes).
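To make this example fully concrete, here is the byte layout the rules above imply. This is our own reconstruction (using the width rule described next, under which p occupies its minimal four bits and l the full twelve this early in the block), not bytes captured from a real volume. The compressed sb is six bytes in total:

	03 b0	sb header: length field = 6 - 3 = 3, high nibble 0xb
		(compressed), i.e. the value 0xb003 stored little endian
	02	tag byte: bit 0 clear (first token is a symbol token),
		bit 1 set (second token is a phrase token)
	41	symbol token: the literal 'A'
	fc 0f	phrase token: (p << 12) | l = (0 << 12) | 0xffc = 0x0ffc,
		stored little endian; go back 0 + 1 = 1 byte and copy
		0xffc + 3 = 4095 bytes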
The widths of p and l are determined from the current position within the decompressed data (cur_pos). We don't actually care about the widths as such, however; instead we want the mask (l_mask) with which to AND the pt to obtain l, and the number of bits (p_shift) by which to right shift the pt to obtain p. These are determined using the following algorithm:

	for (i = cur_pos, l_mask = 0xfff, p_shift = 12; i >= 0x10; i >>= 1) {
		l_mask >>= 1;
		p_shift--;
	}

Note that, as usual in NTFS, the sb header, as well as each pt, are stored in little endian format.
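Putting all of the above together, here is a minimal sketch of decompressing a single compressed sb. This is illustrative only and not the driver's code: the function name is made up, the 16-bit sb header is assumed to have already been read (so @cb points at the first tag byte and @cb_end just past the last byte of the sb's compressed data), and error handling is reduced to basic bounds checks:

	#include <stddef.h>
	#include <stdint.h>

	/* Returns the number of bytes written to @dest, or -1 on corrupt input. */
	static int decompress_sb_sketch(uint8_t *dest, size_t dest_size,
					const uint8_t *cb, const uint8_t *cb_end)
	{
		size_t cur_pos = 0;		/* position in the decompressed data */

		while (cb < cb_end && cur_pos < dest_size) {
			uint8_t tag = *cb++;	/* one tag byte per group of eight tokens */
			int token;

			for (token = 0; token < 8; token++, tag >>= 1) {
				unsigned int l_mask = 0xfff, p_shift = 12;
				size_t i, back, len;
				uint16_t pt;

				if (cb >= cb_end || cur_pos >= dest_size)
					break;
				if (!(tag & 1)) {
					/* Symbol token: copy one literal byte. */
					dest[cur_pos++] = *cb++;
					continue;
				}
				/* Phrase token: determine the p/l split from cur_pos. */
				for (i = cur_pos; i >= 0x10; i >>= 1) {
					l_mask >>= 1;
					p_shift--;
				}
				if (cb_end - cb < 2)
					return -1;
				pt = cb[0] | (cb[1] << 8);	/* pts are little endian */
				cb += 2;
				back = (pt >> p_shift) + 1;	/* p is stored as bytes back - 1 */
				len = (pt & l_mask) + 3;	/* l is stored as length - 3 */
				if (back > cur_pos || len > dest_size - cur_pos)
					return -1;
				/* Copy byte by byte so overlapping sequences work. */
				while (len--) {
					dest[cur_pos] = dest[cur_pos - back];
					cur_pos++;
				}
			}
		}
		return (int)cur_pos;
	}

In practice @dest_size would be 4kiB (the size of one decompressed sb), and an uncompressed sb (header MSB clear) would instead be copied verbatim rather than passed through this routine.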