|
|
1.1 ! root 1: 1. Algorithm ! 2: ! 3: The deflation algorithm used by zip and gzip is a variation of LZ77 ! 4: (Lempel-Ziv 1977, see reference below). It finds duplicated strings in ! 5: the input data. The second occurrence of a string is replaced by a ! 6: pointer to the previous string, in the form of a pair (distance, ! 7: length). Distances are limited to 32K bytes, and lengths are limited ! 8: to 258 bytes. When a string does not occur anywhere in the previous ! 9: 32K bytes, it is emitted as a sequence of literal bytes. (In this ! 10: description, 'string' must be taken as an arbitrary sequence of bytes, ! 11: and is not restricted to printable characters.) ! 12: ! 13: Literals or match lengths are compressed with one Huffman tree, and ! 14: match distances are compressed with another tree. The trees are stored ! 15: in a compact form at the start of each block. The blocks can have any ! 16: size (except that the compressed data for one block must fit in ! 17: available memory). A block is terminated when zip determines that it ! 18: would be useful to start another block with fresh trees. (This is ! 19: somewhat similar to compress.) ! 20: ! 21: Duplicated strings are found using a hash table. All input strings of ! 22: length 3 are inserted in the hash table. A hash index is computed for ! 23: the next 3 bytes. If the hash chain for this index is not empty, all ! 24: strings in the chain are compared with the current input string, and ! 25: the longest match is selected. ! 26: ! 27: The hash chains are searched starting with the most recent strings, to ! 28: favor small distances and thus take advantage of the Huffman encoding. ! 29: The hash chains are singly linked. There are no deletions from the ! 30: hash chains, the algorithm simply discards matches that are too old. ! 31: ! 32: To avoid a worst-case situation, very long hash chains are arbitrarily ! 33: truncated at a certain length, determined by a runtime option (zip -1 ! 34: to -9). So zip does not always find the longest possible match but ! 35: generally finds a match which is long enough. ! 36: ! 37: zip also defers the selection of matches with a lazy evaluation ! 38: mechanism. After a match of length N has been found, zip searches for a ! 39: longer match at the next input byte. If a longer match is found, the ! 40: previous match is truncated to a length of one (thus producing a single ! 41: literal byte) and the longer match is emitted afterwards. Otherwise, ! 42: the original match is kept, and the next match search is attempted only ! 43: N steps later. ! 44: ! 45: The lazy match evaluation is also subject to a runtime parameter. If ! 46: the current match is long enough, zip reduces the search for a longer ! 47: match, thus speeding up the whole process. If compression ratio is more ! 48: important than speed, zip attempts a complete second search even if ! 49: the first match is already long enough. ! 50: ! 51: ! 52: 2. gzip file format ! 53: ! 54: The pkzip format imposes a lot of overhead in various headers, which ! 55: are useful for an archiver but not necessary when only one file is ! 56: compressed. gzip uses a much simpler structure. Numbers are in little ! 57: endian format, and bit 0 is the least significant bit. ! 58: A gzip file is a sequence of compressed members. Each member has the ! 59: following structure: ! 60: ! 61: 2 bytes magic header 0x1f, 0x8b (\037 \213) ! 62: 1 byte compression method (0..7 reserved, 8 = deflate) ! 63: 1 byte flags ! 64: bit 0 set: file probably ascii text ! 65: bit 1 set: continuation of multi-part gzip file ! 66: bit 2 set: extra field present ! 67: bit 3 set: original file name present ! 68: bit 4 set: file comment present ! 69: bit 5 set: file is encrypted ! 70: bit 6,7: reserved ! 71: 4 bytes file modification time in Unix format ! 72: 1 byte extra flags (depend on compression method) ! 73: 1 byte operating system on which compression took place ! 74: ! 75: 2 bytes optional part number (second part=1) ! 76: 2 bytes optional extra field length ! 77: ? bytes optional extra field ! 78: ? bytes optional original file name, zero terminated ! 79: ? bytes optional file comment, zero terminated ! 80: 12 bytes optional encryption header ! 81: ? bytes compressed data ! 82: 4 bytes crc32 ! 83: 4 bytes uncompressed input size modulo 2^32 ! 84: ! 85: The format was designed to allow single pass compression without any ! 86: backwards seek, and without a priori knowledge of the uncompressed ! 87: input size or the available size on the output media. If input does ! 88: not come from a regular disk file, the file modification time is set ! 89: to the time at which compression started. ! 90: ! 91: The time stamp is useful mainly when one gzip file is transferred over ! 92: a network. In this case it would not help to keep ownership ! 93: attributes. In the local case, the ownership attributes are preserved ! 94: by gzip when compressing/decompressing the file. A time stamp of zero ! 95: is ignored. ! 96: ! 97: Bit 0 in the flags is only an optional indication, which can be set by ! 98: a small lookahead in the input data. In case of doubt, the flag is ! 99: cleared indicating binary data. For systems which have different ! 100: file formats for ascii text and binary data, the decompressor can ! 101: use the flag to choose the appropriate format. ! 102: ! 103: It must be possible to detect the end of the compressed data with any ! 104: compression format, regardless of the actual size of the compressed ! 105: data. If the compressed data cannot fit in one file (in particular for ! 106: diskettes), each part starts with a header as described above, but ! 107: only the last part has the crc32 and uncompressed size. A decompressor ! 108: may prompt for additional data for multipart compressed files. It is ! 109: desirable but not mandatory that multiple parts be extractable ! 110: independently so that partial data can be recovered if one of the ! 111: parts is damaged. This is possible only if no compression state is ! 112: kept from one part to the other. The compression-type dependent flags ! 113: can indicate this. ! 114: ! 115: If the file being compressed is on a file system with case insensitive ! 116: names, the original name field must be forced to lower case. There is ! 117: no original file name if the data was compressed from standard input. ! 118: ! 119: On operating systems which support multiple extensions, and on ! 120: original files without extension, the extension ".z" is added to the ! 121: original file name (foo.c => foo.c.z). Otherwise the string "z" (or ! 122: "-z" for VMS) is added to the original extension (foo.c => foo.cz or ! 123: foo.c-z). The original name or extension is truncated if necessary, ! 124: but in this case the original name is always saved in the compressed ! 125: file (foo.doc => foo.doz). ! 126: ! 127: Compression is always performed, even if the compressed file is ! 128: slightly larger than the original. The worst case expansion is ! 129: a few bytes for the gzip file header, plus 5 bytes every 32K block, ! 130: or an expansion ratio of 0.015% for large files. ! 131: ! 132: The encryption is that of zip 1.9. For the encryption check, the ! 133: last byte of the decoded encryption header must be zero. The time ! 134: stamp of an encrypted file might be set to zero to avoid giving a clue ! 135: about the construction of the random header. ! 136: ! 137: Jean-loup Gailly ! 138: [email protected] ! 139: ! 140: References: ! 141: ! 142: [LZ77] Ziv J., Lempel A., "A Universal Algorithm for Sequential Data ! 143: Compression", IEEE Transactions on Information Theory", Vol. 23, No. 3, ! 144: pp. 337-343. ! 145: ! 146: APPNOTE.TXT documentation file in PKZIP 1.93a. It is available by ! 147: ftp in ux1.cso.uiuc.edu:/pc/exec-pc/pkz193a.exe [128.174.5.59] ! 148: Use "unzip pkz193a.exe APPNOTE.TXT" to extract.
This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.