|
|
1.1 root 1: 1. Algorithm
2:
3: The deflation algorithm used by zip and gzip is a variation of LZ77
4: (Lempel-Ziv 1977, see reference below). It finds duplicated strings in
5: the input data. The second occurrence of a string is replaced by a
6: pointer to the previous string, in the form of a pair (distance,
7: length). Distances are limited to 32K bytes, and lengths are limited
8: to 258 bytes. When a string does not occur anywhere in the previous
9: 32K bytes, it is emitted as a sequence of literal bytes. (In this
10: description, 'string' must be taken as an arbitrary sequence of bytes,
11: and is not restricted to printable characters.)
12:
13: Literals or match lengths are compressed with one Huffman tree, and
14: match distances are compressed with another tree. The trees are stored
15: in a compact form at the start of each block. The blocks can have any
16: size (except that the compressed data for one block must fit in
17: available memory). A block is terminated when zip determines that it
18: would be useful to start another block with fresh trees. (This is
19: somewhat similar to compress.)
20:
21: Duplicated strings are found using a hash table. All input strings of
22: length 3 are inserted in the hash table. A hash index is computed for
23: the next 3 bytes. If the hash chain for this index is not empty, all
24: strings in the chain are compared with the current input string, and
25: the longest match is selected.
26:
27: The hash chains are searched starting with the most recent strings, to
28: favor small distances and thus take advantage of the Huffman encoding.
29: The hash chains are singly linked. There are no deletions from the
30: hash chains, the algorithm simply discards matches that are too old.
31:
32: To avoid a worst-case situation, very long hash chains are arbitrarily
33: truncated at a certain length, determined by a runtime option (zip -1
34: to -9). So zip does not always find the longest possible match but
35: generally finds a match which is long enough.
36:
37: zip also defers the selection of matches with a lazy evaluation
38: mechanism. After a match of length N has been found, zip searches for a
39: longer match at the next input byte. If a longer match is found, the
40: previous match is truncated to a length of one (thus producing a single
41: literal byte) and the longer match is emitted afterwards. Otherwise,
42: the original match is kept, and the next match search is attempted only
43: N steps later.
44:
45: The lazy match evaluation is also subject to a runtime parameter. If
46: the current match is long enough, zip reduces the search for a longer
47: match, thus speeding up the whole process. If compression ratio is more
48: important than speed, zip attempts a complete second search even if
49: the first match is already long enough.
50:
51:
52: 2. gzip file format
53:
54: The pkzip format imposes a lot of overhead in various headers, which
55: are useful for an archiver but not necessary when only one file is
56: compressed. gzip uses a much simpler structure. Numbers are in little
57: endian format, and bit 0 is the least significant bit.
58: A gzip file is a sequence of compressed members. Each member has the
59: following structure:
60:
61: 2 bytes magic header 0x1f, 0x8b (\037 \213)
62: 1 byte compression method (0..7 reserved, 8 = deflate)
63: 1 byte flags
64: bit 0 set: file probably ascii text
65: bit 1 set: continuation of multi-part gzip file
66: bit 2 set: extra field present
67: bit 3 set: original file name present
68: bit 4 set: file comment present
69: bit 5 set: file is encrypted
70: bit 6,7: reserved
71: 4 bytes file modification time in Unix format
72: 1 byte extra flags (depend on compression method)
73: 1 byte operating system on which compression took place
74:
75: 2 bytes optional part number (second part=1)
76: 2 bytes optional extra field length
77: ? bytes optional extra field
78: ? bytes optional original file name, zero terminated
79: ? bytes optional file comment, zero terminated
80: 12 bytes optional encryption header
81: ? bytes compressed data
82: 4 bytes crc32
83: 4 bytes uncompressed input size modulo 2^32
84:
85: The format was designed to allow single pass compression without any
86: backwards seek, and without a priori knowledge of the uncompressed
87: input size or the available size on the output media. If input does
88: not come from a regular disk file, the file modification time is set
89: to the time at which compression started.
90:
91: The time stamp is useful mainly when one gzip file is transferred over
92: a network. In this case it would not help to keep ownership
93: attributes. In the local case, the ownership attributes are preserved
94: by gzip when compressing/decompressing the file. A time stamp of zero
95: is ignored.
96:
97: Bit 0 in the flags is only an optional indication, which can be set by
98: a small lookahead in the input data. In case of doubt, the flag is
99: cleared indicating binary data. For systems which have different
100: file formats for ascii text and binary data, the decompressor can
101: use the flag to choose the appropriate format.
102:
103: It must be possible to detect the end of the compressed data with any
104: compression format, regardless of the actual size of the compressed
105: data. If the compressed data cannot fit in one file (in particular for
106: diskettes), each part starts with a header as described above, but
107: only the last part has the crc32 and uncompressed size. A decompressor
108: may prompt for additional data for multipart compressed files. It is
109: desirable but not mandatory that multiple parts be extractable
110: independently so that partial data can be recovered if one of the
111: parts is damaged. This is possible only if no compression state is
112: kept from one part to the other. The compression-type dependent flags
113: can indicate this.
114:
115: If the file being compressed is on a file system with case insensitive
116: names, the original name field must be forced to lower case. There is
117: no original file name if the data was compressed from standard input.
118:
119: On operating systems which support multiple extensions, and on
120: original files without extension, the extension ".z" is added to the
121: original file name (foo.c => foo.c.z). Otherwise the string "z" (or
122: "-z" for VMS) is added to the original extension (foo.c => foo.cz or
123: foo.c-z). The original name or extension is truncated if necessary,
124: but in this case the original name is always saved in the compressed
125: file (foo.doc => foo.doz).
126:
127: Compression is always performed, even if the compressed file is
128: slightly larger than the original. The worst case expansion is
129: a few bytes for the gzip file header, plus 5 bytes every 32K block,
130: or an expansion ratio of 0.015% for large files.
131:
132: The encryption is that of zip 1.9. For the encryption check, the
133: last byte of the decoded encryption header must be zero. The time
134: stamp of an encrypted file might be set to zero to avoid giving a clue
135: about the construction of the random header.
136:
137: Jean-loup Gailly
138: [email protected]
139:
140: References:
141:
142: [LZ77] Ziv J., Lempel A., "A Universal Algorithm for Sequential Data
143: Compression", IEEE Transactions on Information Theory", Vol. 23, No. 3,
144: pp. 337-343.
145:
146: APPNOTE.TXT documentation file in PKZIP 1.93a. It is available by
147: ftp in ux1.cso.uiuc.edu:/pc/exec-pc/pkz193a.exe [128.174.5.59]
148: Use "unzip pkz193a.exe APPNOTE.TXT" to extract.
This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.