Annotation of coherent/g/usr/bin/gzip/algorithm.doc, revision 1.1.1.1

1.1       root        1: 1. Algorithm
                      2: 
                      3: The deflation algorithm used by zip and gzip is a variation of LZ77
                      4: (Lempel-Ziv 1977, see reference below). It finds duplicated strings in
                      5: the input data.  The second occurrence of a string is replaced by a
                      6: pointer to the previous string, in the form of a pair (distance,
                      7: length).  Distances are limited to 32K bytes, and lengths are limited
                      8: to 258 bytes. When a string does not occur anywhere in the previous
                      9: 32K bytes, it is emitted as a sequence of literal bytes.  (In this
                     10: description, 'string' must be taken as an arbitrary sequence of bytes,
                     11: and is not restricted to printable characters.)
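
To make the (distance, length) pairs concrete, here is a minimal decoding sketch. The token representation (an int for a literal byte, a tuple for a back-reference) and the name lz77_decode are assumptions made for illustration; they are not the bit-level format that zip or gzip writes.

```python
def lz77_decode(tokens):
    """Expand a list of tokens into bytes.

    Each token is either an int (a literal byte) or a (distance, length)
    pair pointing back into the output produced so far.
    """
    out = bytearray()
    for tok in tokens:
        if isinstance(tok, int):
            out.append(tok)
        else:
            distance, length = tok
            if not (1 <= distance <= len(out)):
                raise ValueError("distance points before start of output")
            start = len(out) - distance
            # Copy byte by byte: length may exceed distance, in which case
            # the copy reads bytes that this same loop has just written.
            for i in range(length):
                out.append(out[start + i])
    return bytes(out)
```

For example, the tokens 'a', 'b', 'c', (3, 6) expand to "abcabcabc": the pair copies six bytes starting three bytes back, overlapping its own output.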
                     12: 
                     13: Literals or match lengths are compressed with one Huffman tree, and
                     14: match distances are compressed with another tree. The trees are stored
                     15: in a compact form at the start of each block. The blocks can have any
                     16: size (except that the compressed data for one block must fit in
                     17: available memory). A block is terminated when zip determines that it
                     18: would be useful to start another block with fresh trees. (This is
                     19: somewhat similar to compress.)
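
The split into two trees can be sketched as follows. This is only an illustration of plain Huffman code lengths built with a heap; the token representation, the ("len", n) symbols and both function names are assumptions, and the length-limited codes and compact tree storage that zip actually uses are not shown.

```python
import heapq
from collections import Counter

def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a plain Huffman code."""
    freq = Counter(symbols)
    if len(freq) == 1:
        return {next(iter(freq)): 1}
    # Heap entries are (count, tie-breaker, symbols in this subtree).
    heap = [(count, i, [sym]) for i, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freq, 0)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, s1 = heapq.heappop(heap)
        c2, _, s2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        for sym in s1 + s2:
            lengths[sym] += 1
        heapq.heappush(heap, (c1 + c2, tie, s1 + s2))
        tie += 1
    return lengths

def tree_lengths(tokens):
    """Build one code for literals and match lengths, another for distances.

    tokens: ints (literal bytes) or (distance, length) pairs (assumed form).
    """
    litlen, dist = [], []
    for tok in tokens:
        if isinstance(tok, int):
            litlen.append(tok)
        else:
            distance, length = tok
            litlen.append(("len", length))  # shares an alphabet with literals
            dist.append(distance)
    return huffman_code_lengths(litlen), huffman_code_lengths(dist)
```

Frequent symbols end up with short codes, which is why the match finder benefits from producing a skewed distribution of distances.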
                     20: 
                     21: Duplicated strings are found using a hash table. All input strings of
                     22: length 3 are inserted in the hash table. A hash index is computed for
                     23: the next 3 bytes. If the hash chain for this index is not empty, all
                     24: strings in the chain are compared with the current input string, and
                     25: the longest match is selected.
                     26: 
                     27: The hash chains are searched starting with the most recent strings, to
                     28: favor small distances and thus take advantage of the Huffman encoding.
                     29: The hash chains are singly linked. There are no deletions from the
                     30: hash chains; the algorithm simply discards matches that are too old.
                     31: 
                     32: To avoid a worst-case situation, very long hash chains are arbitrarily
                     33: truncated at a certain length, determined by a runtime option (zip -1
                     34: to -9). So zip does not always find the longest possible match but
                     35: generally finds a match which is long enough.
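
The hash-chain search described above can be sketched as follows. This is a simplified model, not the gzip source: a dict keyed by the 3 bytes themselves stands in for the hash index, and the names insert_string, longest_match, scan and max_chain are assumptions.

```python
MIN_MATCH = 3
MAX_MATCH = 258
WINDOW = 32 * 1024

def insert_string(data, pos, head, prev):
    """Put the 3-byte string at pos at the front of its chain."""
    if pos + MIN_MATCH > len(data):
        return
    key = data[pos:pos + MIN_MATCH]
    prev[pos] = head.get(key)   # link to the previous most recent occurrence
    head[key] = pos

def longest_match(data, pos, head, prev, max_chain=32):
    """Return (distance, length) of the best earlier match, or None."""
    if pos + MIN_MATCH > len(data):
        return None
    best_len, best_dist = MIN_MATCH - 1, 0
    limit = min(MAX_MATCH, len(data) - pos)
    cand = head.get(data[pos:pos + MIN_MATCH])
    chain = max_chain
    while cand is not None and chain > 0:
        if pos - cand > WINDOW:
            break   # chains run newest to oldest, so the rest is too old as well
        n = 0
        while n < limit and data[cand + n] == data[pos + n]:
            n += 1
        if n > best_len:   # strict test: on ties the nearer match is kept
            best_len, best_dist = n, pos - cand
        cand = prev[cand]
        chain -= 1
    return (best_dist, best_len) if best_len >= MIN_MATCH else None

def scan(data, max_chain=32):
    """Best match at every position, inserting strings as the scan advances."""
    head, prev = {}, {}
    found = []
    for pos in range(len(data)):
        found.append(longest_match(data, pos, head, prev, max_chain))
        insert_string(data, pos, head, prev)
    return found
```

With the input "abcdXabcYabcd", a full search at the last "abcd" finds the 4-byte match at distance 9, while max_chain=1 stops after the most recent candidate and settles for the 3-byte match at distance 4.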
                     36: 
                     37: zip also defers the selection of matches with a lazy evaluation
                     38: mechanism. After a match of length N has been found, zip searches for a
                     39: longer match at the next input byte. If a longer match is found, the
                     40: previous match is truncated to a length of one (thus producing a single
                     41: literal byte) and the longer match is emitted afterwards.  Otherwise,
                     42: the original match is kept, and the next match search is attempted only
                     43: N steps later.
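
A minimal sketch of this lazy evaluation follows. To keep it short it uses a brute-force match search with no 32K window and no hash chains; best_match and lazy_tokens are hypothetical names, and the output is a token list rather than a compressed bit stream.

```python
MIN_MATCH = 3
MAX_MATCH = 258

def best_match(data, pos):
    """Brute-force longest earlier match for data[pos:] as (distance, length)."""
    best = (0, 0)
    limit = min(MAX_MATCH, len(data) - pos)
    for cand in range(pos - 1, -1, -1):   # nearest candidates first
        n = 0
        while n < limit and data[cand + n] == data[pos + n]:
            n += 1
        if n > best[1]:
            best = (pos - cand, n)
    return best

def lazy_tokens(data):
    """Return literals (ints) and (distance, length) pairs using lazy matching."""
    tokens = []
    pos = 0
    while pos < len(data):
        dist, length = best_match(data, pos)
        if length < MIN_MATCH:
            tokens.append(data[pos])
            pos += 1
            continue
        # A match of length N was found: look one byte ahead before committing.
        if pos + 1 < len(data):
            _, next_len = best_match(data, pos + 1)
            if next_len > length:
                # Truncate the current match to a single literal; the longer
                # match is picked up on the next iteration.
                tokens.append(data[pos])
                pos += 1
                continue
        tokens.append((dist, length))
        pos += length   # the next search starts N bytes later
    return tokens
```

On the input "abc_bcde_abcde" the final "abcde" first matches "abc" (length 3), but the next byte starts a match of length 4 with "bcde", so 'a' is emitted as a literal followed by the pair (6, 4).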
                     44: 
                     45: The lazy match evaluation is also subject to a runtime parameter. If
                     46: the current match is long enough, zip reduces the search for a longer
                     47: match, thus speeding up the whole process. If compression ratio is more
                     48: important than speed, zip attempts a complete second search even if
                     49: the first match is already long enough.
                     50: 
                     51: 
                     52: 2. gzip file format
                     53: 
                     54: The pkzip format imposes a lot of overhead in various headers, which
                     55: are useful for an archiver but not necessary when only one file is
                     56: compressed. gzip uses a much simpler structure. Numbers are in
                     57: little-endian format, and bit 0 is the least significant bit.
                     58: A gzip file is a sequence of compressed members. Each member has the
                     59: following structure:
                     60: 
                     61: 2 bytes  magic header  0x1f, 0x8b (\037 \213)  
                     62: 1 byte   compression method (0..7 reserved, 8 = deflate)
                     63: 1 byte   flags
                     64:             bit 0 set: file probably ASCII text
                     65:             bit 1 set: continuation of multi-part gzip file
                     66:             bit 2 set: extra field present
                     67:             bit 3 set: original file name present
                     68:             bit 4 set: file comment present
                     69:             bit 5 set: file is encrypted
                     70:             bit 6,7:   reserved
                     71: 4 bytes  file modification time in Unix format
                     72: 1 byte   extra flags (depend on compression method)
                     73: 1 byte   operating system on which compression took place
                     74: 
                     75: 2 bytes  optional part number (second part=1)
                     76: 2 bytes  optional extra field length
                     77: ? bytes  optional extra field
                     78: ? bytes  optional original file name, zero terminated
                     79: ? bytes  optional file comment, zero terminated
                     80: 12 bytes optional encryption header
                     81: ? bytes  compressed data
                     82: 4 bytes  crc32
                     83: 4 bytes  uncompressed input size modulo 2^32
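
The member layout above can be read with a short parser. This sketch follows the field order and flag bits exactly as listed in this document; the function names and the dictionary keys are assumptions. Note that the later gzip specification (RFC 1952) assigns bit 1 to a header CRC and no longer defines the continuation and encryption bits, so this is not a general-purpose reader.

```python
import struct
import zlib

FLAG_ASCII, FLAG_CONTINUATION, FLAG_EXTRA = 0x01, 0x02, 0x04
FLAG_NAME, FLAG_COMMENT, FLAG_ENCRYPTED = 0x08, 0x10, 0x20

def parse_member_header(buf):
    """Return (fields, offset of the compressed data) for one member."""
    magic, method, flags, mtime, xflags, os_code = struct.unpack_from(
        "<2sBBIBB", buf, 0)   # little-endian, 10 bytes in total
    if magic != b"\x1f\x8b":
        raise ValueError("not a gzip member")
    fields = {"method": method, "flags": flags, "mtime": mtime,
              "xflags": xflags, "os": os_code}
    pos = 10
    if flags & FLAG_CONTINUATION:
        (fields["part"],) = struct.unpack_from("<H", buf, pos)
        pos += 2
    if flags & FLAG_EXTRA:
        (xlen,) = struct.unpack_from("<H", buf, pos)
        pos += 2
        fields["extra"] = buf[pos:pos + xlen]
        pos += xlen
    for flag, key in ((FLAG_NAME, "name"), (FLAG_COMMENT, "comment")):
        if flags & flag:
            end = buf.index(b"\0", pos)   # zero-terminated string
            fields[key] = buf[pos:end]
            pos = end + 1
    if flags & FLAG_ENCRYPTED:
        fields["encryption_header"] = buf[pos:pos + 12]
        pos += 12
    return fields, pos

def parse_trailer(buf):
    """Return (crc32, uncompressed size modulo 2^32) from the last 8 bytes."""
    return struct.unpack_from("<II", buf, len(buf) - 8)

def trailer_matches(data, crc, size):
    """Check decompressed data against the trailer values."""
    return zlib.crc32(data) & 0xFFFFFFFF == crc and len(data) % 2**32 == size
```

Because the crc32 and the size come after the compressed data, a compressor can write them once the input is exhausted, without seeking back into the header.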
                     84: 
                     85: The format was designed to allow single pass compression without any
                     86: backwards seek, and without a priori knowledge of the uncompressed
                     87: input size or the available size on the output media. If input does
                     88: not come from a regular disk file, the file modification time is set
                     89: to the time at which compression started.
                     90: 
                     91: The time stamp is useful mainly when one gzip file is transferred over
                     92: a network. In this case it would not help to keep ownership
                     93: attributes. In the local case, the ownership attributes are preserved
                     94: by gzip when compressing/decompressing the file. A time stamp of zero
                     95: is ignored.
                     96: 
                     97: Bit 0 in the flags is only an optional indication, which can be set by
                     98: a small lookahead in the input data. In case of doubt, the flag is
                     99: cleared, indicating binary data. For systems which have different
                    100: file formats for ASCII text and binary data, the decompressor can
                    101: use the flag to choose the appropriate format.
                    102: 
                    103: It must be possible to detect the end of the compressed data with any
                    104: compression format, regardless of the actual size of the compressed
                    105: data. If the compressed data cannot fit in one file (in particular for
                    106: diskettes), each part starts with a header as described above, but
                    107: only the last part has the crc32 and uncompressed size. A decompressor
                    108: may prompt for additional data for multipart compressed files. It is
                    109: desirable but not mandatory that multiple parts be extractable
                    110: independently so that partial data can be recovered if one of the
                    111: parts is damaged. This is possible only if no compression state is
                    112: kept from one part to the other. The compression-type dependent flags
                    113: can indicate this.
                    114: 
                    115: If the file being compressed is on a file system with case insensitive
                    116: names, the original name field must be forced to lower case. There is
                    117: no original file name if the data was compressed from standard input.
                    118: 
                    119: On operating systems which support multiple extensions, and on
                    120: original files without extension, the extension ".z" is added to the
                    121: original file name (foo.c => foo.c.z). Otherwise the string "z" (or
                    122: "-z" for VMS) is added to the original extension (foo.c => foo.cz or
                    123: foo.c-z). The original name or extension is truncated if necessary,
                    124: but in this case the original name is always saved in the compressed
                    125: file (foo.doc => foo.doz).
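
The naming rule can be sketched as a small function. The name compressed_name and its parameters are hypothetical, and the 3-character extension limit is an assumption chosen so that the examples in the text come out as shown; truncation of the base name is not modelled.

```python
def compressed_name(name, multiple_extensions=True, max_ext=3, suffix="z"):
    """Derive the compressed file name from the original name.

    multiple_extensions: whether the file system allows names like foo.c.z.
    max_ext: assumed limit on extension length when it does not.
    suffix: "z", or "-z" for VMS.
    """
    if multiple_extensions or "." not in name:
        return name + ".z"
    base, ext = name.rsplit(".", 1)
    # Truncate the original extension if needed so the suffix still fits.
    ext = ext[:max_ext - len(suffix)] + suffix
    return base + "." + ext
```

When the extension is truncated the mapping loses information (foo.doc and foo.dot would both become foo.doz), which is why the original name is stored in the compressed file in that case.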
                    126: 
                    127: Compression is always performed, even if the compressed file is
                    128: slightly larger than the original. The worst case expansion is
                    129: a few bytes for the gzip file header, plus 5 bytes for every 32K block,
                    130: or an expansion ratio of about 0.015% for large files.
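
The 0.015% figure is simply 5 bytes divided by 32768 bytes. The sketch below works this out; the fixed overhead of 18 bytes is an assumption (10-byte fixed header plus the 8-byte crc32 and size fields, with no optional fields), and worst_case_size is a hypothetical name.

```python
BLOCK = 32 * 1024        # input bytes per block
PER_BLOCK_OVERHEAD = 5   # worst-case extra bytes per block
FIXED_OVERHEAD = 18      # assumed: 10-byte header + 8-byte trailer

def worst_case_size(n):
    """Upper bound on the output size for n input bytes, under the assumptions above."""
    blocks = max(1, -(-n // BLOCK))   # ceiling division, at least one block
    return n + FIXED_OVERHEAD + PER_BLOCK_OVERHEAD * blocks

# Per-block expansion as a percentage: 5 / 32768 * 100, roughly 0.0153.
expansion_percent = PER_BLOCK_OVERHEAD / BLOCK * 100
```

For large inputs the fixed overhead becomes negligible and the ratio approaches the per-block value.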
                    131: 
                    132: The encryption is that of zip 1.9. For the encryption check, the
                    133: last byte of the decoded encryption header must be zero. The time
                    134: stamp of an encrypted file might be set to zero to avoid giving a clue
                    135: about the construction of the random header.
                    136: 
                    137: Jean-loup Gailly
                    138: [email protected]
                    139: 
                    140: References:
                    141: 
                    142: [LZ77] Ziv J., Lempel A., "A Universal Algorithm for Sequential Data
                    143: Compression", IEEE Transactions on Information Theory, Vol. 23, No. 3,
                    144: pp. 337-343.
                    145: 
                    146: APPNOTE.TXT documentation file in PKZIP 1.93a. It is available by
                    147: ftp in ux1.cso.uiuc.edu:/pc/exec-pc/pkz193a.exe [128.174.5.59]
                    148: Use "unzip pkz193a.exe APPNOTE.TXT" to extract.