coherent/g/usr/bin/gzip/algorithm.doc - annotate

Return to algorithm.doc CVS log
Up to [MW Coherent from dump] / coherent / g / usr / bin / gzip
Annotation of coherent/g/usr/bin/gzip/algorithm.doc, revision 1.1

1.1     ! root        1: 1. Algorithm
        !             2: 
        !             3: The deflation algorithm used by zip and gzip is a variation of LZ77
        !             4: (Lempel-Ziv 1977, see reference below). It finds duplicated strings in
        !             5: the input data.  The second occurrence of a string is replaced by a
        !             6: pointer to the previous string, in the form of a pair (distance,
        !             7: length).  Distances are limited to 32K bytes, and lengths are limited
        !             8: to 258 bytes. When a string does not occur anywhere in the previous
        !             9: 32K bytes, it is emitted as a sequence of literal bytes.  (In this
        !            10: description, 'string' must be taken as an arbitrary sequence of bytes,
        !            11: and is not restricted to printable characters.)
        !            12: 
        !            13: Literals or match lengths are compressed with one Huffman tree, and
        !            14: match distances are compressed with another tree. The trees are stored
        !            15: in a compact form at the start of each block. The blocks can have any
        !            16: size (except that the compressed data for one block must fit in
        !            17: available memory). A block is terminated when zip determines that it
        !            18: would be useful to start another block with fresh trees. (This is
        !            19: somewhat similar to compress.)
        !            20: 
        !            21: Duplicated strings are found using a hash table. All input strings of
        !            22: length 3 are inserted in the hash table. A hash index is computed for
        !            23: the next 3 bytes. If the hash chain for this index is not empty, all
        !            24: strings in the chain are compared with the current input string, and
        !            25: the longest match is selected.
        !            26: 
        !            27: The hash chains are searched starting with the most recent strings, to
        !            28: favor small distances and thus take advantage of the Huffman encoding.
        !            29: The hash chains are singly linked. There are no deletions from the
        !            30: hash chains, the algorithm simply discards matches that are too old.
        !            31: 
        !            32: To avoid a worst-case situation, very long hash chains are arbitrarily
        !            33: truncated at a certain length, determined by a runtime option (zip -1
        !            34: to -9). So zip does not always find the longest possible match but
        !            35: generally finds a match which is long enough.
        !            36: 
        !            37: zip also defers the selection of matches with a lazy evaluation
        !            38: mechanism. After a match of length N has been found, zip searches for a
        !            39: longer match at the next input byte. If a longer match is found, the
        !            40: previous match is truncated to a length of one (thus producing a single
        !            41: literal byte) and the longer match is emitted afterwards.  Otherwise,
        !            42: the original match is kept, and the next match search is attempted only
        !            43: N steps later.
        !            44: 
        !            45: The lazy match evaluation is also subject to a runtime parameter. If
        !            46: the current match is long enough, zip reduces the search for a longer
        !            47: match, thus speeding up the whole process. If compression ratio is more
        !            48: important than speed, zip attempts a complete second search even if
        !            49: the first match is already long enough.
        !            50: 
        !            51: 
        !            52: 2. gzip file format
        !            53: 
        !            54: The pkzip format imposes a lot of overhead in various headers, which
        !            55: are useful for an archiver but not necessary when only one file is
        !            56: compressed. gzip uses a much simpler structure. Numbers are in little
        !            57: endian format, and bit 0 is the least significant bit.
        !            58: A gzip file is a sequence of compressed members. Each member has the
        !            59: following structure:
        !            60: 
        !            61: 2 bytes  magic header  0x1f, 0x8b (\037 \213)  
        !            62: 1 byte   compression method (0..7 reserved, 8 = deflate)
        !            63: 1 byte   flags
        !            64:             bit 0 set: file probably ascii text
        !            65:             bit 1 set: continuation of multi-part gzip file
        !            66:             bit 2 set: extra field present
        !            67:             bit 3 set: original file name present
        !            68:             bit 4 set: file comment present
        !            69:             bit 5 set: file is encrypted
        !            70:             bit 6,7:   reserved
        !            71: 4 bytes  file modification time in Unix format
        !            72: 1 byte   extra flags (depend on compression method)
        !            73: 1 byte   operating system on which compression took place
        !            74: 
        !            75: 2 bytes  optional part number (second part=1)
        !            76: 2 bytes  optional extra field length
        !            77: ? bytes  optional extra field
        !            78: ? bytes  optional original file name, zero terminated
        !            79: ? bytes  optional file comment, zero terminated
        !            80: 12 bytes optional encryption header
        !            81: ? bytes  compressed data
        !            82: 4 bytes  crc32
        !            83: 4 bytes  uncompressed input size modulo 2^32
        !            84: 
        !            85: The format was designed to allow single pass compression without any
        !            86: backwards seek, and without a priori knowledge of the uncompressed
        !            87: input size or the available size on the output media. If input does
        !            88: not come from a regular disk file, the file modification time is set
        !            89: to the time at which compression started.
        !            90: 
        !            91: The time stamp is useful mainly when one gzip file is transferred over
        !            92: a network. In this case it would not help to keep ownership
        !            93: attributes. In the local case, the ownership attributes are preserved
        !            94: by gzip when compressing/decompressing the file. A time stamp of zero
        !            95: is ignored.
        !            96: 
        !            97: Bit 0 in the flags is only an optional indication, which can be set by
        !            98: a small lookahead in the input data. In case of doubt, the flag is
        !            99: cleared indicating binary data. For systems which have different
        !           100: file formats for ascii text and binary data, the decompressor can
        !           101: use the flag to choose the appropriate format.
        !           102: 
        !           103: It must be possible to detect the end of the compressed data with any
        !           104: compression format, regardless of the actual size of the compressed
        !           105: data. If the compressed data cannot fit in one file (in particular for
        !           106: diskettes), each part starts with a header as described above, but
        !           107: only the last part has the crc32 and uncompressed size. A decompressor
        !           108: may prompt for additional data for multipart compressed files. It is
        !           109: desirable but not mandatory that multiple parts be extractable
        !           110: independently so that partial data can be recovered if one of the
        !           111: parts is damaged. This is possible only if no compression state is
        !           112: kept from one part to the other. The compression-type dependent flags
        !           113: can indicate this.
        !           114: 
        !           115: If the file being compressed is on a file system with case insensitive
        !           116: names, the original name field must be forced to lower case. There is
        !           117: no original file name if the data was compressed from standard input.
        !           118: 
        !           119: On operating systems which support multiple extensions, and on
        !           120: original files without extension, the extension ".z" is added to the
        !           121: original file name (foo.c => foo.c.z). Otherwise the string "z" (or
        !           122: "-z" for VMS) is added to the original extension (foo.c => foo.cz or
        !           123: foo.c-z). The original name or extension is truncated if necessary,
        !           124: but in this case the original name is always saved in the compressed
        !           125: file (foo.doc => foo.doz).
        !           126: 
        !           127: Compression is always performed, even if the compressed file is
        !           128: slightly larger than the original. The worst case expansion is
        !           129: a few bytes for the gzip file header, plus 5 bytes every 32K block,
        !           130: or an expansion ratio of 0.015% for large files.
        !           131: 
        !           132: The encryption is that of zip 1.9. For the encryption check, the
        !           133: last byte of the decoded encryption header must be zero. The time
        !           134: stamp of an encrypted file might be set to zero to avoid giving a clue
        !           135: about the construction of the random header.
        !           136: 
        !           137: Jean-loup Gailly
        !           138: [email protected]
        !           139: 
        !           140: References:
        !           141: 
        !           142: [LZ77] Ziv J., Lempel A., "A Universal Algorithm for Sequential Data
        !           143: Compression", IEEE Transactions on Information Theory", Vol. 23, No. 3,
        !           144: pp. 337-343.
        !           145: 
        !           146: APPNOTE.TXT documentation file in PKZIP 1.93a. It is available by
        !           147: ftp in ux1.cso.uiuc.edu:/pc/exec-pc/pkz193a.exe [128.174.5.59]
        !           148: Use "unzip pkz193a.exe APPNOTE.TXT" to extract.
unix.superglobalmegacorp.com
This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.