java.lang.Object
org.apache.commons.compress.compressors.lz77support.LZ77Compressor

public class LZ77Compressor extends Object
Helper class for compression algorithms that use the ideas of LZ77.

Most LZ77 derived algorithms split input data into blocks of uncompressed data (called literal blocks) and back-references (pairs of offsets and lengths) that state "add length bytes that are the same as those already written starting offset bytes before the current position. The details of how those blocks and back-references are encoded are quite different between the algorithms and some algorithms perform additional steps (Huffman encoding in the case of DEFLATE for example).

This class attempts to extract the core logic - finding back-references - so it can be re-used. It follows the algorithm explained in section 4 of RFC 1951 (DEFLATE) and currently doesn't implement the "lazy match" optimization. The three-byte hash function used in this class is the same as the one used by zlib and InfoZIP's ZIP implementation of DEFLATE. The whole class is strongly inspired by InfoZIP's implementation.

LZ77 is used vaguely here (as well as many other places that talk about it :-), LZSS would likely be closer to the truth but LZ77 has become the synonym for a whole family of algorithms.

The API consists of a compressor that is fed bytes and emits LZ77Compressor.Blocks to a registered callback where the blocks represent either literal blocks, back-references or end of data markers. In order to ensure the callback receives all information, the #finish method must be used once all data has been fed into the compressor.

Several parameters influence the outcome of the "compression":

windowSize
the size of the sliding window, must be a power of two - this determines the maximum offset a back-reference can take. The compressor maintains a buffer of twice of windowSize - real world values are in the area of 32k.
minBackReferenceLength
Minimal length of a back-reference found. A true minimum of 3 is hard-coded inside of this implementation but bigger lengths can be configured.
maxBackReferenceLength
Maximal length of a back-reference found.
maxOffset
Maximal offset of a back-reference.
maxLiteralLength
Maximal length of a literal block.
Since:
1.14
See Also:
  • "https://tools.ietf.org/html/rfc1951#section-4"
  • Field Details

    • THE_EOD

      private static final LZ77Compressor.Block THE_EOD
    • NUMBER_OF_BYTES_IN_HASH

      static final int NUMBER_OF_BYTES_IN_HASH
      See Also:
    • NO_MATCH

      private static final int NO_MATCH
      See Also:
    • params

      private final Parameters params
    • callback

      private final LZ77Compressor.Callback callback
    • window

      private final byte[] window
    • prev

      private final int[] prev
    • wMask

      private final int wMask
    • initialized

      private boolean initialized
    • currentPosition

      private int currentPosition
    • lookahead

      private int lookahead
    • insertHash

      private int insertHash
    • blockStart

      private int blockStart
    • matchStart

      private int matchStart
    • missedInserts

      private int missedInserts
    • HASH_SIZE

      private static final int HASH_SIZE
      See Also:
    • HASH_MASK

      private static final int HASH_MASK
      See Also:
    • H_SHIFT

      private static final int H_SHIFT
      See Also:
  • Constructor Details

    • LZ77Compressor

      public LZ77Compressor(Parameters params, LZ77Compressor.Callback callback)
      Initializes a compressor with parameters and a callback.
      Parameters:
      params - the parameters
      callback - the callback
      Throws:
      NullPointerException - if either parameter is null
  • Method Details

    • compress

      public void compress(byte[] data) throws IOException
      Feeds bytes into the compressor which in turn may emit zero or more blocks to the callback during the execution of this method.
      Parameters:
      data - the data to compress - must not be null
      Throws:
      IOException - if the callback throws an exception
    • compress

      public void compress(byte[] data, int off, int len) throws IOException
      Feeds bytes into the compressor which in turn may emit zero or more blocks to the callback during the execution of this method.
      Parameters:
      data - the data to compress - must not be null
      off - the start offset of the data
      len - the number of bytes to compress
      Throws:
      IOException - if the callback throws an exception
    • finish

      public void finish() throws IOException
      Tells the compressor to process all remaining data and signal end of data to the callback.

      The compressor will in turn emit at least one block (LZ77Compressor.EOD) but potentially multiple blocks to the callback during the execution of this method.

      Throws:
      IOException - if the callback throws an exception
    • prefill

      public void prefill(byte[] data)
      Adds some initial data to fill the window with.

      This is used if the stream has been cut into blocks and back-references of one block may refer to data of the previous block(s). One such example is the LZ4 frame format using block dependency.

      Parameters:
      data - the data to fill the window with.
      Throws:
      IllegalStateException - if the compressor has already started to accept data
    • nextHash

      private int nextHash(int oldHash, byte nextByte)
      Assumes we are calculating the hash for three consecutive bytes as a rolling hash, i.e. for bytes ABCD if H is the hash of ABC the new hash for BCD is nextHash(H, D).

      The hash is shifted by five bits on each update so all effects of A have been swapped after the third update.

    • doCompress

      private void doCompress(byte[] data, int off, int len) throws IOException
      Throws:
      IOException
    • slide

      private void slide() throws IOException
      Throws:
      IOException
    • initialize

      private void initialize()
    • compress

      private void compress() throws IOException
      Throws:
      IOException
    • insertString

      private int insertString(int pos)
      Inserts the current three byte sequence into the dictionary and returns the previous head of the hash-chain.

      Updates insertHash and prev as a side effect.

    • longestMatchForNextPosition

      private int longestMatchForNextPosition(int prevMatchLength)
    • insertStringsInMatch

      private void insertStringsInMatch(int matchLength)
    • catchUpMissedInserts

      private void catchUpMissedInserts()
    • flushBackReference

      private void flushBackReference(int matchLength) throws IOException
      Throws:
      IOException
    • flushLiteralBlock

      private void flushLiteralBlock() throws IOException
      Throws:
      IOException
    • longestMatch

      private int longestMatch(int matchHead)
      Searches the hash chain for real matches and returns the length of the longest match (0 if none were found) that isn't too far away (WRT maxOffset).

      Sets matchStart to the index of the start position of the longest match as a side effect.