PAQ is a series of data compression archivers that have evolved through collaborative development to top rankings on several benchmarks measuring compression ratio (although at the expense of speed and memory usage). Specialized versions of PAQ have won the Hutter Prize and the Calgary Challenge.^{[1]} PAQ is free software distributed under the GNU General Public License.^{[2]}
PAQ uses a context mixing algorithm. Context mixing is related to Prediction by Partial Matching (PPM) in that the compressor is divided into a predictor and an arithmetic coder, but differs in that the next-symbol prediction is computed using a weighted combination of probability estimates from a large number of models conditioned on different contexts. Unlike PPM, a context need not be contiguous. Most PAQ versions collect next-symbol statistics for the following contexts:

- n-grams: the context is the last n bytes before the predicted symbol (as in PPM);
- whole-word n-grams, ignoring case and nonalphabetic characters (useful in text files);
- "sparse" contexts, for example the second and fourth bytes preceding the predicted symbol (useful in some binary formats);
- "analog" contexts, consisting of the high-order bits of previous 8- or 16-bit words (useful for multimedia files);
- two-dimensional contexts (useful for images, tables, and spreadsheets), where the row length is determined by finding the stride length of repeating byte patterns;
- specialized models for particular formats such as x86 executables or BMP, TIFF, and JPEG images, active only when that file type is detected.
All PAQ versions predict and compress one bit at a time, but differ in the details of the models and in how the predictions are combined and postprocessed. Once the next-bit probability is determined, it is encoded by arithmetic coding. There are three methods for combining predictions, depending on the version.
PAQ1SSE and later versions postprocess the prediction using secondary symbol estimation (SSE). The combined prediction and a small context are used to look up a new prediction in a table. After the bit is encoded, the table entry is adjusted to reduce the prediction error. SSE stages can be pipelined with different contexts, or computed in parallel with the outputs averaged.
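The SSE stage can be sketched as follows. This is a simplified model, assuming an interpolated 33-entry table per context; the real APM/SSE tables in PAQ work in the stretched (logistic) domain and use different bin counts and update rates, so all details here are illustrative.

```python
# Simplified SSE/APM stage: (input probability, small context) -> refined
# probability, with the looked-up entries nudged toward each observed bit.
# The 33-bin interpolated table is an assumption for illustration.

class SSE:
    def __init__(self, n_contexts, n_bins=33):
        # each row starts as the identity mapping p -> p
        self.n_bins = n_bins
        self.table = [[i / (n_bins - 1) for i in range(n_bins)]
                      for _ in range(n_contexts)]
        self.ctx = 0
        self.bin = 0

    def predict(self, p, ctx):
        # locate p between two bins of the selected row and interpolate
        pos = p * (self.n_bins - 1)
        self.ctx = ctx
        self.bin = min(int(pos), self.n_bins - 2)
        frac = pos - self.bin
        row = self.table[ctx]
        return row[self.bin] * (1 - frac) + row[self.bin + 1] * frac

    def update(self, bit, rate=0.02):
        # adjust the two entries used for the last prediction toward the bit
        row = self.table[self.ctx]
        for b in (self.bin, self.bin + 1):
            row[b] += rate * (bit - row[b])
```

Because the table starts as the identity mapping, an untrained stage passes the mixer's prediction through unchanged; corrections accumulate only where a (prediction, context) pair turns out to be systematically biased.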
A string s is compressed to the shortest byte string representing a base-256 big-endian number x in the range [0, 1] such that P(r < s) ≤ x < P(r ≤ s), where P(r < s) is the probability that a random string r with the same length as s will be lexicographically less than s. It is always possible to find an x such that the length of x is at most one byte longer than the Shannon limit, −log_{2} P(r = s) bits. The length of s is stored in the archive header.
The arithmetic coder in PAQ is implemented by maintaining for each prediction a lower and upper bound on x, initially [0, 1]. After each prediction, the current range is split into two parts in proportion to P(0) and P(1), the probability that the next bit of s will be a 0 or 1, respectively, given the previous bits of s. The next bit is then encoded by selecting the corresponding subrange to be the new range.
The number x is decompressed back to string s by making an identical series of bit predictions (since the previous bits of s are known). The range is split as with compression. The portion containing x becomes the new range, and the corresponding bit is appended to s.
In PAQ, the lower and upper bounds of the range are represented in 3 parts. The most significant base-256 digits are identical, so they can be written as the leading bytes of x. The next 4 bytes of each bound are kept in memory; their leading bytes differ. The trailing bits are assumed to be all zeros for the lower bound and all ones for the upper bound. Compression is terminated by writing one more byte from the lower bound.
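The scheme above can be sketched with 32-bit integer bounds. This is a simplification (PAQ's actual coder differs in details such as probability precision), and the class names are illustrative; note how matching leading bytes are emitted immediately, and how padding the input with 0xFF bytes on decode stands in for the assumed all-ones trailing bits of the upper bound.

```python
class Encoder:
    """Binary arithmetic coder keeping 32-bit lower/upper bounds on x."""

    def __init__(self):
        self.x1, self.x2 = 0, 0xFFFFFFFF
        self.out = bytearray()

    def encode(self, bit, p1):
        # split [x1, x2] in proportion to P(1); bit 1 takes the lower part
        xmid = self.x1 + int((self.x2 - self.x1) * p1)
        if bit:
            self.x2 = xmid
        else:
            self.x1 = xmid + 1
        # emit leading bytes as soon as both bounds agree on them
        while (self.x1 >> 24) == (self.x2 >> 24):
            self.out.append(self.x1 >> 24)
            self.x1 = (self.x1 << 8) & 0xFFFFFFFF
            self.x2 = ((self.x2 << 8) & 0xFFFFFFFF) | 0xFF

    def flush(self):
        # one more byte from the lower bound pins x inside the final range
        self.out.append(self.x1 >> 24)
        return bytes(self.out)


class Decoder:
    """Mirrors the encoder's range splits to recover the bits."""

    def __init__(self, data):
        self.data = data + b"\xff" * 4  # pad, matching the implicit 1-bits
        self.pos = 4
        self.x1, self.x2 = 0, 0xFFFFFFFF
        self.x = int.from_bytes(self.data[:4], "big")

    def decode(self, p1):
        xmid = self.x1 + int((self.x2 - self.x1) * p1)
        bit = 1 if self.x <= xmid else 0
        if bit:
            self.x2 = xmid
        else:
            self.x1 = xmid + 1
        # shift out settled bytes, pulling new low bytes of x from the input
        while (self.x1 >> 24) == (self.x2 >> 24):
            self.x1 = (self.x1 << 8) & 0xFFFFFFFF
            self.x2 = ((self.x2 << 8) & 0xFFFFFFFF) | 0xFF
            self.x = ((self.x << 8) & 0xFFFFFFFF) | self.data[self.pos]
            self.pos += 1
        return bit
```

Given the same sequence of probabilities, the decoder's bounds evolve identically to the encoder's, so comparing x against each split point recovers the original bits.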
In PAQ versions through PAQ6, each model maps a set of distinct contexts to a pair of counts, n_{0}, a count of zero bits, and n_{1}, a count of 1 bits. In order to favor recent history, half of the count over 2 is discarded when the opposite bit is observed. For example, if the current state associated with a context is (n_{0},n_{1}) = (12,3) and a 1 is observed, then the counts are updated to (7,4).
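The halving rule can be written as a small helper. This is a direct transcription of the rule above; real PAQ versions store these counts as quantized 8-bit states rather than integer pairs.

```python
def update(n0, n1, bit):
    """Nonstationary counter update: on a mismatch, the opposing count
    keeps 2 plus half of its excess over 2, favoring recent history."""
    if bit == 1:
        n1 += 1
        if n0 > 2:
            n0 = 2 + (n0 - 2) // 2
    else:
        n0 += 1
        if n1 > 2:
            n1 = 2 + (n1 - 2) // 2
    return n0, n1

# the example from the text: (12, 3) plus an observed 1 becomes (7, 4)
```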
A bit is arithmetic coded with space proportional to its probability, either P(1) or P(0) = 1 − P(1). The probabilities are computed by weighted addition of the 0 and 1 counts:

S_{0} = ε + Σ_{i} w_{i}n_{0i}
S_{1} = ε + Σ_{i} w_{i}n_{1i}
S = S_{0} + S_{1}
P(0) = S_{0}/S
P(1) = S_{1}/S

where w_{i} is the weight of the i-th model, n_{0i} and n_{1i} are its counts, and ε is a small positive constant. Through PAQ3, the weights were fixed and set in an ad hoc manner. (Order-n contexts had a weight of n².) Beginning with PAQ4, the weights were adjusted adaptively in the direction that would reduce future errors in the same context set. If the bit to be coded is y, then the weight adjustment is:

n_{i} = n_{0i} + n_{1i}
error = y − P(1)
w_{i} ← max[0, w_{i} + (Sn_{1i} − S_{1}n_{i}) · error / (S_{0}S_{1})]
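In code, this count-weighted mixing might look like the following sketch. The small ε term (avoiding zero probabilities) and the clamp of weights at zero follow the PAQ6-style formulation; the specific values are illustrative.

```python
def mix(models, weights, eps=0.5):
    """models: list of (n0, n1) count pairs, one per context model.
    Returns P(1) by evidence-weighted addition of the counts."""
    s0 = eps + sum(w * n0 for w, (n0, n1) in zip(weights, models))
    s1 = eps + sum(w * n1 for w, (n0, n1) in zip(weights, models))
    return s1 / (s0 + s1)

def adjust(models, weights, y, eps=0.5):
    """Move each weight in the direction that reduces the error on bit y,
    clamping at zero."""
    s0 = eps + sum(w * n0 for w, (n0, n1) in zip(weights, models))
    s1 = eps + sum(w * n1 for w, (n0, n1) in zip(weights, models))
    s = s0 + s1
    err = y - s1 / s
    return [max(0.0, w + err * (s * n1 - s1 * (n0 + n1)) / (s0 * s1))
            for w, (n0, n1) in zip(weights, models)]
```

After observing a 1, a model whose counts lean toward 1 gains weight while a model that predicted 0 loses weight, which is the adaptive behavior introduced in PAQ4.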
Beginning with PAQ7, each model outputs a prediction (instead of a pair of counts). These predictions are averaged in the logistic domain:

x_{i} = stretch(P_{i}(1))
P(1) = squash(Σ_{i} w_{i}x_{i})

where P(1) is the probability that the next bit will be a 1, P_{i}(1) is the probability estimated by the i-th model, and

stretch(x) = ln(x/(1 − x))
squash(x) = 1/(1 + e^{−x})

(squash is the inverse of stretch).
After each prediction, the model is updated by adjusting the weights to minimize coding cost:

w_{i} ← w_{i} + η x_{i}(y − P(1))

where η is the learning rate (typically 0.002 to 0.01), y is the bit being coded, and (y − P(1)) is the prediction error. The weight update differs from backpropagation in that the terms P(1)P(0) are dropped, because the goal of the neural network is to minimize coding cost, not root mean square error.
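A compact sketch of this logistic mixing and its weight update follows. The learning rate and model probabilities are illustrative; real PAQ implementations use fixed-point arithmetic with table-driven stretch and squash functions.

```python
import math

def stretch(p):
    # logit: maps a probability to the logistic domain
    return math.log(p / (1.0 - p))

def squash(x):
    # inverse of stretch: maps back to a probability
    return 1.0 / (1.0 + math.exp(-x))

def mix(probs, weights):
    # combine model probabilities by weighted addition in the logistic domain
    return squash(sum(w * stretch(p) for w, p in zip(weights, probs)))

def update(probs, weights, y, eta=0.005):
    # w_i <- w_i + eta * x_i * (y - P(1)),  with x_i = stretch(P_i(1))
    p = mix(probs, weights)
    return [w + eta * stretch(pi) * (y - p) for w, pi in zip(weights, probs)]
```

Note that a model predicting 0.5 has stretch(0.5) = 0, so it contributes nothing to the sum and its weight is never changed: uninformative models are ignored automatically.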
Most versions of PAQ use a small context to select among sets of weights for the neural network. Some versions use multiple networks whose outputs are combined by one more network prior to the SSE stages. Furthermore, for each input prediction there may be several inputs which are nonlinear functions of P_{i}(1) in addition to stretch(P_{i}(1)).
Each model partitions the known bits of s into a set of contexts, and maps each context to a bit history represented by an 8-bit state. In versions through PAQ6, the state represents a pair of counters (n_{0}, n_{1}). In PAQ7 and later versions, under certain conditions the state also represents the value of the last bit or the entire sequence. The states are mapped to probabilities using a 256-entry table for each model. After a prediction by the model, the table entry is adjusted slightly (typically by 0.4%) to reduce the prediction error.
In all PAQ8 versions, the representable states are as follows:

- The exact bit sequence for up to 4 bits of history.
- A pair of counts and an indicator of the most recent bit for sequences of 5 to 15 bits.
- A pair of counts only for sequences of 16 to 41 bits.
To keep the number of states to 256, the following limits are placed on the representable counts: (41, 0), (40, 1), (12, 2), (5, 3), (4, 4), (3, 5), (2, 12), (1, 40), (0, 41). If a count exceeds this limit, then the next state is one chosen to have a similar ratio of n_{0} to n_{1}. Thus, if the current state is (n_{0} = 4, n_{1} = 4, last bit = 0) and a 1 is observed, then the new state is not (n_{0} = 4, n_{1} = 5, last bit = 1). Rather, it is (n_{0} = 3, n_{1} = 4, last bit = 1).
Most context models are implemented as hash tables. Some small contexts are implemented as direct lookup tables.
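A hashed context model lookup might be sketched as below. This is hypothetical: PAQ's real hash tables additionally store checksums per bucket and use a replacement policy on collisions, both omitted here, and the FNV-1a hash is just one reasonable choice.

```python
class ContextMap:
    """Maps an arbitrary (possibly non-contiguous) context to one slot
    holding an 8-bit bit-history state."""

    def __init__(self, size_bits=16):
        self.mask = (1 << size_bits) - 1
        self.table = bytearray(1 << size_bits)  # one 8-bit state per slot

    def slot(self, context_bytes):
        # mix the context bytes into a table index (FNV-1a, illustrative)
        h = 2166136261
        for b in context_bytes:
            h = ((h ^ b) * 16777619) & 0xFFFFFFFF
        return h & self.mask
```

Because only a hash of the context is stored, the context itself need not be contiguous bytes; any tuple of features can be serialized and hashed the same way.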
Some versions of PAQ, in particular PAsQDa, PAQAR (both PAQ6 derivatives), and PAQ8HP1 through PAQ8HP8 (PAQ8 derivatives and Hutter Prize recipients) preprocess text files by looking up words in an external dictionary and replacing them with 1- to 3-byte codes. In addition, uppercase letters are encoded with a special character followed by the lowercase letter. In the PAQ8HP series, the dictionary is organized by grouping syntactically and semantically related words together. This allows models to use just the most significant bits of the dictionary codes as context.
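A toy illustration of this dictionary transform: frequent words map to short codes, and an escape byte marks capitalized words. The dictionary entries, code bytes, and escape value are all invented for the example; the real dictionaries hold tens of thousands of words.

```python
# hypothetical dictionary: word -> short code byte
DICT = {"the": b"\x80", "compression": b"\x81", "data": b"\x82"}
CAP = b"\x7f"  # escape: the next word was capitalized

def transform(text):
    out = bytearray()
    for word in text.split():
        lower = word.lower()
        if word[:1].isupper():
            out += CAP            # record capitalization separately
        out += DICT.get(lower, lower.encode())  # code, or the word itself
        out += b" "
    return bytes(out)
```

Grouping related words under numerically adjacent codes is what lets later models condition on only the high-order bits of a code, effectively treating a whole word class as one context symbol.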
The following table is a sample from the Large Text Compression Benchmark by Matt Mahoney, whose test file consists of 10^{9} bytes (1 GB, or 0.931 GiB) of English Wikipedia text.
Program    Compressed size (bytes)    % of original size    Compression time (s)    Memory (MiB)
PAQ8HP8    133,423,109                13.34                 64,639                  1,849
PPMd       183,976,014                18.40                    880                    256
bzip2      254,007,875                25.40                    379                      8
InfoZIP    322,649,703                32.26                    104                      0.1
See Comparison of file archivers for a list of file compression benchmarks.
The following lists the major enhancements to the PAQ algorithm. In addition, there have been a large number of incremental improvements, which are omitted.
The series PAQ8HP1 through PAQ8HP8 were released by Alexander Ratushnyak from August 21, 2006 through January 18, 2007 as Hutter Prize submissions. The Hutter Prize is a text compression contest using a 100 MB English and XML data set derived from Wikipedia's source. The PAQ8HP series was forked from PAQ8H. The programs include text preprocessing dictionaries and models tuned specifically to the benchmark. All non-text models were removed. The dictionaries were organized to group syntactically and semantically related words and to group words by common suffix. The former strategy improves compression because related words (which are likely to appear in similar context) can be modeled on the high-order bits of their dictionary codes. The latter strategy makes the dictionary easier to compress. The size of the decompression program and compressed dictionary is included in the contest ranking.
On October 27, 2006, it was announced^{[5]} that PAQ8HP5 won a Hutter Prize for Lossless Compression of Human Knowledge of 3,416 euro.
On June 30, 2007, Ratushnyak's paq8hp12 was awarded a second Hutter Prize of 1,732 euro, improving upon his previous record by 3.46%.
Being free software, the software can be modified and redistributed by anyone who has a copy. This has allowed other authors to fork the PAQ compression engine and add new features such as a graphical user interface or better speed (at the expense of compression ratio). Notable PAQ derivatives include:

