The Schönhage–Strassen algorithm is an asymptotically fast multiplication algorithm for large integers. It was developed by Arnold Schönhage and Volker Strassen in 1971.^{[1]} The runtime bit complexity is, in Big O notation, O(N log N log log N), while the arithmetic complexity is O(N log N).^{[2]} The algorithm uses recursive Fast Fourier transforms in rings with 2^{2}n + 1 elements, a specific type of number theoretic transform.
The Schönhage–Strassen algorithm was the asymptotically fastest multiplication method known from 1971 to 2007 when a new method, Fürer's algorithm, was announced with lower asymptotic complexity;^{[3]} however, Fürer's algorithm currently only achieves an advantage for astronomically large values and is not used in practice.
In practice the Schönhage–Strassen algorithm starts to outperform older methods such as Karatsuba and Toom–Cook multiplication for numbers beyond 2^{2}15 to 2^{2}17 (10,000 to 40,000 decimal digits).^{[4]}^{[5]}^{[6]}
Contents 
This section explains in detail how Schönhage–Strassen is implemented. It is based primarily on an overview of the method by Crandall and Pomerance in their Prime Numbers: A Computational Perspective.^{[7]} Another source for detailed information is Knuth's The Art of Computer Programming.^{[8]}
Suppose we are multiplying two numbers like 123 and 456 using long multiplication with base B digits, but without performing any carrying. The result might look something like this:
1  2  3  
×  4  5  6  


6  12  18  
5  10  15  
4  8  12  


4  13  28  27  18 
This sequence (4, 13, 28, 27, 18) is called the acyclic or linear convolution of the two original sequences (1,2,3) and (4,5,6). Once you have the acyclic convolution of two sequences, computing the product of the original numbers is easy: you just perform the carrying (for example, in the rightmost column, you'd keep the 8 and add the 1 to the column containing 27). In the example this yields the correct product 56088.
There are two other types of convolutions that will be useful. Suppose the input sequences have n elements (here 3). Then the acyclic convolution has n+n−1 elements; if we take the rightmost n elements and add the leftmost n−1 elements, this produces the cyclic convolution:
28  27  18  
+  4  13  


28  31  31 
If we perform carrying on the cyclic convolution, the result is equivalent to the product of the inputs mod B^{n} − 1. In the example, 10^{3} − 1 = 999, performing carrying on (28, 31, 31) yields 3141, and 3141 ≡ 56088 (mod 999).
Conversely, if we take the rightmost n elements and subtract the leftmost n−1 elements, this produces the negacyclic convolution:
28  27  18  
−  4  13  


28  23  5 
If we perform carrying on the negacyclic convolution, the result is equivalent to the product of the inputs mod B^{n} + 1. In the example, 10^{3} + 1 = 1001, performing carrying on (28, 23, 5) yields 3035, and 3035 ≡ 56088 (mod 1001). The negacyclic convolution can contain negative numbers, which can be eliminated during carrying using borrowing, as is done in long subtraction.
Like other multiplication methods based on the Fast Fourier transform, Schönhage–Strassen depends fundamentally on the convolution theorem, which provides an efficient way to compute the cyclic convolution of two sequences. It states that:
Or in symbols:
If we compute the DFT and IDFT using a fast Fourier transform algorithm, and invoke our multiplication algorithm recursively to multiply the entries of the transformed vectors DFT(X) and DFT(Y), this yields an efficient algorithm for computing the cyclic convolution.
In this algorithm, it will be more useful to compute the negacyclic convolution; as it turns out, a slightly modified version of the convolution theorem can enable this as well. Suppose the vectors X and Y have length n, and a is a primitive root of unity of order 2n (that is, a^{2n} = 1 and a to all smaller powers is not 1). Then we can define a third vector A, called the weight vector, as:
Now, we can state:
In other words, it's the same as before except that the inputs are first multiplied by A, and the result is multiplied by A^{−1}.
The discrete Fourier transform is an abstract operation that can be performed in any algebraic ring; typically it's performed in the complex numbers, but actually performing complex arithmetic to sufficient precision to ensure accurate results for multiplication is slow and errorprone. Instead, we will use the approach of the number theoretic transform, which is to perform the transform in the integers mod N for some integer N. Just like there are primitive roots of unity of every order in the complex plane, given any order n we can choose a suitable N such that b is a primitive root of unity of order n in the integers mod N (in other words, b^{n} ≡ 1 (mod N), and no smaller power of b is equivalent to 1 mod N).
The algorithm will spend most of its time performing recursive multiplications of smaller numbers; with a naive algorithm, these occur in a number of places:
The key insight to Schönhage–Strassen is to choose N, the modulus, to be equal to 2^{n} + 1 for some integer n. This has a number of benefits in standard systems that represent large integers in binary form:
To make the recursive multiplications convenient, we will frame Schönhage–Strassen as being a specialized multiplication algorithm for computing not just the product of two numbers, but the product of two numbers mod 2^{n} + 1 for some given n. This is not a loss of generality, since one can always choose n large enough so that the product mod 2^{n} + 1 is simply the product.
In the course of the algorithm, there are many cases in which multiplication or division by a power of two (including all roots of unity) can be profitably replaced by a small number of shifts and adds. This makes use of the observation that:
This makes it simple to reduce a number represented in binary mod 2^{n} + 1: take the rightmost (least significant) n bits, subtract the next n bits, add the next n bits, and so on until the bits are exhausted. If the resulting value is still not between 0 and 2^{n}, normalize it by adding or subtracting a multiple of the modulus 2^{n} + 1. For example, if n=3 (and so the modulus is 2^{3}+1 = 9) and the number being reduced is 656, we have:
Moreover, it's possible to effect very large shifts without ever constructing the shifted result. Suppose we have a number A between 0 and 2^{n}, and wish to multiply it by 2^{k}. Dividing k by n we find k = qn + r with r < n. It follows that:
Since A is ≤ 2^{n} and r < n, A shiftleft r has at most 2n−1 bits, and so only one shift and subtraction (followed by normalization) is needed.
Finally, to divide by 2^{k}, observe that squaring the first equivalence above yields:
Hence,
Given an input numbers x and y, and an integer N, the following algorithm computes xy mod 2^{N} + 1:
In step 2, the observation is used that:
The parameter k is fixed and in practice is typically set to a small value such as 5, 6, or 7, depending on the size of the input numbers.
This section explains a number of important practical optimizations that have been considered when implementing Schönhage–Strassen in real systems. It is based primarily on a 2007 work by Gaudry, Kruppa, and Zimmermann describing enhancements to the GNU MultiPrecision Library.^{[9]}
Below a certain cutoff point, it's more efficient to perform the recursive multiplications using other algorithms, such as Toom–Cook multiplication. The results must be reduced mod 2^{n} + 1, which can be done efficiently as explained above with shifts and adds/subtracts.
Computing the IDFT involves dividing each entry by the primitive root of unity 2^{2n/2}k, an operation that is frequently combined with multiplying the vector by A^{1} afterwards, since both involve division by a power of two.
In a system where a large number is represented as an array of 2^{w}bit words, it's useful to ensure that the vector size 2^{k} is also a multiple of the bits per word by choosing k ≥ w; this allows the inputs to be broken up into pieces without bit shifts, and provides a uniform representation for values mod 2^{n} + 1 where the high word can only be zero or one.
Normalization involves adding or subtracting the modulus 2^{n}+1; this value has only two bits set, which means this can be done in constant time on average with a specialized operation.
Iterative FFT algorithms such as the Cooley–Tukey FFT algorithm, although frequently used for FFTs on vectors of complex numbers, tend to exhibit very poor cache locality with the large vector entries used in Schönhage–Strassen. The straightforward recursive, not inplace implementation of FFT is more successful, with all operations fitting in the cache beyond a certain point in the call depth, but still makes suboptimal use of the cache in higher call depths. Gaudry, Kruppa, and Zimmerman used a technique combining Bailey's 4step algorithm with higher radix transforms that combine multiple recursive steps. They also mix phases, going as far into the algorithm as possible on each element of the vector before moving on to the next one.
The "square root of 2 trick", first described by Schönhage, is to note that, provided k ≥ 2, 2^{3n/4}−2^{n/4} is a square root of 2 mod 2^{n}+1, and so a 4nth root of unity (since 2^{2n} ≡ 1). This allows the transform length to be extended from 2^{k} to 2^{k + 1}.
Finally, the authors are careful to choose the right value of k for different ranges of input numbers, noting that the optimal value of k may go back and forth between the same values several times as the input size increases.

