In computing, especially digital signal processing, multiply-accumulate is a common operation that computes the product of two numbers and adds that product to an accumulator.
When done with floating-point numbers, it may be performed with two roundings (typical in many DSPs) or with a single rounding. When performed with a single rounding, it is called a fused multiply-add (FMA) or fused multiply-accumulate (FMAC).
Modern computers may contain a dedicated multiply-accumulate unit, or MAC unit, consisting of a multiplier implemented in combinational logic followed by an adder and an accumulator register which stores the result when clocked. The output of the register is fed back to one input of the adder, so that on each clock the output of the multiplier is added to the register. Combinational multipliers require a large amount of logic, but can compute a product much more quickly than the method of shifting and adding typical of earlier computers. The first processors to be equipped with MAC units were digital signal processors, but the technique is now also common in general-purpose processors.
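The feedback path described above can be sketched in software. The following is only a toy model, not a description of any particular hardware: the class name `MacUnit`, the helper `dot`, and the 32-bit default register width are all assumptions made for illustration.

```python
class MacUnit:
    """Toy model of a MAC unit: multiplier -> adder -> accumulator register.

    The register width is a hypothetical parameter; real units differ.
    """

    def __init__(self, width_bits=32):
        self.mask = (1 << width_bits) - 1  # register wraps modulo 2**width_bits
        self.acc = 0                       # accumulator register, cleared at start

    def clock(self, a, b):
        """One clock: multiply the inputs and add the product to the register."""
        product = a * b                              # combinational multiplier
        self.acc = (self.acc + product) & self.mask  # adder fed by the register output
        return self.acc


def dot(xs, ys):
    """Dot product of two integer sequences, one MAC operation per element pair."""
    mac = MacUnit()
    for x, y in zip(xs, ys):
        mac.clock(x, y)
    return mac.acc


print(dot([1, 2, 3], [4, 5, 6]))  # 4 + 10 + 18 = 32
```

Each call to `clock` stands for one clock edge, so a dot product of n terms takes n clocks in this model.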
When done with integers, the operation is typically exact (computed modulo some power of 2). However, floating-point numbers have only finite precision, so the result of each operation may have to be rounded. As a consequence, floating-point arithmetic is generally not associative or distributive. (See Floating point § Accuracy problems.)
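A short experiment illustrates the contrast. The 32-bit width and the helper name `mac_u32` below are assumptions chosen for the example; the floating-point values are ordinary double-precision numbers.

```python
MASK32 = 0xFFFFFFFF  # assumed accumulator width of 32 bits


def mac_u32(acc, b, c):
    """Integer multiply-accumulate, exact modulo 2**32."""
    return (acc + b * c) & MASK32


# Integer accumulation gives the same result in any order.
terms = [(0xFFFFFFFF, 3), (12345, 67890), (0x80000000, 2)]
forward = 0
for b, c in terms:
    forward = mac_u32(forward, b, c)
backward = 0
for b, c in reversed(terms):
    backward = mac_u32(backward, b, c)
print(forward == backward)  # True

# Floating-point addition rounds after every step, so grouping matters.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False
print(left, right)    # 0.6000000000000001 0.6
```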
Therefore, it makes a difference to the result whether the multiply-add is performed with two roundings, or in one operation with a single rounding. When performed with a single rounding, the operation is termed a fused multiply-add.
A fused multiply-add is a floating-point multiply-add operation performed in one step, with a single rounding. That is, where an unfused multiply-add would compute the product b×c, round it to N significant bits, add the result to a, and round back to N significant bits, a fused multiply-add would compute the entire sum a+b×c to its full precision before rounding the final result down to N significant bits.
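The difference between the two roundings and the single rounding can be demonstrated with double-precision values (N = 53). The sketch below does not use a hardware FMA; the hypothetical helper `fused_multiply_add` emulates one by computing a+b×c exactly with rational arithmetic and rounding once, and its argument order follows the a+b×c notation used here rather than that of any library function. The input values are chosen so that the product needs more than 53 bits.

```python
from fractions import Fraction


def unfused_multiply_add(a, b, c):
    """a + b*c with two roundings: one after the multiply, one after the add."""
    product = b * c     # rounded to 53 significant bits
    return a + product  # rounded again


def fused_multiply_add(a, b, c):
    """Emulated fused a + b*c: exact rational result, then a single rounding."""
    exact = Fraction(a) + Fraction(b) * Fraction(c)  # no rounding happens here
    return float(exact)                              # the only rounding


b = c = 1.0 + 2.0**-30   # b*c = 1 + 2**-29 + 2**-60 exactly
a = -(1.0 + 2.0**-29)    # cancels everything except the 2**-60 term

print(unfused_multiply_add(a, b, c))            # 0.0
print(fused_multiply_add(a, b, c) == 2.0**-60)  # True
```

In the unfused version the 2^-60 term is lost when the product is rounded, so the subsequent addition cancels to zero; the fused version keeps it because nothing is rounded until the end.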
A fast FMA can speed up and improve the accuracy of many computations that involve the accumulation of products, such as dot products, matrix multiplication, and polynomial evaluation by Horner's rule.
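As one illustration of such a computation, Horner's rule evaluates a polynomial with exactly one multiply-add per coefficient. The sketch below is an assumption-laden example rather than a reference implementation: `fused_multiply_add` is a hypothetical helper that emulates a single-rounding a+b×c with exact rational arithmetic, and `horner_fma` is likewise a name invented here.

```python
from fractions import Fraction


def fused_multiply_add(a, b, c):
    """Emulated fused a + b*c: computed exactly, rounded once (not a hardware FMA)."""
    return float(Fraction(a) + Fraction(b) * Fraction(c))


def horner_fma(coeffs, x):
    """Evaluate a polynomial at x; coeffs are listed from the highest degree down.

    Each step computes coeff + acc*x as one fused operation, so the
    evaluation rounds once per coefficient instead of twice.
    """
    acc = 0.0
    for coeff in coeffs:
        acc = fused_multiply_add(coeff, acc, x)
    return acc


# p(x) = x**2 + 2x + 3 at x = 2
print(horner_fma([1.0, 2.0, 3.0], 2.0))  # 11.0
```

A dot product follows the same pattern, with the running sum as the addend and one pair of vector elements as the two factors.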
When implemented inside a microprocessor, a fused multiply-add can actually be faster than a multiply operation followed by an add, even though standard industrial implementations based on the original IBM RS/6000 design require a 2N-bit adder to compute the sum properly.^{[1]}
A useful benefit of including this instruction is that it allows an efficient software implementation of division and square root operations, thus eliminating the need for dedicated hardware for those operations.
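One common way to build division from multiply-adds is Newton-Raphson iteration on the reciprocal, followed by a residual correction of the quotient. The sketch below assumes that approach; it is not taken from any specific processor's library. The helpers `fused_multiply_add`, `reciprocal`, and `divide`, the linear first guess, and the fixed iteration count are all assumptions, and the FMA itself is emulated with exact rational arithmetic. The code handles only finite positive divisors of ordinary magnitude.

```python
import math
from fractions import Fraction


def fused_multiply_add(a, b, c):
    """Emulated fused a + b*c: computed exactly, rounded once (not a hardware FMA)."""
    return float(Fraction(a) + Fraction(b) * Fraction(c))


def reciprocal(d, iterations=4):
    """Approximate 1/d for finite d > 0 using only multiply-add steps."""
    if not (d > 0.0 and math.isfinite(d)):
        raise ValueError("this sketch handles finite positive divisors only")
    m, e = math.frexp(d)                 # d = m * 2**e with 0.5 <= m < 1
    # Linear first guess for 1/m; the two constants would be precomputed.
    r = 48.0 / 17.0 - (32.0 / 17.0) * m
    for _ in range(iterations):
        err = fused_multiply_add(1.0, -m, r)  # 1 - m*r, rounded once
        r = fused_multiply_add(r, r, err)     # r + r*err, roughly squares the error
    return math.ldexp(r, -e)             # undo the scaling: 1/d = (1/m) * 2**-e


def divide(n, d):
    """Approximate n/d from the reciprocal plus one residual correction."""
    r = reciprocal(d)
    q = n * r                             # first estimate of the quotient
    rem = fused_multiply_add(n, -d, q)    # n - d*q, rounded once
    return fused_multiply_add(q, rem, r)  # q + rem*r


print(abs(divide(1.0, 3.0) - 1.0 / 3.0) < 1e-15)  # True
```

The residual `n - d*q` is the step that benefits most from fusing: computed with two roundings it would mostly cancel to rounding noise, whereas the single rounding preserves the small remainder needed for the correction.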
The FMA operation is included in IEEE 754-2008.
The 1999 standard of the C programming language supports the FMA operation through the fma() standard math library function.
The fused multiply-add operation was introduced as multiply-add fused in the IBM POWER1 processor (1990),^{[2]} and has been added to numerous other processors since then.
FMA will be implemented in AMD processors with FMA4 support. Intel plans to implement FMA3 in processors using its Haswell microarchitecture, due sometime in 2012.^{[4]}
FMA capability is also present in the NVIDIA GeForce 200 Series (GTX 200) GPUs, GeForce 300 Series GPUs and the NVIDIA Tesla C1060 Computing Processor & C2050 / C2070 GPU Computing Processor GPGPUs.^{[5]} FMA has been added to the AMD Radeon line with the 5x00 series.^{[6]}
