Optimizations enabled by -ffast-math
This blog post describes the optimizations enabled by -ffast-math when compiling C or C++ code with GCC 11 for x86_64 Linux (other languages/operating systems/CPU architectures may enable slightly different optimizations).
-ffast-math
Most of the “fast math” optimizations can be enabled/disabled individually, and -ffast-math enables all of them:^{1}

* -ffinite-math-only
* -fno-signed-zeros
* -fno-trapping-math
* -fassociative-math
* -fno-math-errno
* -freciprocal-math
* -funsafe-math-optimizations
* -fcx-limited-range
When compiling standard C (that is, when using -std=c99 etc. instead of the default “GNU C” dialect), -ffast-math also enables -ffp-contract=fast, allowing the compiler to combine multiplication and addition instructions with an FMA instruction (godbolt). C++ and GNU C are not affected, as -ffp-contract=fast is already the default for them.
Linking using gcc with -ffast-math does one additional thing – it makes the CPU treat subnormal numbers as 0.0.
Treating subnormal numbers as 0.0
The floating-point format has a special representation for values that are close to 0.0. These “subnormal” numbers (also called “denormals”) are very costly^{2} in some cases because the CPU handles subnormal results using microcode exceptions.
The x86_64 CPU has a feature to treat subnormal input as 0.0 and flush subnormal results to 0.0, eliminating this performance penalty. This can be enabled by
#define MXCSR_DAZ (1 << 6)   /* Enable denormals are zero mode */
#define MXCSR_FTZ (1 << 15)  /* Enable flush to zero mode */

unsigned int mxcsr = __builtin_ia32_stmxcsr();
mxcsr |= MXCSR_DAZ | MXCSR_FTZ;
__builtin_ia32_ldmxcsr(mxcsr);
Linking using gcc with -ffast-math makes it disable subnormal numbers for the application by adding this code in a global constructor that runs before main.
-ffinite-math-only and -fno-signed-zeros
Many optimizations are prevented by properties of the floating-point values NaN, Inf, and -0.0. For example:

* x + 0.0 cannot be optimized to x because that is not true when x is -0.0.
* x - x cannot be optimized to 0.0 because that is not true when x is NaN or Inf.
* x * 0.0 cannot be optimized to 0.0 because that is not true when x is NaN or Inf.

The compiler cannot transform
if (x > y) { do_something(); } else { do_something_else(); }
to the form
if (x <= y) { do_something_else(); } else { do_something(); }
(which is a useful optimization when simplifying control flow and ensuring that the common case is handled without taking branches) because that is not true when x or y is NaN (both x > y and x <= y are false in that case).
-ffinite-math-only and -fno-signed-zeros tell the compiler that no calculation will produce NaN, Inf, or -0.0, so the compiler can then do the kind of optimization described above.
Note: The program may behave in strange ways (such as not evaluating either the true or the false part of an if statement) if calculations produce Inf, NaN, or -0.0 when these flags are used.
-fno-trapping-math
It is possible to enable trapping of floating-point exceptions by using the GNU libc function feenableexcept to generate the signal SIGFPE when a floating-point instruction overflows, underflows, generates NaN, etc.
For example, the function compute below does some calculations that overflow to Inf, which makes the program terminate with SIGFPE when FE_OVERFLOW is enabled.
// Compile as "gcc example.c -D_GNU_SOURCE -O2 -lm"
#include <stdio.h>
#include <fenv.h>

void compute(void) {
  float f = 2.0;
  for (int i = 0; i < 7; ++i) {
    f = f * f;
    printf("%d: f = %f\n", i, f);
  }
}

int main(void) {
  compute();
  printf("\nWith overflow exceptions:\n");
  feenableexcept(FE_OVERFLOW);
  compute();
  return 0;
}
This means that the compiler cannot schedule floating-point instructions to execute speculatively, as the speculated instruction could then generate a signal for cases where the original program did not. For example, the calculation x/y is constant in the loop below, but the compiler cannot hoist it out of the loop, because that would execute x/y even for cases where no element of arr is larger than 0.0, and the function could therefore crash for cases where the original program did not (godbolt).
double arr[1024];

void foo(int n, double x, double y) {
  for (int i = 0; i < n; ++i) {
    if (arr[i] > 0.0)
      arr[i] = x / y;
  }
}
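With -fno-trapping-math the compiler may evaluate the invariant division once, as if the function had been written like this (a sketch of the transformed code, not GCC's actual output):

```c
double arr[1024];

/* Sketch: under -fno-trapping-math, hoisting x/y out of the loop is
   allowed even though the original program might never execute it. */
void foo(int n, double x, double y) {
    double t = x / y;
    for (int i = 0; i < n; ++i) {
        if (arr[i] > 0.0)
            arr[i] = t;
    }
}
```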
One other fun special case is C floating-point atomics. The C standard requires that floating-point exceptions are discarded when doing compound assignment (see “compound assignment” in the C standard), so the compiler must insert extra code unless trapping math is disabled (godbolt).
Passing -fno-trapping-math tells the compiler that the program will not enable floating-point exceptions, and the compiler can then do these optimizations.
-fassociative-math
-fassociative-math allows reassociation of operands in a series of floating-point operations (as well as a few more general reordering optimizations). Most of the optimizations also need -fno-trapping-math, -ffinite-math-only, and -fno-signed-zeros.
Some examples of -fassociative-math optimizations:
| Original | Optimized |
| --- | --- |
| (X + Y) - X | Y |
| (X * Z) + (Y * Z) | (X + Y) * Z |
| (X * C) + X | X * (C + 1.0) when C is a constant |
| (C1 / X) * C2 | (C1 * C2) / X when C1 and C2 are constants |
| (C1 - X) < C2 | (C1 - C2) < X when C1 and C2 are constants |
Reassociation is especially useful for vectorization. Consider, for example, the loop (godbolt)
float a[1024];

float foo(void) {
  float sum = 0.0f;
  for (int i = 0; i < 1024; ++i) {
    sum += a[i];
  }
  return sum;
}
All additions to sum are made serially, so the calculations cannot normally be vectorized. But -fassociative-math permits the compiler to change the order to
sum = (a[0] + a[4] + ... + a[1020])
    + (a[1] + a[5] + ... + a[1021])
    + (a[2] + a[6] + ... + a[1022])
    + (a[3] + a[7] + ... + a[1023]);
and can compile the loop as if it was written as
float a[1024];

float foo(void) {
  float sum0 = 0.0f, sum1 = 0.0f, sum2 = 0.0f, sum3 = 0.0f;
  for (int i = 0; i < 1024; i += 4) {
    sum0 += a[i];
    sum1 += a[i + 1];
    sum2 += a[i + 2];
    sum3 += a[i + 3];
  }
  return sum0 + sum1 + sum2 + sum3;
}
which is easy to vectorize.
-fno-math-errno
The C mathematical functions may set errno if called with invalid input. This possible side effect means that the compiler must call the libc function instead of using instructions that can calculate the result directly.
The compiler can, in some cases, mitigate the problem by not calling the function when it knows the operation will succeed. For example, (godbolt)
double foo(double x) {
  return sqrt(x);
}
is compiled to code using the sqrtsd instruction when the input is in range and only calls sqrt for input that will return NaN:
foo:
        pxor    xmm1, xmm1
        ucomisd xmm1, xmm0
        ja      .L10
        sqrtsd  xmm0, xmm0
        ret
.L10:
        jmp     sqrt
This eliminates most of the overhead (as the comparison/branch is essentially free when predicted correctly), but this extra branch and function call makes life harder for other optimizations (such as vectorization).
-fno-math-errno makes GCC optimize all math functions as if they do not set errno (that is, the compiler does not need to call the libc functions if the architecture has suitable instructions).
Non-math functions
One surprising effect of -fno-math-errno is that it makes GCC believe that memory-allocating libc functions (such as malloc and strdup) do not set errno. This can be seen in the function below, where -fno-math-errno makes GCC optimize away the call to perror (godbolt)
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

void *foo(size_t size) {
  errno = 0;
  void *p = malloc(size);
  if (p == NULL) {
    if (errno)
      perror("error");
    exit(1);
  }
  return p;
}
This is reported as GCC bug 88576.
-freciprocal-math
-freciprocal-math allows the compiler to compute x/y as x*(1/y). This is useful for code of the form (godbolt)
float length = sqrtf(x*x + y*y + z*z);
x = x / length;
y = y / length;
z = z / length;
where the compiler can now generate the code as if it had been written as
float t = 1.0f / sqrtf(x*x + y*y + z*z);
x = x * t;
y = y * t;
z = z * t;
This optimization generates more instructions, but the resulting code is, in general, better, as multiplication is faster than division and can execute on more ports in the CPU.
-funsafe-math-optimizations
-funsafe-math-optimizations enables various “mathematically correct” optimizations that may change the result because of how floating-point numbers work. Some examples of this are
| Original | Optimized |
| --- | --- |
| sqrt(x)*sqrt(x) | x |
| sqrt(x)*sqrt(y) | sqrt(x*y) |
| exp(x)*exp(y) | exp(x+y) |
| x/exp(y) | x*exp(-y) |
| x*pow(x,c) | pow(x,c+1) |
| pow(x,0.5) | sqrt(x) |
| (int)logb(d) | ilogb(d) |
| sin(x)/cos(x) | tan(x) |
Note: Many of these optimizations need additional flags, such as -ffinite-math-only and -fno-math-errno, in order to trigger.
-funsafe-math-optimizations also enables

* -fno-signed-zeros
* -fno-trapping-math
* -fassociative-math
* -freciprocal-math

as well as -ffp-contract=fast (for standard C) and treating subnormal numbers as 0.0 in the same way as described for -ffast-math.
-fcx-limited-range
The mathematical formulas for multiplying and dividing complex numbers are \[(a + ib) \times (c + id) = (ac - bd) + i(bc + ad)\] \[\frac{a + ib}{c + id} = \frac{ac + bd}{c^2 + d^2} + i\frac{bc - ad}{c^2 + d^2}\] but these do not work well for floating-point values.
One problem is that the calculations may turn overflowed values into NaN instead of Inf, so the implementation needs to adjust. Multiplication ends up with code similar to
double complex
mul(double a, double b, double c, double d) {
  double ac, bd, ad, bc, x, y;
  double complex res;

  ac = a * c;
  bd = b * d;
  ad = a * d;
  bc = b * c;
  x = ac - bd;
  y = ad + bc;
  if (isnan(x) && isnan(y)) {
    /* Recover infinities that computed as NaN + iNaN. */
    _Bool recalc = 0;
    if (isinf(a) || isinf(b)) {
      /* z is infinite. "Box" the infinity and change NaNs
       * in the other factor to 0. */
      a = copysign(isinf(a) ? 1.0 : 0.0, a);
      b = copysign(isinf(b) ? 1.0 : 0.0, b);
      if (isnan(c)) c = copysign(0.0, c);
      if (isnan(d)) d = copysign(0.0, d);
      recalc = 1;
    }
    if (isinf(c) || isinf(d)) {
      /* w is infinite. "Box" the infinity and change NaNs
       * in the other factor to 0. */
      c = copysign(isinf(c) ? 1.0 : 0.0, c);
      d = copysign(isinf(d) ? 1.0 : 0.0, d);
      if (isnan(a)) a = copysign(0.0, a);
      if (isnan(b)) b = copysign(0.0, b);
      recalc = 1;
    }
    if (!recalc
        && (isinf(ac) || isinf(bd)
            || isinf(ad) || isinf(bc))) {
      /* Recover infinities from overflow by changing NaNs
       * to 0. */
      if (isnan(a)) a = copysign(0.0, a);
      if (isnan(b)) b = copysign(0.0, b);
      if (isnan(c)) c = copysign(0.0, c);
      if (isnan(d)) d = copysign(0.0, d);
      recalc = 1;
    }
    if (recalc) {
      x = INFINITY * (a * c - b * d);
      y = INFINITY * (a * d + b * c);
    }
  }
  __real__ res = x;
  __imag__ res = y;
  return res;
}
One other problem is that the calculations can overflow even when the result of the operation is in range. This is especially problematic for division, so the implementation of division needs to add even more extra code (see the C standard for more details).
-fcx-limited-range makes the compiler use the usual mathematical formulas for complex multiplication/division.

1. It also enables -fno-signaling-nans, -fno-rounding-math, and -fexcess-precision=fast, which are enabled by default when compiling C or C++ code for x86_64 Linux, so I will not describe them in this blog post. ↩
2. For example, Agner Fog’s microarchitecture document says that Broadwell has a penalty of approximately 124 clock cycles when an operation on normal numbers gives a subnormal result. ↩