Today we (CS 210 Scientific Computing) finish the floating-point number chapter. I close this section by answering some questions from my question list. Most of the answers come from the Handbook of Floating-Point Arithmetic [1].

Extended precision and double rounding

Two storage precisions (i.e. binary32 and binary64) are used to represent floating-point numbers in memory, and an extended precision (the 80-bit x87 format, binary80 hereafter) may be used for intermediate results.

The actual behavior is determined by the architecture, the operating system (floating-point unit settings), and the compilation options together.

For example, on the x86/Windows/Visual C++ platform, /fp:precise is the default: intermediate results are kept in binary80 precision, and rounding is required before moving them to memory. On the x86-64 platform, however, the SSE floating-point unit is used and intermediate results are kept in the storage precisions, so no extra rounding is needed.

One issue with using extended precision (binary80) is double rounding: an intermediate result is first rounded to the wider 80-bit precision, then rounded again to the narrower storage precision (64 or 32 bits). Rounding twice can give a different answer than rounding the exact value once, which may introduce significant errors.

Even worse, bugs of this kind are hard to inspect with printf or a debugger: those tools force the intermediate results to be spilled to memory, and hence rounded to storage precision, which can conceal the symptom.


Evaluation order and optimizations

The evaluation order of a floating-point expression is not guaranteed. Since IEEE 754 addition and multiplication are not associative, different orders can produce different results.

The use of FMA intrinsics ("contracted" operations) may also change results. One example is computing sqrt(a * a - b * b) as sqrt(fma(a, a, -b * b)) when a and b are equal: the fused multiply-add keeps the rounding error of b * b, which breaks the symmetry between a and b and yields a non-zero (possibly negative) argument, so sqrt may even return NaN.

Optimizations must be applied carefully as well. For example, can we always constant-fold x + 0 into x? No: when x is a negative zero, the IEEE 754 sum -0.0 + 0.0 is +0.0 under every rounding mode except rounding toward negative infinity, so the folding would flip the sign of zero. See also signed zero arithmetic.

Special arithmetic

Be aware that pow(1, x) == 1 and pow(x, 0) == 1 for any x, including NaN. This may be counterintuitive, since it contradicts the usual propagation of NaN.


Random ASCII also has a great series of articles on floating-point numbers; it addresses these issues in more depth.

  1. Muller, Jean-Michel, et al. Handbook of Floating-Point Arithmetic. Springer Science & Business Media, 2009.