Today we (CS 210 Scientific Computing) finish the floating-point number chapter. I close this section by answering some questions from my question list. Most of the answers come from the *Handbook of Floating-Point Arithmetic*¹.
Extended precision and double rounding
Two storage precisions (i.e. binary32 and binary64) are used to represent floating-point numbers in memory, and an extended-precision format (binary80) may be used for representing intermediate results. The actual behavior is determined by the architecture, the operating system (floating-point unit settings), and the compilation options together.
On the x86/Windows/Visual C++ platform, /fp:precise is used by default: the intermediate results are kept in binary80 on the x87 unit, and rounding is required before moving them to memory.
However, on the x86-64 platform, the SSE floating-point unit is used, and the intermediate results are represented in the storage precision (binary32 or binary64), so no extra rounding is needed.
One issue with using the extended precision (binary80) is double rounding: the intermediate result gets rounded to the wider precision (80-bit in this case) first, and is then rounded again to the narrower storage precision (64-bit or 32-bit in this case). The combined result can differ from rounding the exact value directly to the storage precision, so double rounding may introduce significant errors. Even worse, bugs of this kind are hard to inspect with printf or a debugger: those tools force the intermediate results to be spilled to memory, which rounds them to the storage precision and may conceal the symptoms.
The evaluation order is not guaranteed, either. Since IEEE 754 floating-point operations are not associative, different evaluation orders can produce different results.
The use of FMA ("contracted" operations) may also change the evaluated result. One example is computing sqrt(a * a - b * b) as sqrt(fma(a, a, -b * b)) when a and b are equal: the fma rounds b * b but keeps a * a exact, which breaks the symmetry between a * a and b * b and thus causes a non-zero result.
Optimizations should be examined carefully as well. For example, can we always perform constant folding on x + 0, replacing it with x? We cannot when x is negative zero: (-0) + (+0) evaluates to +0 in every rounding mode except rounding toward negative infinity, so the folding would flip the sign of zero.
Signed zero arithmetic
Be aware that pow(1, x) == 1 and pow(x, 0) == 1 for any x, even a NaN. This may be counterintuitive since it contradicts the propagation of NaN.
Random ASCII also has a great series of articles on floating-point numbers, which addresses the issues of floating-point arithmetic in more depth.
Muller, Jean-Michel, et al. *Handbook of Floating-Point Arithmetic*. Springer Science & Business Media, 2009. ↩