Random notes on floating-point numbers
Today we (CS 210 Scientific Computing) finish the floating-point chapter. I close this section by answering some questions from my question list. Most of the answers come from the Handbook of Floating-Point Arithmetic^{1}.
Extended precision and double rounding
Two storage precisions (i.e. `binary32` and `binary64`) are used to represent floating-point numbers in memory, and an extended precision format (`binary80`) may be used to represent intermediate results. The actual behavior is determined by the architecture, the operating system (floating-point unit settings), and the compilation options together.
For example, on the x86/Windows/Visual C++ platform, `/fp:precise` is used by default: intermediate results are represented in `binary80` precision, and rounding is required before moving them to memory. On the x86-64 platform, however, the SSE floating-point unit is used, so intermediate results are represented in the storage precisions and no extra rounding is needed.
One issue with using the extended precision (`binary80`) is double rounding: the intermediate result gets rounded in the wider precision first (80 bits in this case), and then rounded again in a narrower precision (64 or 32 bits in this case). Double rounding may introduce significant errors. Even worse, bugs of this kind are hard to inspect with `printf` or a debugger: those tools force the intermediate results to be spilled to memory, which rounds them in the storage precision and may conceal the symptoms.
Evaluation
The evaluation order is not guaranteed, and since IEEE 754 floating-point arithmetic is not associative, different orders can produce different results.
The use of FMA intrinsics (“contracted” operations) may also change the result of an evaluation. One example is computing `sqrt(a * a - b * b)` with `sqrt(fma(a, a, -b * b))` when `a` and `b` are equal: the fused operation breaks the symmetry between `a` and `b`, and thus causes a nonzero (possibly even negative) argument to `sqrt`.
Optimizations should be applied carefully as well. For example, can we always constant-fold `x + 0` to `x`? We cannot when `x` is a negative zero: `(-0) + (+0)` evaluates to `+0` in every rounding mode except round-toward-negative-infinity, so the fold would change the sign of zero.
See also: signed zero arithmetic.
Special arithmetic
Be aware that `pow(1, x) == 1` and `pow(x, 0) == 1` for any `x`, including `NaN`. This may be counterintuitive, since it contradicts the usual propagation of `NaN`.
References
Random ASCII also has a great series of articles on floating-point numbers, which addresses issues of floating-point arithmetic in more depth.

Muller, Jean-Michel, et al. Handbook of Floating-Point Arithmetic. Springer Science & Business Media, 2009. ↩