Floating-point representations

Machine representations of real numbers

We have already discussed how a base-2 positional number system can be used to represent the set of integers (i.e., \(\mathsf{\ldots a_3a_2a_1a_0}\) with \(\mathsf{a_k \in \{ 0, 1 \}}\)). What prevents us from faithfully implementing such a system on a computer is that the number of available bits is limited. Moreover, for practical reasons, it is best if

  • the widths of the various integer types are standardized

  • those widths are all multiples of one byte (8 bits)

This means that we can at best represent integers in the ranges 0, …, 255 (8-bit), 0, …, 65535 (16-bit), or 0, …, 4294967295 (32-bit). With two’s complement, we can represent negative numbers by shifting these ranges to be (almost) symmetric about zero.
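To see these ranges on a particular machine, we can query std::numeric_limits. A minimal sketch (using the fixed-width types from <cstdint> so that the output does not depend on the platform's choice of int width):

#include <iostream>
using std::cout;
using std::endl;

#include <cstdint>
#include <limits>

int main()
{
   // Print the representable range of a few fixed-width integer types.
   // The unary + promotes uint8_t so it prints as a number, not a character.
   cout << "uint8_t : 0 ... " << +std::numeric_limits<std::uint8_t>::max() << endl;
   cout << "uint16_t: 0 ... " << std::numeric_limits<std::uint16_t>::max() << endl;
   cout << "uint32_t: 0 ... " << std::numeric_limits<std::uint32_t>::max() << endl;
   cout << "int32_t : " << std::numeric_limits<std::int32_t>::min()
        << " ... " << std::numeric_limits<std::int32_t>::max() << endl;
   return 0;
}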

But the integers are special. If we count the integers \(\mathsf{i}\) in a bounded range \(\mathsf{a < i < b}\), we find that there are always a finite number of them. The reals don’t share this property. There are an uncountably infinite number of real values \(\mathsf{x}\) in every bounded range \(\mathsf{a < x < b}\). Hence, it is not sufficient to restrict the range of the reals in order to build a finite representation for them. Instead, the usual strategy is to work with rational numbers (\(\mathbb{Q} \subset \mathbb{R}\)) and then limit ourselves to a finite subset of those.

Radix point binary numbers

A complete description of the reals would involve a representation with a potentially infinite number of integer digits (left of the radix point) and a potentially infinite number of fractional digits (right of the radix point).

\[\mathsf{\cdots a_3a_2a_1a_0.a_{-1}a_{-2}a_{-3}\cdots} \]

As usual, the nonnegative subscripts indicate multiples of \(\mathsf{2^0}\), \(\mathsf{2^1}\), \(\mathsf{2^2}\), and so on, and the negative ones multiples of \(\mathsf{2^{-1}}\), \(\mathsf{2^{-2}}\), \(\mathsf{2^{-3}}\), and so on. For example,

\[\begin{aligned} \mathsf{1001}&\mathsf{_2 = 9} \\ \mathsf{1001}&\mathsf{.1_2 = 9.5} \\ \mathsf{1001}&\mathsf{.01_2 = 9.25} \\ \mathsf{1001}&\mathsf{.001_2 = 9.125} \\ \mathsf{1001}&\mathsf{.0001_2 = 9.0625} \\ \mathsf{1001}&\mathsf{.101_2 = 9.625} \\ \mathsf{1001}&\mathsf{.0101_2 = 9.3125} \end{aligned}\]

Fixed-point binary

So what if we make the representation finite by truncating the integer and fractional digits? In that case, the radix point can be eliminated by turning the number into a fraction. Consider the 8-bit binary number with three fractional digits:

\[\mathsf{a_4a_3a_2a_1a_0.a_{-1}a_{-2}a_{-3} = (a_4a_3a_2a_1a_0a_{-1}a_{-2}a_{-3}) \div 2^3} \]

Clearly, this representation spans the region from 0 to 255/8 in increments of 1/8. Even if we use many more digits than 8, we will always run into the problem that extremely large and extremely small numbers can’t be represented.
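As a rough illustration (not a production fixed-point library), such an 8-bit fixed-point value can be stored as an ordinary integer that counts increments of 1/8:

#include <iostream>
using std::cout;
using std::endl;

#include <cstdint>

int main()
{
   // Store 9.625 as a count of 1/8 increments: 9.625 * 8 = 77 = 01001101 in binary,
   // which is just 1001.101 with the radix point shifted three places.
   const std::uint8_t raw = 77;
   const double value = raw / 8.0;   // convert back to a real number
   cout << value << endl;            // prints 9.625
   // The largest representable value is 255/8 = 31.875, in steps of 1/8 = 0.125.
   return 0;
}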

Floating-point numbers

For reliable numerical computation, we need a representation that can work across many orders of magnitude. Floating-point is a finite-precision approximation to the familiar scientific notation:

\[\begin{aligned} \mathsf{475.99} &= \mathsf{4.7599\times 10^2} \\ \mathsf{-0.005333} &= \mathsf{-5.333\times 10^{-3}} \end{aligned}\]

The floating-point representation is a binary version of scientific notation that uses a fixed number of bits to store the significand and the exponent. Formally speaking, this scheme approximates the reals by a subset of the rationals. It manages to cover many orders of magnitude by implementing a logarithmic mesh: i.e., there are an equal number of uniformly spaced values in each interval \(\mathsf{[2^n,2^{n+1})}\).

For example, various intervals are covered as follows.

\(\mathsf{[1,2)}\) in a grid of spacing \(\mathsf{2^{-23}}\) (float) or \(\mathsf{2^{-52}}\) (double)

\(\mathsf{[2,4)}\) in a grid of spacing \(\mathsf{2^{-22}}\) (float) or \(\mathsf{2^{-51}}\) (double)

\(\mathsf{[4,8)}\) in a grid of spacing \(\mathsf{2^{-21}}\) (float) or \(\mathsf{2^{-50}}\) (double)

etc.
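One way to observe this logarithmic mesh directly is to measure the gap between a value and the next representable float above it. A small sketch using std::nextafter:

#include <iostream>
using std::cout;
using std::endl;

#include <cmath>

int main()
{
   // The gap between adjacent floats doubles each time we cross a power of two.
   for (float x = 1.0F; x <= 8.0F; x *= 2.0F)
   {
      const float gap = std::nextafter(x, 2.0F*x) - x;
      cout << "spacing just above " << x << " is " << gap << endl;
   }
   return 0;
}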

A floating-point number is represented as

\[\mathsf{s \times m \times 2^{e-b}} \]

where \(\mathsf{s=\pm 1}\) is the sign, \(\mathsf{m}\) is the significand (or mantissa), \(\mathsf{e}\) the exponent, and \(\mathsf{b}\) the offset (or bias). Each of these parts is itself stored in a fixed-width format.

Organization of the bit patterns

The internal format (i.e., the storage arrangement of the bits) for the 32- and 64-bit floating point types is arbitrary but has been standardized across platforms since the mid-1980s. For the 32-bit type (called float in C++, f32 in Rust, Float32 in Julia) there is a 1-bit sign field, an 8-bit exponent field biased by 127, and a 23-bit fraction field; for the 64-bit type (double) the exponent field is 11 bits wide, offset by 1023, and the fraction field is 52 bits wide.

Consider the real number \(\mathsf{25.5}\), which happens to have an exact floating-point representation. Stored as a float, its exponent field is set to \(\mathsf{131=10000011_2}\); its fraction field has three active bits in the first, fourth, and fifth position.

\[\mathsf{25.5 = 16 \biggl(1 + \frac{1}{2} + \frac{1}{16} + \frac{1}{32} \biggr) = + 2^{131-127}\bigl( 1 + 2^{-1} + 2^{-4} + 2^{-5} \bigr)} \]

It is often convenient to denote the bit pattern in base 16, since it is much more compact (requiring four times fewer digits than binary). In this case, the correct value is 0x41CC0000. (Recall that 0x is the C++ prefix that marks a number as hexadecimal.)

\[\begin{aligned} &\mathtt{s/eeeeeeee/fffffffffffffffffffffff}\\ &\mathtt{\underbrace{0\phantom{/}100}_\mathsf{4} \overbrace{0001}^\mathsf{1} \underbrace{1\phantom{/}100}_\mathsf{C} \overbrace{1100}^\mathsf{C} \underbrace{0000}_\mathsf{0} \overbrace{0000}^\mathsf{0} \underbrace{0000}_\mathsf{0} \overbrace{0000}^\mathsf{0}} \end{aligned}\]
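We can check this bit pattern by copying the float's bytes into a 32-bit unsigned integer and printing it in hexadecimal. A quick sketch (std::memcpy is used to avoid undefined type-punning behaviour):

#include <iostream>
using std::cout;
using std::endl;

#include <cstdint>
#include <cstring>

int main()
{
   const float x = 25.5F;
   std::uint32_t bits;
   std::memcpy(&bits, &x, sizeof bits);                // reinterpret the 4 bytes of the float
   cout << std::hex << std::showbase << bits << endl;  // prints 0x41cc0000
   return 0;
}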

The Institute of Electrical and Electronics Engineers (IEEE) has standardized several floating point formats. The two most common correspond to the C++ types float and double. They differ only in the number of bits used in each field.

The single-precision floating point (float) is represented this way:

\[\begin{aligned} &\mathtt{s/eeeeeeee/fffffffffffffffffffffff}\\ &\mathtt{0\phantom{/}01111111\phantom{/}00000000000000000000000} \end{aligned}\]

It has a 1-bit sign field, an 8-bit exponent field, and a 23-bit fraction field. The exponent \(\mathsf{e \in \{ 0, 1, \ldots, 254 \}}\) is offset by \(\mathsf{b = 127 = 01111111_2}\). The value \(\mathsf{e = 255}\) is reserved for representing infinity and various error states. Hence, the actual exponent range is \(\mathsf{-127 \le e-b \le 127}\). The mantissa is understood to be \(\mathsf{m = 1.f}\), unless \(\mathsf{e = 0}\), in which case the floating-point number is said to be denormalized and is interpreted as \(\mathsf{s \times 0.f \times 2^{1-b}}\). Note that the hidden bit in the mantissa has flipped.

The double-precision floating point (double in C++, f64 in Rust, Float64 in Julia, and float in Python) is identical to the single-precision version except for the larger width of its fields:

\[\begin{aligned} &\mathtt{s/eeeeeeeeeee/ffffffffffffffffffffffffffffffffffffffffffffffffffff}\\ &\mathtt{0\phantom{/}01111111111\phantom{/}0000000000000000000000000000000000000000000000000000} \end{aligned}\]

It has a 1-bit sign field, an 11-bit exponent field, and a 52-bit fraction field. The exponent \(\mathsf{e \in \{ 0,1,\ldots ,2047 \}}\) is offset by \(\mathsf{b = 1023 = 01111111111_2}\), but the value \(\mathsf{e = 2047}\) is reserved. The exponent range is \(\mathsf{-1023 \le e-b \le 1024}\), with \(\mathsf{e-b = 1024}\) reserved and \(\mathsf{e-b = -1023}\) (i.e., \(\mathsf{e = 0}\)) signalling a denormalized value.

Some floating-point examples

The integers closest to zero can all be represented in floating-point format:

\[\begin{aligned} \mathtt{1\phantom{/}10000000\phantom{/}10000000000000000000000} &= \mathsf{ -\,(1+.5)\times 2^{128-127} = -1.5\times 2^1 = -3} \\ \mathtt{1\phantom{/}10000000\phantom{/}00000000000000000000000} &= \mathsf{ -\,(1+.0)\times 2^{128-127} = -1\times 2^1= -2} \\ \mathtt{0\phantom{/}01111111\phantom{/}00000000000000000000000} &= \mathsf{ +\,(1+.0)\times 2^{127-127} = 1\times 2^0 = 1} \\ \mathtt{0\phantom{/}10000000\phantom{/}00000000000000000000000} &= \mathsf{ +\,(1+.0)\times 2^{128-127} = 1\times 2^1= 2} \\ \mathtt{0\phantom{/}10000000\phantom{/}10000000000000000000000} &= \mathsf{ +\,(1+.5)\times 2^{128-127} = 1.5\times 2^1= 3}\\ \mathtt{0\phantom{/}10000001\phantom{/}00000000000000000000000} &= \mathsf{ +\,(1+.0)\times 2^{129-127} = 1\times 2^2= 4}\\ \mathtt{0\phantom{/}10000001\phantom{/}01000000000000000000000} &= \mathsf{ +\,(1+.25)\times 2^{129-127} = 1.25\times 2^2 = 5} \end{aligned}\]

For very large integers, however, there are gaps. The numbers 50331648 and 50331652 are adjacent floats, and there’s no way to represent 50331649, 50331650, or 50331651!

\[\begin{aligned} \mathtt{0\phantom{/}10011000\phantom{/}10000000000000000000000} &= \mathsf{ +\,(1+.5)\times 2^{152-127}} \\ &= \mathsf{1.5 \times 2^{25}}\\ &= \mathsf{50331648}\\ \mathtt{0\phantom{/}10011000\phantom{/}10000000000000000000001} &= \mathsf{ +\,(1+.50000011920929)\times 2^{152-127}}\\ &= \mathsf{50331652} \end{aligned}\]
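A short test confirms the gap. The integer literals below are rounded to the nearest representable float when they are converted; the exact == comparisons are intentional, since we are asking about bit patterns:

#include <iostream>
using std::cout;
using std::endl;

int main()
{
   // 50331648 = 1.5 * 2^25 is exactly representable as a float, but its
   // nearest representable neighbours are 4 apart.
   const float a = 50331649.0F;          // rounds to 50331648
   const float b = 50331651.0F;          // rounds to 50331652
   cout << (a == 50331648.0F) << endl;   // prints 1 (true)
   cout << (b == 50331652.0F) << endl;   // prints 1 (true)
   return 0;
}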

And most real numbers (e.g., \(\mathsf{1.285}\)) cannot be represented exactly either.

\[\begin{aligned} \mathtt{0\phantom{/}01111111\phantom{/}01001000111101011100001} &= \mathsf{+\,(1+.284999967)\times 2^{127-127}}\\ &= \mathsf{+1.28499996662 \approx 1.285} \end{aligned}\]

Sufficiently large or small numbers cannot be represented at all. The normalized values of maximum and minimum magnitude are

\[\begin{aligned} \mathtt{0\phantom{/}11111110\phantom{/}11111111111111111111111} &= \mathsf{+\,(1+.9999999)\times 2^{254-127}}\\ &= \mathsf{+\,1.9999999 \times 2^{127} \approx 3.40\times 10^{38}}\\ \mathtt{0\phantom{/}00000001\phantom{/}00000000000000000000000} &= \mathsf{+\,(1+.0000000)\times 2^{1-127}}\\ &= \mathsf{+\,1.0000000 \times 2^{-126} \approx 1.18\times 10^{-38}} \end{aligned}\]

Denormalized numbers

The mantissa is interpreted as \(\mathsf{1.f}\), unless the exponent field is exactly zero. In that case, there is no implicit leading one in the significand, i.e., \(\mathsf{m = 0.f}\), and the number is said to be denormalized. Such numbers have an effective magnitude below the usual range, but at the cost of lost precision.

The smallest normalized number is

\[\mathtt{0\phantom{/}00000001\phantom{/}00000000000000000000000 = (1.0) \times 2^{1-127} = (1.0) \times 2^{-126}} \]

The largest denormalized number is

\[\begin{aligned} \mathtt{0\phantom{/}00000000\phantom{/}11111111111111111111111} &= \mathsf{(0.11\ldots 1)_2 \times 2^{1-127}}\\ &= \mathsf{0.99\ldots 9 \times 2^{-126}}\\ &\approx \mathsf{1.0 \times 2^{-126}} \end{aligned} \]

Consider the following sequence of successive division by two, where each roman letter represents a binary digit.

\[\begin{aligned} \mathtt{0\phantom{/}00000011\phantom{/}abcdefghijklmnopqrstuvw} &= \mathsf{(1.abcdef\ldots w)_2\times 2^{-124}} \\ \mathtt{0\phantom{/}00000010\phantom{/}abcdefghijklmnopqrstuvw} &= \mathsf{(1.abcdef\ldots w)_2\times 2^{-125}} \\ \mathtt{0\phantom{/}00000001\phantom{/}abcdefghijklmnopqrstuvw} &= \mathsf{(1.abcdef\ldots w)_2\times 2^{-126}} \\ \mathtt{0\phantom{/}00000000\phantom{/}abcdefghijklmnopqrstuvw} &= \mathsf{(0.1abcde\ldots v)_2\times 2^{-126}} \\ \mathtt{0\phantom{/}00000000\phantom{/}0abcdefghijklmnopqrstuv} &= \mathsf{(0.01abcd\ldots u)_2\times 2^{-126}} \\ \mathtt{0\phantom{/}00000000\phantom{/}00abcdefghijklmnopqrstu} &= \mathsf{(0.001abc\ldots t)_2\times 2^{-126}} \\ \mathtt{0\phantom{/}00000000\phantom{/}000abcdefghijklmnopqrst} &= \mathsf{(0.0001ab\ldots s)_2\times 2^{-126}} \\ \mathtt{0\phantom{/}00000000\phantom{/}0000abcdefghijklmnopqrs} &= \mathsf{(0.00001a\ldots r)_2\times 2^{-126}} \end{aligned} \]

Eventually, all digits of precision are lost and the division underflows to zero.

\[\begin{aligned} \mathtt{0\phantom{/}00000000\phantom{/}0000000000000000000001a} &= \mathsf{(0.00000\ldots 01a)_2\times 2^{-126}}\\ \mathtt{0\phantom{/}00000000\phantom{/}00000000000000000000001} &= \mathsf{(0.00000\ldots 001)_2\times 2^{-126}}\\ \mathtt{0\phantom{/}00000000\phantom{/}00000000000000000000000} &= \mathsf{0} \end{aligned} \]
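This gradual loss of precision and eventual underflow can be watched at run time by repeatedly halving a float until it reaches zero. A minimal sketch:

#include <iostream>
using std::cout;
using std::endl;

int main()
{
   float x = 1.0F;
   int steps = 0;
   while (x > 0.0F)   // the halving eventually underflows to exactly zero
   {
      x *= 0.5F;
      ++steps;
   }
   cout << "underflowed to zero after " << steps << " halvings" << endl;
   // typically 150 for IEEE single precision: 126 normalized steps,
   // 23 denormalized steps, and one final rounding to zero
   return 0;
}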

Special floating-point values

The floating-point representation includes both positive and negative values of zero.

\[\begin{aligned} \mathtt{0\phantom{/}00000000\phantom{/}00000000000000000000000} &= \mathsf{+0}\\ \mathtt{1\phantom{/}00000000\phantom{/}00000000000000000000000} &= \mathsf{-0} \end{aligned} \]

It excludes numbers with the highest possible value of the exponent, which are reserved for the special values inf, -inf, and nan.

\[\begin{aligned} \mathtt{0\phantom{/}11111111\phantom{/}00000000000000000000000} &= \mathsf{+{\tt inf}}\\ \mathtt{1\phantom{/}11111111\phantom{/}00000000000000000000000} &= \mathsf{-{\tt inf}}\\ \mathtt{0\phantom{/}11111111\phantom{/}xxxxxxxxxxxxxxxxxxxxxxx} &= {\tt nan}\\ \mathtt{1\phantom{/}11111111\phantom{/}xxxxxxxxxxxxxxxxxxxxxxx} &= {\tt nan} \end{aligned} \]

In the above, at least one of the bits marked x is nonzero.

One of the two inf values arises when the result of an arithmetic calculation is too large in magnitude to be represented. Similarly, a calculation whose result is too small in magnitude yields a zero. In either case, the sign information is preserved. If a calculation has no numerical sense (e.g., sqrt(-2.0) or 0/0), then the result will likely be a nan.

These special values have well-defined arithmetic properties:

\[\begin{aligned} \mathsf{13/0} &= {\tt inf} \\ \mathsf{-4/0} &= -{\tt inf} \\ \mathsf{{\tt inf} + 7} & = {\tt inf} \\ \mathsf{{\tt inf} - 99.3} & = {\tt inf} \\ \mathsf{{\tt inf} \times 15.6} & = {\tt inf} \\ \mathsf{{\tt inf} \times (-3)} & = -{\tt inf} \\ \mathsf{{\tt inf} \div 2} & = {\tt inf} \\ \mathsf{{\tt inf} + {\tt inf}} & = {\tt inf} \\ \mathsf{{\tt inf} \times {\tt inf}} & = {\tt inf} \\ \mathsf{0 / 0} &= {\tt nan} \\ \mathsf{{\tt inf} \div {\tt inf}} & = {\tt nan} \\ \mathsf{{\tt inf} - {\tt inf}} & = {\tt nan} \\ \mathsf{{\tt nan} + 7} & = {\tt nan} \\ \mathsf{{\tt nan} \times (-3)} & = {\tt nan} \\ \end{aligned} \]
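These rules can be checked at run time. The sketch below generates inf and nan and classifies them with std::isinf and std::isnan from <cmath>; note that nan compares unequal even to itself:

#include <iostream>
using std::cout;
using std::endl;

#include <cmath>

int main()
{
   const double zero = 0.0;
   const double big = 13.0/zero;        // a finite value divided by zero: +inf
   const double bad = big - big;        // indeterminate form: nan
   cout << big << " " << bad << endl;   // typically prints "inf nan"
   cout << std::isinf(big) << " "       // 1
        << std::isnan(bad) << " "       // 1
        << (bad == bad) << endl;        // 0: nan is not equal to anything, even itself
   return 0;
}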

Representation errors

Floating-point numbers have a few obvious drawbacks:

  • most real numbers cannot be represented exactly;

  • small representation errors can accumulate systematically during arithmetic operations.

Let’s write a program that (in a quite naive way) attempts to range over the values −3, −2.9, −2.8, …, 2.8, 2.9 using a for loop that increments a float variable.

#include <iostream>
using std::cout;
using std::endl;

#include <iomanip>
using std::setw;

int main()
{
   cout.precision(10);
   int n = -30;
   // Naive loop: accumulate rounding error by repeatedly adding 0.1 to a float.
   // The second column shows the intended value 0.1*n for comparison.
   for (float x = -3.0; x < 3.0; x += 0.1, n += 1)
      cout << setw(16) << x << setw(6) << 0.1*n << endl;
   return 0;
}

As the output of this program makes clear, repeatedly adding 0.1 to -3.0 does not give the desired result. Each value is slightly off, and the comparison (x < 3.0) fails to cut off the sequence at 2.9.

              -3    -3
    -2.900000095  -2.9
    -2.800000191  -2.8
    -2.700000286  -2.7
    -2.600000381  -2.6

    -0.400000602  -0.4
    -0.300000608  -0.3
   -0.2000006139  -0.2
   -0.1000006124  -0.1
-6.124377023e-07     0
   0.09999939054   0.1
     0.199999392   0.2
    0.2999993861   0.3
    0.3999993801   0.4

     2.699999094   2.7
     2.799998999   2.8
     2.899998903   2.9
     2.999998808     3

Why does this happen? Although -3.0 can be exactly represented, 0.1 cannot. The closest we can get is 0.1000000015, which is represented as follows.

\[\begin{aligned} &\mathtt{s/eeeeeeee/fffffffffffffffffffffff}\\ &\mathtt{0\phantom{/}01111011\ 10011001100110011001101} \end{aligned}\]

The exact number is a nonterminating fraction: \(\mathsf{0.1} = \mathsf{1.6/16} = \mathsf{1.6} \times \mathsf{2^{-4}} = \mathsf{(1.10011001100\overline{1100}\ldots)_2} \times \mathsf{2^{123-127}}\), with the \(\mathsf{1100}\) sequence repeating indefinitely. This can be shoehorned into the 23-bit fraction field either by rounding (first line below, with the rounded-up final bit in red; banker’s rounding is often used) or by truncation (second line, with the trailing bits that do not fit shown in green):

\[\begin{aligned} &\mathtt{s/eeeeeeee/fffffffffffffffffffffff}\\ &\mathtt{0\phantom{/}01111011\phantom{/}1001100110011001100110{\color{red}1}}\\ &\mathtt{0\phantom{/}01111011\phantom{/}1001100110011001100110{\color{green}01100\cdots}}\\ \end{aligned}\]

Let’s look at what happens when we add 0.1 to −3.0 within the floating-point scheme. The two representations are

\[\begin{aligned} &\mathtt{1~10000000~10000000000000000000000~(-3.0F)}\\ &\mathtt{0~01111011~10011001100110011001101~(+0.1F)} \end{aligned}\]

The fractional parts can only be added once the exponent fields have been made equal. Let’s show the mantissa’s hidden bit

\[\begin{aligned} &\mathtt{s/eeeeeeee/{\color{blue}1.}fffffffffffffffffffffff}\\ &\mathtt{1\phantom{/}10000000\phantom{/}{\color{blue}1.}10000000000000000000000~(-3.0F)}\\ &\mathtt{0\phantom{/}01111011\phantom{/}{\color{blue}1.}10011001100110011001101~(+0.1F)} \end{aligned}\]

and systematically shift the smaller magnitude number by factors of 2 until its exponent agrees with that of the larger:

\[\begin{aligned} &\mathtt{0\phantom{/}01111011\phantom{/}1.10011001100110011001101}\\ &\mathtt{0\phantom{/}01111100\phantom{/}0.110011001100110011001101}\\ &\mathtt{0\phantom{/}01111101\phantom{/}0.0110011001100110011001101}\\ &\mathtt{0\phantom{/}01111110\phantom{/}0.00110011001100110011001101}\\ &\mathtt{0\phantom{/}01111111\phantom{/}0.000110011001100110011001101}\\ &\mathtt{0\phantom{/}10000000\phantom{/}0.0000110011001100110011001101}\\ \end{aligned}\]

At this point, we can carry out conventional binary subtraction \(-(\mathsf{3.0}-\mathsf{0.1})\)

1 10000000 1.10000000000000000000000
0 10000000 0.00001100110011001100110
------------------------------------
1 10000000 1.01111000000000000000000
0 10000000 0.00000100110011001100110
------------------------------------
1 10000000 1.01110100000000000000000
0 10000000 0.00000000110011001100110
------------------------------------
1 10000000 1.01110011100000000000000
0 10000000 0.00000000010011001100110
------------------------------------
1 10000000 1.01110011010000000000000
0 10000000 0.00000000000011001100110
------------------------------------
1 10000000 1.01110011001110000000000
0 10000000 0.00000000000001001100110
------------------------------------
1 10000000 1.01110011001101000000000
0 10000000 0.00000000000000001100110
------------------------------------
1 10000000 1.01110011001100111000000
0 10000000 0.00000000000000000100110
------------------------------------
1 10000000 1.01110011001100110100000
0 10000000 0.00000000000000000000110
------------------------------------
1 10000000 1.01110011001100110011100
0 10000000 0.00000000000000000000010
------------------------------------
1 10000000 1.01110011001100110011010
0 10000000 0.00000000000000000000000
------------------------------------
1 10000000 1.01110011001100110011010

The final result

\[\begin{aligned} \mathtt{s}/\mathtt{eeeeeeee}/\mathtt{fffffffffffffffffffffff}\\ \mathtt{1}\phantom{/}\mathtt{10000000}\phantom{/}\mathtt{01110011001100110011010} \end{aligned}\]

corresponds to the decimal number \(\mathsf{-2.9000000954}\). Beware of putting too much faith in the final few decimal digits reported by cout if your display precision exceeds the number of significant digits. For example, if we set cout.precision(11), we find that the trailing decimal digits are inconsistent with the exact sum of the two stored values:

-3
+0.10000000149
--------------
-2.89999999851 != -2.9000000954

Some practical advice: suppose you want to range over an evenly spaced set of floating-point values, e.g., across an interval \([\mathsf{a,b}]\) in \(\mathsf{N}\) equal steps.

const double a = ...;
const double b = ...;
assert(a < b);
const int N = ...;
assert(N > 0);
const double width = b-a;
const double dx = width/N;

Then it’s best to replace loops like this

for (double x = a; x <= b; x += dx)
   foo(x);

with ones based on integer counting:

for (int n = 0; n <= N; ++n)
{
   const double x_n = a + dx*n;
   foo(x_n);
}

Recomputing x_n afresh from a, dx, and n at each iteration prevents the rounding errors from accumulating. And using an integer counter gives you greater control over how you treat the endpoints of the interval.

Numerical error

Floating point numbers are a peculiar finite subset of the rationals, designed to span many orders of magnitude and to have a consistent number of values in each factor-two interval. Since the exponent field is finite, there are minimum and maximum values that can be represented.

type      largest number    smallest positive number    machine epsilon
float     3.40E+38          1.4E-45                     1.19E-7
double    1.80E+308         4.9E-324                    2.22E-16

This means that any number represented as a float that is greater than \(\mathsf{3.4}\times \mathsf{10^{38}}\) overflows to inf; any that is smaller in magnitude than about \(\mathsf{1.4}\times \mathsf{10^{-45}}\) (the smallest denormalized value) underflows to 0. Within the representable range, an arbitrary real number \(\mathsf{x} = \hat{\mathsf{x}} + \delta_\mathsf{x}\) will be closest to some floating point value \(\hat{\mathsf{x}}\) but a distance \(\delta_\mathsf{x}\) away. It’s easy to show that the relative error \(\epsilon_\mathsf{x} = \delta_\mathsf{x}/\mathsf{x}\) is bounded by the machine epsilon \(\epsilon_\textsf{m}\), defined to be the smallest value such that \(1 + \epsilon_\textsf{m} > 1\). (That is, the bounds on the absolute error \(\delta_\mathsf{x}\) scale with \(|\mathsf{x}|\).)

In the case of a float, \(\mathsf{1} = \mathsf{+2^{127-127}} \times \mathsf{(1+0.0\ldots 0)}\). The next highest number has an identical bit pattern except that the least significant bit is on.

\[\begin{aligned} \mathtt{0/01111111/00000000000000000000000} &= \mathsf{1}\\ \mathtt{0/01111111/00000000000000000000001} &= \mathsf{1} + \mathsf{\epsilon}_\textsf{m} \end{aligned}\]

The difference corresponds to \(\epsilon_\textsf{m} = \mathsf{2^{-23}} = \mathsf{1.19}\times \mathsf{10^{-7}}\). Hence, the boolean test 1.0F + 1E-9F == 1.0F evaluates to true (whereas 1.0 + 1E-9 == 1.0 evaluates to false, because \(\epsilon_\textsf{m} = \mathsf{2^{-52}} = \mathsf{2.22} \times \mathsf{10^{-16}}\) for doubles).
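The sketch below confirms these values with std::numeric_limits<T>::epsilon(), which reports the gap between 1 and the next representable value (the same \(\mathsf{2^{-23}}\) and \(\mathsf{2^{-52}}\) quoted above):

#include <iostream>
using std::cout;
using std::endl;

#include <limits>

int main()
{
   cout << std::numeric_limits<float>::epsilon() << endl;    // 1.19209e-07 = 2^-23
   cout << std::numeric_limits<double>::epsilon() << endl;   // 2.22045e-16 = 2^-52
   cout << (1.0F + 1E-9F == 1.0F) << endl;   // 1: the addend is lost entirely
   cout << (1.0 + 1E-9 == 1.0) << endl;      // 0: a double can resolve it
   return 0;
}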

Such representation errors are only a small part of what you need to worry about. There are many other sources of error.

  • Rounding and truncation errors occur during conversions between types

  • Errors are introduced when series expansions are cut off: e.g., cos(0.5*M_PI) != 0.0, even though \(\mathsf{\cos(\pi/2)}\) is zero; such behaviour is common to all transcendental functions in the cmath library

  • Addition of floating point values having very different orders of magnitude may lead to (mild) loss of significance

  • Subtraction of nearly equal values may lead to (severe) loss of significance

  • Multiplication can lead to underflow or overflow if the exponents of the two operands are both large or both small

  • Division is never entirely safe, since verifying that the divisor is not zero does not guarantee that a division will not overflow and yield infinity

It is also important to understand that some of the familiar algebraic properties of the real numbers no longer hold. Floating-point addition and multiplication are both commutative (\(\mathsf{a+b} = \mathsf{b+a}\) and \(\mathsf{a\times b} = \mathsf{b\times a}\)), but they are not necessarily associative (the sum \(\mathsf{(a+b)+c}\) may differ from \(\mathsf{a+(b+c)}\)) or distributive (\(\mathsf{(a+b)\times c}\) may not be the same as \(\mathsf{ac+bc}\)). A perhaps clearer statement is that floating-point arithmetic operations are unambiguously defined pairwise, but their order of operation matters when more than two operands are involved.
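Here is a small demonstration of the failure of associativity, assuming ordinary IEEE double arithmetic; the two groupings of the same three operands give different answers:

#include <iostream>
using std::cout;
using std::endl;

int main()
{
   cout.precision(17);
   const double a = 1.0E+16;
   const double b = -1.0E+16;
   const double c = 1.0;
   cout << (a + b) + c << endl;   // 1: the large terms cancel first
   cout << a + (b + c) << endl;   // 0: c is absorbed into b before the cancellation
   return 0;
}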

Loss of significance

Of the many ways to lose precision during routine arithmetic operations, subtraction is the most worrisome. For example, consider the difference \(\mathsf{1.7890625-1.783203125}\), assuming that we have a floating point representation with 10 digits of precision:

 1.110010100
-1.110010001
------------
 0.000000011

Since the eight highest bits cancel, the result, whose decimal value is \(\mathsf{1.5}\times \mathsf{2^{-8}} = \mathsf{0.005859375}\), has only two significant binary digits. The loss of precision here is catastrophic (from 10 digits to 2). This is not necessarily a problem. If both \(\mathsf{1.7890625}\) and \(\mathsf{1.783203125}\) are exact, then so is the difference. More likely, however, these numbers are floating-point approximations with their own representation error, or worse, the results of arithmetic calculations that include other sources of error. Even assuming the best-case scenario (representation error only), we can see how disastrous the situation is.

If only the least significant bit is in doubt, then the real numbers \(\mathsf{x} = \mathsf{1.110010100_2} + \mathsf{\delta_x}\) and \(\mathsf{y = 1.110010001_2} + \mathsf{\delta_y}\) lie in the ranges \(\mathsf{1.110010011} < \mathsf{x} < \mathsf{1.110010101}\) and \(\mathsf{1.110010000} < \mathsf{y} < \mathsf{1.110010010}\). This means that the true difference \(\mathsf{x-y}\) might be as small as

 1.110010011
-1.110010010
 ------------
 0.000000001

or as large as

 1.110010101
-1.110010000
 ------------
 0.000000101

This puts it in the range \(\mathsf{2^{-9}} < \mathsf{x-y} < \mathsf{1.25}\times \mathsf{2^{-7}}\). Note that an uncertainty of \(\mathsf{2^{-9}}\) (a relative error of roughly \(\mathsf{0.1\%}\)) on each of \(\mathsf{x}\) and \(\mathsf{y}\) becomes

\[\epsilon_{\mathsf{x}-\mathsf{y}} = \mathsf{\frac{\delta_{x-y}}{x-y}} = \mathsf{\frac{1.25 \times 2^{-7} - 2^{-9}}{1.5 \times 2^{-8}}} = \mathsf{133\%}\]

on their difference. A relative error greater than 100% means that we know almost nothing about the value.
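Cancellation of this kind shows up in everyday formulas. As an illustration (in single precision, to make the effect obvious), evaluating \(\mathsf{(1-\cos x)/x^2}\) for small \(\mathsf{x}\) should give a value near \(\mathsf{1/2}\), but the subtraction \(\mathsf{1-\cos x}\) wipes out the significant digits; the algebraically equivalent form \(\mathsf{2\sin^2(x/2)/x^2}\) avoids the dangerous subtraction:

#include <iostream>
using std::cout;
using std::endl;

#include <cmath>

int main()
{
   const float x = 1.0E-4F;
   // Naive form: cos(x) rounds to exactly 1.0F, so the numerator cancels to zero.
   const float naive = (1.0F - std::cos(x))/(x*x);
   // Rewritten form: no subtraction of nearly equal values.
   const float s = std::sin(0.5F*x);
   const float safe = 2.0F*s*s/(x*x);
   cout << naive << " " << safe << endl;   // prints something like "0 0.5"
   return 0;
}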

Addition of many numbers is another important source of error. Suppose that we want to sum the numbers (2-digit decimal floating point with rounding) \(\mathsf{1.0}, \mathsf{0.01}, \mathsf{0.01}, \ldots, \mathsf{0.01}\), where the value \(\mathsf{0.01}\) occurs 99 times. If we accumulate the numbers from left to right in the list, we simply execute the addition \(\mathsf{1.0} + \mathsf{0.01} = \mathsf{1.0}\) (rounded to two digits) 99 times in a row, and the result is \(\mathsf{1.0}\). On the other hand, if we accumulate in the reverse order, the small numbers add as follows.

\[\begin{aligned} \mathsf{0.01 + 0.01} &\Rightarrow \mathsf{0.02} \\ \mathsf{0.02 + 0.01} &\Rightarrow \mathsf{0.03} \\ \mathsf{0.03 + 0.01} &\Rightarrow \mathsf{0.04} \\ &\,\,\,\vdots \\ \mathsf{0.98 + 0.01} &\Rightarrow \mathsf{0.99} \end{aligned}\]

After the final addition, \(\mathsf{0.99} + \mathsf{1.0} \Rightarrow \mathsf{2.0}\), the result is quite different (and considerably closer to the true value \(\mathsf{1.99}\)). Why the discrepancy? Of course, it has nothing to do with whether we add left-to-right or right-to-left. Rather, in the first case, we were adding in descending order and in the second, ascending. As a general rule, adding in ascending order better preserves significance.
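The same effect is easy to reproduce in binary. In the sketch below, the addend 5.0E-8F is chosen to sit just under half an ulp of 1.0F, so an accumulator that starts at the large value never grows, whereas summing the small values first preserves their total:

#include <iostream>
using std::cout;
using std::endl;

int main()
{
   const int N = 1000000;
   const float tiny = 5.0E-8F;   // less than half an ulp of 1.0F

   float descending = 1.0F;      // large value first
   for (int n = 0; n < N; ++n)
      descending += tiny;        // every addition rounds back to 1.0F

   float ascending = 0.0F;       // small values first
   for (int n = 0; n < N; ++n)
      ascending += tiny;
   ascending += 1.0F;            // the true sum is 1.05

   cout << descending << " " << ascending << endl;   // roughly "1 1.05"
   return 0;
}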

Given a set of floating-point values to be added, the safest procedure is to store the values in an array, sort the elements by absolute value in ascending order, and then accumulate the values from lowest to highest array index. This procedure may require considerable memory resources and a lot of computation. Is it really worth the trouble? Only sometimes.

For the loss of significance to really matter,

  • the floating-point values being summed must cover a very wide range of orders of magnitude (rivalling the number of digits in the significand) and

  • the smallest values must appear in great enough numbers.

A slightly more formal statement: if we sum a sequence of values \((\mathsf{x_n})\) within the floating-point scheme, the accumulated error will depend on the behaviour at the low and high end of the distribution \(\mathsf{p(x)} = \sum_\mathsf{n} \delta(\mathsf{x-x_n})\).

Overflow

Consider how we might calculate the norm of the 2-vector \((\mathsf{x,y})\) or the complex number \(\mathsf{x+iy}\). The most natural thing to write is sqrt(x*x+y*y), but why not use an expression based on any of these mathematically equivalent forms?

\[\sqrt{\mathsf{x^2 + y^2}} = |\mathsf{x}|\sqrt{\mathsf{1+(y/x)^2}} = |\mathsf{y}|\sqrt{\mathsf{1+(x/y)^2}}\]

In fact, the natural expression is not the best choice. If \(\mathsf{x}\) or \(\mathsf{y}\) is large enough, then squaring it will yield inf. A better approach is to factor out the component of largest magnitude.

#include <cmath>   // std::fabs, std::sqrt

// Compute sqrt(x*x + y*y) without overflowing in the intermediate squares.
double norm(double x, double y)
{
   const double xx = std::fabs(x);
   const double yy = std::fabs(y);
   if (xx == 0.0 && yy == 0.0)
      return 0.0;                   // avoid the 0/0 = nan below
   if (xx > yy)
   {
      const double r = yy/xx;       // |r| <= 1, so r*r cannot overflow
      return xx*std::sqrt(1.0+r*r);
   }
   else
   {
      const double r = xx/yy;
      return yy*std::sqrt(1.0+r*r);
   }
}
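As a quick illustration of the difference (a hypothetical test, assuming the norm function above is in scope), squaring x = 1.0E+200 overflows the double range in the naive formula but not in the rescaled one:

#include <iostream>
using std::cout;
using std::endl;

#include <cmath>

double norm(double x, double y);   // defined above

int main()
{
   const double x = 1.0E+200;
   const double y = 1.0;
   cout << std::sqrt(x*x + y*y) << endl;   // inf: x*x has already overflowed
   cout << norm(x, y) << endl;             // 1e+200
   return 0;
}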

It’s always worth giving some thought to how your algorithm will behave in the extreme cases.

Testing for equality

Remember that the operator == tests whether two floating point values are exactly equal: i.e., whether they have identical bit patterns. Sometimes, this is the test you want to perform. More often, you are interested in whether the two numbers are equal within a range of uncertainty. In other words, you want a weaker test that tolerates discrepancy in some number of the least significant bits. For example, these two numbers might be judged to be equal:

\[\begin{aligned} \mathsf{0~10000110~10011011101110111111001}\\ \mathsf{0~10000110~10011011101110101101101} \end{aligned}\]

A common idiom for performing such a test is as follows. Two floating-point numbers x and y are sufficiently close if

fabs(x-y) < fabs(x)*eps and fabs(x-y) < fabs(y)*eps

evaluates to true. Depending on the circumstances, eps may be the machine epsilon or something more forgiving.
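Packaged as a function, the idiom might look like the following sketch; the name close_enough and the default tolerance are arbitrary choices, not a standard API:

#include <cmath>

// True if x and y agree to within a relative tolerance eps.
// Requiring both conditions keeps the test symmetric in x and y.
// Note that an exact zero never tests equal to anything under this criterion.
bool close_enough(double x, double y, double eps = 1.0E-12)
{
   return std::fabs(x - y) < eps*std::fabs(x) &&
          std::fabs(x - y) < eps*std::fabs(y);
}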

Floating-point operations

The floating point value \(\mathsf{\hat{x}} = \mathsf{(-1)^s}\times \mathsf{(1.f)_2} \times \mathsf{2^{e-b}}\) is represented in terms of bit strings of finite length that encode the sign (\(\mathsf{s}\)), fraction (\(\mathsf{f}\)), and exponent (\(\mathsf{e}\)). Since the binary value of \(\mathsf{e}\) is nonnegative, a bias \(\mathsf{b} = \mathsf{011}\ldots \mathsf{1_2}\) is used to make the exponent range symmetric about zero. The maximum value of \(\mathsf{e}\) (all bits on) is reserved for the inf, -inf, and NaN values. The minimum value of \(\mathsf{e}\) (all bits off) signals that the number is denormalized. In that case, the hidden bit in the significand is zero rather than one: \(\hat{\mathsf{x}} = \mathsf{(-1)^s(0.f)_2} \times \mathsf{2^{1-b}}\).

The significand takes values in the range \(\mathsf{1} \le \mathsf{(1.f)_2} < \mathsf{2}\), unless the floating-point value is denormalized, in which case \(\mathsf{0} \le \mathsf{(0.f)_2} < \mathsf{1}\). These ranges are imposed by our definition of the floating-point format. Where they are not respected, the significand should be rescaled down or up by factors of 2 and the value of the exponent correspondingly increased or decreased. (This is completely analogous to the base-10 case, in which we rewrite \(\mathsf{0.0567} \times \mathsf{10^{-7}}\) as \(\mathsf{5.67} \times \mathsf{10^{-9}}\) in order to conform to the conventions of scientific notation.)

We would like to understand how arithmetic operations (\(+,-,\times,\div\)) on floating-point numbers are carried out. Consider two floating-point numbers \(\hat{\mathsf{x}}\) and \(\hat{\mathsf{y}}\).

\[\begin{aligned} \hat{\mathsf{x}} &= \mathsf{(-1)^{s_x}} \times \mathsf{(1.f_x)_2 \times 2^{e_x - b}}\\ \hat{\mathsf{y}} &= \mathsf{(-1)^{s_y}} \times \mathsf{(1.f_y)_2 \times 2^{e_y - b}} \end{aligned}\]

Their product is rather simple, since the exponents just add.

\[\hat{\mathsf{x}}\hat{\mathsf{y}} = \mathsf{(-1)^{s_x+s_y}} \times \mathsf{(1.f_x)_2(1.f_y)_2} \times \mathsf{2^{(e_x + e_y - b) - b}}\]

If we want to interpret the result itself as a floating-point value \(\hat{\mathsf{z}} = \hat{\mathsf{x}}\hat{\mathsf{y}}\), then we have to check whether the significand sits in the required range. The sign bit is the sum modulo two, \(\mathsf{s_z} = \mathsf{s_x+s_y \pmod{2}}\); the exponent field is either \(\mathsf{e_z} = \mathsf{e_x+e_y - b}\) or \(\mathsf{e_z = e_x+e_y - b+1}\) depending on whether \(\mathsf{1} \le \mathsf{(1.f_x)_2(1.f_y)_2} < \mathsf{2}\) or \(\mathsf{2} \le \mathsf{(1.f_x)_2(1.f_y)_2} < \mathsf{4}\); the significand of the result is interpreted as follows:

\[\mathsf{(1.f_x)_2(1.f_y)_2} = \begin{cases} \mathsf{(1.f_z)_2} & \textsf{if $\mathsf{1 \le (1.f_x)_2(1.f_y)_2 < 2}$}\\ \mathsf{2(1.f_z)_2} & \textsf{if $\mathsf{2 \le (1.f_x)_2(1.f_y)_2 < 4}$} \end{cases}\]

The multiplication of the two significands should ideally take place in a register wider than the \(\mathsf{f_x}\) and \(\mathsf{f_y}\) bit fields and the result should only be reduced to the correct width at the end of the calculation (by rounding rather than truncating). On a typical computer, such calculations are performed in a dedicated floating-point unit (FPU). Let’s work out two detailed examples: the following multiplications, using four bits for \(\mathsf{f_x}\) and \(\mathsf{f_y}\) and eight bits for the intermediate calculations, yield \(\mathsf{f_z} = \mathsf{1101}\) and \(\mathsf{f_z} = \mathsf{1000}\).

  1.1010 (1.fx)
× 1.0010 (1.fy)
--------
  0.00110100
+ 1.10100000
------------
  1.11010100
------------
  1.1101


  1.1010 (1.fx)
× 1.1101 (1.fy)
--------
  0.00011010
  0.01101000
  0.11010000
+ 1.10100000
------------
 10.11110010
-------------
  1.011110010 × 10
-------------
  1.1000 × 10

Of course, we have ignored many possibilities that should be accounted for in a real implementation: one or both of \(\hat{\mathsf{x}}\) and \(\hat{\mathsf{y}}\) may be denormalized; the exponent term \(\mathsf{e_z}\) may overflow or underflow.

Let’s now think about the addition operation. The sum of \(\hat{\mathsf{x}}\) and \(\hat{\mathsf{y}}\) is given by

\[\begin{aligned} \hat{\mathsf{x}} + \hat{\mathsf{y}} &= \mathsf{(-1)^{s_x}(1.f_x)_2} \times \mathsf{2^{e_x-b}} + \mathsf{(-1)^{s_y}(1.f_y)_2} \times \mathsf{2^{e_y-b}}\\ &= \mathsf{(-1)^{s_x}} \times \mathsf{2^{e_x-b}} \Bigl[\mathsf{(1.f_x)_2 + (-1)^{s_x+s_y}(1.f_y)_2} \times \mathsf{2^{e_y-e_x}} \Bigr] \end{aligned}\]

In the second line above, we’ve reorganized the result under the assumption that the magnitude of \(\hat{\mathsf{x}}\) is equal to or larger than that of \(\hat{\mathsf{y}}\) (no loss of generality). What we’ve done is factor out the sign of \(\hat{\mathsf{x}}\) and its power of two. The part in square brackets determines the significand and exponent shift of the floating-point value \(\hat{\mathsf{z}} = \hat{\mathsf{x}}+\hat{\mathsf{y}}\). Its evaluation is quite complicated, since there are five cases to consider (more, if you take account of denormalized values, overflow, and underflow).

The corresponding exponents are \(\mathsf{e_z} = \mathsf{e_x}\), \(\mathsf{e_x+1}\), \(\mathsf{e_x-1}\), \(\mathsf{e_x+1}\), \(\mathsf{e_x-p}\). The last case is the most dangerous one: if \(\mathsf{x}\) and \(\mathsf{y}\) are the same order of magnitude but opposite sign and if the \(\mathsf{p}\) highest bits of \(\mathsf{1.f_x}\) and \(\mathsf{1.f_y}\) agree (\(\mathsf{1} \le \mathsf{p} \le \textsf{width}(\mathsf{f})\)), then \(\mathsf{p}\) digits of significance are lost. The first case is also problematic: \(\mathsf{e_x-e_y}\) digits of \(\hat{\mathsf{y}}\) are truncated, which can be serious if \(\mathsf{e_x-e_y}\) is comparable to the number of digits in the fraction field; if \(\mathsf{e_x-e_y}\) exceeds the width of \(\mathsf{f}\), then \(\mathsf{\hat{x}+\hat{y}}\) and \(\hat{\mathsf{x}}\) are identical.

case 1: \(\mathsf{e_y - e_x} = \mathsf{-3}\), \(\mathsf{f_z} = \mathsf{1000}\), \(\mathsf{e_z} = \mathsf{e_x}\)

   1.0101
+ (1.1001 / 1000)
-----------------
   1.01010000
 + 0.00110010
-------------
   1.10000010
-------------
   1.1000
     ffff

case 2: \(\mathsf{e_y-e_x} = \mathsf{-2}\), \(\mathsf{f_z} = \mathsf{0001}\), \(\mathsf{e_z} = \mathsf{e_x+1}\)

   1.1101
+ (1.011 / 100)
---------------
   1.11010000
 + 0.01011000
-------------
  10.00101000
-------------
   1.000101000 × 10
-------------------
   1.0001
     ffff

case 5: \(\mathsf{e_y-e_x} = \mathsf{0}\), \(\mathsf{f_z} = \mathsf{0000}\), \(\mathsf{e_z} = \mathsf{e_x-4}\)

  1.0111
− 1.0110
--------
  1.01110000
− 1.01100000
------------
  0.00010000
------------
  1.00000000 / 10000
--------------------
  1.0000
    ffff

In the first two examples above, some accuracy is lost when the final result is reduced to four fraction bits, but in both cases the 5 remaining digits are all significant. In the last example, only the first digit of the result is significant.

Propagation of errors

Floating-point numbers are an approximation to the reals. We can express the discrepancy in either absolute or relative terms: \(\mathsf{x} = \hat{\mathsf{x}} + \delta_\mathsf{x} = \hat{\mathsf{x}}(\mathsf{1+\epsilon_x})\). The relative error is related to the concept of “number of digits of accuracy.” Roughly speaking, the number of binary digits that are meaningful is \(-\textsf{log}_2(|\epsilon_\mathsf{x}|)\).

To first order (i.e., neglecting terms of \(\mathsf{O}(\epsilon^\mathsf{2})\)), we find that under multiplication, the relative errors add (\(\epsilon_{\mathsf{xy}} = \epsilon_\mathsf{x}+\epsilon_\mathsf{y}\)):

\[\mathsf{x(1+\epsilon_x)} \times \mathsf{y(1+\epsilon_y)} = \mathsf{xy(1+\epsilon_x)(1+\epsilon_y)} = \mathsf{xy(1+\epsilon_x+\epsilon_y)}\]

Under addition, it is the absolute errors that add (\(\mathsf{\delta_{x+y}} = \mathsf{\delta_x} + \mathsf{\delta_y}\)):

\[\mathsf{(x + \delta_x)} + \mathsf{(y + \delta_y)} = \mathsf{(x + y)} + \mathsf{(\delta_x + \delta_y)} = \mathsf{(x + y)}\Biggl(\mathsf{1 + \frac{x\epsilon_x + y\epsilon_y}{x+y}}\Biggr)\]

The relative errors combine in a more complicated way. The relative error of the result is a weighted average of the individual relative errors that depends explicitly on \(\mathsf{x}\) and \(\mathsf{y}\). The two important limiting cases are \(\mathsf{x} \to -\mathsf{y}\) and \(\mathsf{x} \to \mathsf{y}\). Since \(\mathsf{x+y}\) appears in the denominator, it’s clear that the error swamps the result when \(\mathsf{x}\) becomes sufficiently close to \(-\mathsf{y}\). On the other hand, if \(\mathsf{x}\) and \(\mathsf{y}\) are very close in value, the relative error goes as \(\mathsf{(\epsilon_x+\epsilon_y)/2}\), the average of the individual relative errors, which is safe.