# Floating-point representations

## Machine representations of real numbers

We have already discussed how a base-2 positional number system can be used to represent the set of integers (i.e., \(\mathsf{\ldots a_3a_2a_1a_0}\) with \(\mathsf{a_k \in \{ 0, 1 \}}\)). What prevents us from faithfully implementing such a system on a computer is that the number of available bits is limited. Moreover, for practical reasons, it is best if

- the widths of the various integer types are standardized
- those widths are all multiples of one byte (8 bits)

This means that we can at best represent integers in the ranges 0, …, 255 (8-bit), 0, …, 65535 (16-bit), or 0, …, 4294967295 (32-bit). With two’s complement, we can represent negative numbers by shifting these ranges to be (almost) symmetric about zero.

But the integers are special. If we count the integers
\(\mathsf{i}\) in a bounded range \(\mathsf{a
< i < b}\), we find that there are always a finite number of them. The
reals don’t share this property. There are an *uncountably* infinite
number of real values \(\mathsf{x}\) in every bounded range
\(\mathsf{a < x < b}\). Hence, it is not sufficient to
restrict the range of the reals in order to build a finite
representation for them. Instead, the usual strategy is to work with
rational numbers (\(\mathbb{Q} \subset \mathbb{R}\)) and then
limit ourselves to a finite subset of those.

## Radix point binary numbers

A complete description of the reals would involve a representation with a potentially infinite number of integer digits (left of the radix point) and a potentially infinite number of fractional digits (right of the radix point): \(\mathsf{\ldots a_2a_1a_0.a_{-1}a_{-2}a_{-3}\ldots}\) with \(\mathsf{a_k \in \{ 0, 1 \}}\).

As usual, the nonnegative subscripts indicate multiples of \(\mathsf{2^0}\), \(\mathsf{2^1}\), \(\mathsf{2^2}\) and the negative ones multiples of \(\mathsf{2^{-1}}\), \(\mathsf{2^{-2}}\), \(\mathsf{2^{-3}}\). For example, \(\mathsf{(101.011)_2 = 2^2 + 2^0 + 2^{-2} + 2^{-3} = 5.375}\).

## Fixed-point binary

So what if we make the representation finite by truncating the integer and fractional digits? In that case, the radix point can be eliminated by turning the number into a fraction. Consider the 8-bit binary number with three fractional digits, \(\mathsf{(a_4a_3a_2a_1a_0.a_{-1}a_{-2}a_{-3})_2}\), which is equal to the 8-bit integer \(\mathsf{(a_4a_3a_2a_1a_0a_{-1}a_{-2}a_{-3})_2}\) divided by \(\mathsf{2^3 = 8}\).

Clearly, this representation spans the region from 0 to 255/8 in increments of 1/8. Even if we use many more digits than 8, we will always run into the problem that extremely large and extremely small numbers can’t be represented.

## Floating-point numbers

For reliable numerical computation, we need a representation that can
work across many orders of magnitude. *Floating-point* is
a finite-precision approximation to the familiar scientific notation,
\(\mathsf{\pm\, m \times 10^{e}}\) with \(\mathsf{1 \le m < 10}\).

The floating-point representation is a *binary* version of scientific
notation that uses a fixed number of bits to store the significand and
the exponent. Formally speaking, this scheme approximates the reals by
a subset of the rationals. It manages to cover many orders of magnitude
by implementing a logarithmic mesh: i.e., there are an equal number of
uniformly spaced values in each interval
\(\mathsf{[2^n,2^{n+1})}\).

For example, various intervals are covered as follows.

- \(\mathsf{[1,2)}\) in a grid of spacing \(\mathsf{2^{-23}}\) (`float`) or \(\mathsf{2^{-52}}\) (`double`)
- \(\mathsf{[2,4)}\) in a grid of spacing \(\mathsf{2^{-22}}\) (`float`) or \(\mathsf{2^{-51}}\) (`double`)
- \(\mathsf{[4,8)}\) in a grid of spacing \(\mathsf{2^{-21}}\) (`float`) or \(\mathsf{2^{-50}}\) (`double`)
- etc.

A floating-point number is represented as

\(\mathsf{x = s \times m \times 2^{e-b}},\)

where \(\mathsf{s=\pm 1}\) is the sign, \(\mathsf{m}\) is the significand (or mantissa), \(\mathsf{e}\) the exponent, and \(\mathsf{b}\) the offset (or bias). Each of these parts is itself stored in a fixed-width format.

## Organization of the bit patterns

The internal format (i.e., the storage arrangement of the bits) for the
32- and 64-bit floating point types is arbitrary but has been
standardized across platforms since the mid-1980s. For the 32-bit type
(called `float` in C++, `f32` in Rust, `Float32` in Julia) there
is a 1-bit sign field, an 8-bit exponent field biased by 127, and
a 23-bit fraction field; for the 64-bit type (`double`) the exponent field
is 11 bits wide, offset by 1023, and the fraction field is 52 bits wide.

Consider the real number \(\mathsf{25.5}\), which happens to
have an exact floating-point representation: \(\mathsf{25.5 = (11001.1)_2 = (1.10011)_2 \times 2^4}\). Stored as a `float`, its
exponent field is set to \(\mathsf{131=10000011_2}\); its
fraction field has three active bits in the first, fourth, and fifth
position.

It is often convenient to denote the bit pattern in base 16, since it is much more compact (requiring four times fewer digits than binary). In this case, the correct value is 0x41CC0000. (Recall that 0x is the C++ prefix that marks a number as hexadecimal.)

The Institute of Electrical and Electronics Engineers (IEEE) has
standardized several floating point formats. The two most common
correspond to the C++ types `float` and `double`. They differ only
in the number of bits used in each field.

The single-precision floating point (`float`) is represented this way:

It has a 1-bit sign field, an 8-bit exponent field, and a 23-bit
fraction field. The exponent \(\mathsf{e \in \{ 0, 1,
\ldots, 254 \}}\) is offset by \(\mathsf{b = 127
= 01111111_2}\). The value \(\mathsf{e = 255}\) is reserved for
representing infinity and various error states. Hence, the actual
exponent range is \(\mathsf{-127 \le e-b \le 127}\). The
mantissa is understood to be \(\mathsf{m = 1.f}\), unless
\(\mathsf{e = 0}\), in which case the floating-point number is
said to be *denormalized* and is interpreted as \(\mathsf{s
\times 0.f \times 2^{1-b}}\). Note that the hidden bit in the mantissa
has flipped.

The double-precision floating point (`double` in C++, `f64` in Rust,
`Float64` in Julia, and `float` in Python) is identical to the
single-precision version except for the larger width of its fields:

It has a 1-bit sign field, an 11-bit exponent field, and a 52-bit fraction field. The exponent \(\mathsf{e \in \{ 0,1,\ldots ,2046 \}}\) is offset by \(\mathsf{b = 1023 = 01111111111_2}\); the value \(\mathsf{e = 2047}\) is reserved. Hence, the actual exponent range is \(\mathsf{-1023 \le e-b \le 1023}\).

## Some floating-point examples

The integers closest to zero can all be represented in floating-point format:

For very large integers, however, there are gaps. The numbers 50331648 and 50331652 are adjacent `float` values, and there’s no way to represent 50331649, 50331650, or 50331651!

And most real numbers (e.g., \(\mathsf{1.285}\)) cannot be represented exactly either.

Sufficiently large or small numbers cannot be represented at all. For a `float`, the normalized values of maximum and minimum magnitude are \(\mathsf{\pm(2-2^{-23})\times 2^{127}} \approx \pm \mathsf{3.4}\times \mathsf{10^{38}}\) and \(\mathsf{\pm 2^{-126}} \approx \pm \mathsf{1.2}\times \mathsf{10^{-38}}\).

## Denormalized numbers

The mantissa is interpreted as \(\mathsf{1.f}\), unless the exponent
field is exactly zero. In that case, there is no implicit leading one
in the significand, i.e., \(\mathsf{m = 0.f}\), and the number is
said to be *denormalized*. Such numbers have an effective magnitude
below the usual range, but at the cost of lost precision.

For a `float`, the smallest non-denormalized number is \(\mathsf{(1.0\ldots0)_2 \times 2^{-126}} \approx \mathsf{1.18}\times \mathsf{10^{-38}}\).

The largest denormalized number is \(\mathsf{(0.1\ldots1)_2 \times 2^{-126}} = \mathsf{(1-2^{-23})\times 2^{-126}}\), which sits just below it.

Consider the following sequence of successive division by two, where each roman letter represents a binary digit.

Eventually, all digits of precision are lost and the division underflows to zero.

## Special floating-point values

The floating-point representation includes both positive and negative values of zero.

It excludes numbers with the highest possible value of the exponent,
which are reserved for the special values `inf`, `-inf`, and `nan`.

In the above, at least one of the bits marked x is nonzero.

One of the two `inf` values arises when the result of an arithmetic
calculation is too large in magnitude to be represented. Similarly,
a calculation whose result is too small in magnitude yields a zero.
In either case, the sign information is preserved. If a calculation
has no numerical sense (e.g., `sqrt(-2.0)` or `0/0`), then the result
will likely be a `nan`.

These special values have well-defined arithmetic properties:

## Representation errors

Floating-point numbers have a few obvious drawbacks:

- most real numbers cannot be represented exactly;
- small representation errors can accumulate systematically during arithmetic operations.

Let’s write a program that (in a quite naive way) attempts to range over the values −3, −2.9, −2.8, …, 2.8, 2.9 using a for loop that increments a float variable.

```
#include <iostream>
#include <iomanip>
using std::cout;
using std::endl;
using std::setw;

int main()
{
    cout.precision(10);
    int n = -30;
    for (float x = -3.0; x < 3.0; x += 0.1, n += 1)
        cout << setw(16) << x << setw(6) << 0.1*n << endl;
    return 0;
}
```

As the output of this program makes clear, repeatedly adding `0.1` to
`-3.0` does not give the desired result. Each value is slightly off,
and the comparison (`x < 3.0`) fails to cut off the sequence at `2.9`.

```
-3 -3
-2.900000095 -2.9
-2.800000191 -2.8
-2.700000286 -2.7
-2.600000381 -2.6
-0.400000602 -0.4
-0.300000608 -0.3
-0.2000006139 -0.2
-0.1000006124 -0.1
-6.124377023e-07 0
0.09999939054 0.1
0.199999392 0.2
0.2999993861 0.3
0.3999993801 0.4
2.699999094 2.7
2.799998999 2.8
2.899998903 2.9
2.999998808 3
```

Why does this happen? Although `-3.0` can be exactly represented,
`0.1` cannot. The closest we can get is `0.1000000015`, which is
represented as follows.
The exact number is a nonterminating fraction: \(\mathsf{0.1} = \mathsf{1.6/16} = \mathsf{1.6} \times \mathsf{2^{-4}} = \mathsf{(1.10011001100\overline{1100}\ldots)_2} \times \mathsf{2^{123-127}}\), with the \(\mathsf{1100}\) sequence continuing indefinitely. This can be shoehorned into the 23-bit fractional field either by truncation or by rounding (banker’s rounding is often used):

Let’s look at what happens when we add 0.1 to −3.0 within the floating-point scheme. The two representations are

The fractional parts can only be added once the exponent fields have been made equal. Let’s show the mantissa’s hidden bit

and systematically shift the smaller magnitude number by factors of 2 until its exponent agrees with that of the larger:

At this point, we can carry out conventional binary subtraction \(-(\mathsf{3.0}-\mathsf{0.1})\)

```
1 10000000 1.10000000000000000000000
0 10000000 0.00001100110011001100110
------------------------------------
1 10000000 1.01111000000000000000000
0 10000000 0.00000100110011001100110
------------------------------------
1 10000000 1.01110100000000000000000
0 10000000 0.00000000110011001100110
------------------------------------
1 10000000 1.01110011100000000000000
0 10000000 0.00000000010011001100110
------------------------------------
1 10000000 1.01110011010000000000000
0 10000000 0.00000000000011001100110
------------------------------------
1 10000000 1.01110011001110000000000
0 10000000 0.00000000000001001100110
------------------------------------
1 10000000 1.01110011001101000000000
0 10000000 0.00000000000000001100110
------------------------------------
1 10000000 1.01110011001100111000000
0 10000000 0.00000000000000000100110
------------------------------------
1 10000000 1.01110011001100110100000
0 10000000 0.00000000000000000000110
------------------------------------
1 10000000 1.01110011001100110011100
0 10000000 0.00000000000000000000010
------------------------------------
1 10000000 1.01110011001100110011010
0 10000000 0.00000000000000000000000
------------------------------------
1 10000000 1.01110011001100110011010
```

The final result corresponds to the decimal number \(\mathsf{−2.9000000954}\). Beware
of putting too much faith in the final few decimal digits reported by
`cout` if your display precision exceeds the number of significant
digits. For example, if we set `cout.precision(11)`, we find that the
last 3 decimal digits are inconsistent:

```
-3
+0.10000000149
--------------
-2.89999999851 != -2.9000000954
```

Some practical advice: if you want to range over an evenly-spaced set of floating-point values (e.g., over an interval \([\mathsf{a,b}]\) in \(\mathsf{N}\) equal steps)

```
const double a = ...;
const double b = ...;
assert(a < b);
const int N = ...;
assert(N > 0);
const double width = b-a;
const double dx = width/N;
```

Then it’s best to replace loops like this

```
for (double x = a; x <= b; x += dx)
    foo(x);
```

with ones based on integer counting:

```
for (int n = 0; n <= N; ++n)
{
    const double x_n = a + dx*n;
    foo(x_n);
}
```

Constantly refreshing the value of `x_n` prevents the errors from
accumulating. And using an integer counter gives you greater control
over how you treat the endpoints of the interval.

## Numerical Error

Floating point numbers are a peculiar finite subset of the rationals, designed to span many orders of magnitude and to have a consistent number of values in each factor-two interval. Since the exponent field is finite, there are minimum and maximum values that can be represented.

| type | largest number | smallest number | epsilon |
| --- | --- | --- | --- |
| `float` | 3.40E+38 | 1.5E-45 | 1.19E-7 |
| `double` | 1.80E+308 | 5.0E-324 | 2.22E-16 |

This means that any number represented as a `float` that is greater
than \(\mathsf{3.4}\times \mathsf{10^{38}}\) is an `inf`; any that
is smaller than \(\mathsf{1.5}\times \mathsf{10^{-45}}\) is
identically `0`. Within the representable range, an arbitrary real
number \(\mathsf{x} = \hat{\mathsf{x}} + \delta_\mathsf{x}\) will be
closest to some floating point value \(\hat{\mathsf{x}}\) but
a distance \(\delta_\mathsf{x}\) away. It’s easy to show that the
relative error \(\epsilon_\mathsf{x}
= \delta_\mathsf{x}/\mathsf{x}\) is bounded by the machine epsilon
\(\epsilon_\textsf{m}\), defined to be the smallest value such that
\(1 + \epsilon_\textsf{m} > 1\). (That is, the bounds on the
*absolute error* \(\delta_\mathsf{x}\) scale with
\(|\mathsf{x}|\).)

In the case of a `float`, \(\mathsf{1} = \mathsf{+2^{127-127}}
\times \mathsf{(1+0.0\ldots 0)}\). The next highest number has an
identical bit pattern except that the least significant bit is on.
The difference corresponds to \(\epsilon_\textsf{m}
= \mathsf{2^{-23}} = \mathsf{1.19}\times \mathsf{10^{-7}}\). Hence, the
boolean test `1.0F + 1E-9F == 1.0F` evaluates to `true` (whereas
`1.0 + 1E-9 == 1.0` evaluates to `false`, because
\(\epsilon_\textsf{m} = \mathsf{2^{-52}} = \mathsf{2.22} \times
\mathsf{10^{-16}}\) for `double`s).

Such representation errors are only a small part of what you need to worry about. There are many other sources of error.

- Rounding and truncation errors occur during conversions between types.
- Errors are introduced when series expansions are cut off: e.g., `cos(0.5*M_PI) != 0.0`, even though \(\mathsf{\cos(\pi/2)}\) is zero; such behaviour is common to all transcendental functions in the `cmath` library.
- Addition of floating point values having very different orders of magnitude may lead to (mild) loss of significance.
- Subtraction of nearly equal values may lead to (severe) loss of significance.
- Multiplication can lead to underflow or overflow if the exponents of the two operands are both large or both small.
- Division is never entirely safe, since verifying that the divisor is not zero does not guarantee that a division will not overflow and yield infinity.

It is also important to understand that some of the familiar algebraic properties of the real numbers no longer hold. Floating-point addition and multiplication are both commutative (\(\mathsf{a+b} = \mathsf{b+a}\) and \(\mathsf{a\times b} = \mathsf{b\times a}\)), but they are not necessarily associative (the sum \(\mathsf{(a+b)+c}\) may differ from \(\mathsf{a+(b+c)}\)) or distributive (\(\mathsf{(a+b)\times c}\) may not be the same as \(\mathsf{ac+bc}\)). A perhaps clearer statement is that floating-point arithmetic operations are unambiguously defined pairwise, but their order of operation matters when more than two operands are involved.

## Loss of significance

Of the many ways to lose precision during routine arithmetic operations, subtraction is the most worrisome. For example, consider the difference \(\mathsf{1.7890625-1.783203125}\), assuming that we have a floating point representation with 10 digits of precision:

```
1.110010100
-1.110010001
------------
0.000000011
```

Since the highest eight bits match, the result, whose decimal value is \(\mathsf{1.5}\times \mathsf{2^{-8}} = \mathsf{0.005859375}\), has only two significant binary digits. The loss of precision here is catastrophic (from 10 digits to 2). This is not necessarily a problem. If both \(\mathsf{1.7890625}\) and \(\mathsf{1.783203125}\) are exact then so is the difference. More likely, however, these numbers are floating-point approximations with their own representation error, or worse, the results of arithmetic calculations that include other sources of error. Even assuming the best-case scenario (representation error only), we can see how disastrous the situation is.

If only the least significant bit is in doubt, then the real numbers
\(\mathsf{x} = \mathsf{1.110010100_2} + \mathsf{\delta_x}\) and
\(\mathsf{y = 1.110010001_2} + \mathsf{\delta_y}\) lie in the ranges
\(\mathsf{1.110010011} < \mathsf{x} < \mathsf{1.110010101}\) and
\(\mathsf{1.110010000} < \mathsf{y} < \mathsf{1.110010010}\). This
means that the *true* difference \(\mathsf{x-y}\) might be as small
as

```
1.110010011
-1.110010010
------------
0.000000001
```

or as large as

```
1.110010101
-1.110010000
------------
0.000000101
```

This puts it in the range \(\mathsf{2^{-9}} < \mathsf{x-y} < \mathsf{1.25}\times \mathsf{2^{-7}}\). Note that a relative error of \(\mathsf{2^{-9}} \approx \mathsf{0.2\%}\) on each of \(\mathsf{x}\) and \(\mathsf{y}\) becomes

on their difference. A relative error greater than 100% means that we
know almost *nothing* about the value.

Addition of many numbers is another important source of error. Suppose that we want to sum the numbers (2-digit decimal floating point with rounding) \(\mathsf{1.0}, \mathsf{0.01}, \mathsf{0.01}, \ldots, \mathsf{0.01}\), where the value \(\mathsf{0.01}\) occurs 99 times. If we accumulate the numbers from left to right in the list, we find that we simply execute the addition \(\mathsf{1.0} + \mathsf{0.01} = \mathsf{1.0}\) (rounded to two digits) each of 99 times, and the result is \(\mathsf{1.0}\). On the other hand, if we accumulate in the reverse order, the small numbers add as follows.

After the final addition, \(\mathsf{0.99} + \mathsf{1.0} \Rightarrow \mathsf{2.0}\), the result is quite different (and considerably closer to the true value \(\mathsf{1.99}\)). Why the discrepancy? Of course, it has nothing to do with whether we add left-to-right or right-to-left. Rather, in the first case, we were adding in descending order and in the second, ascending. As a general rule, adding in ascending order better preserves significance.

Given a set of floating-point values to be added, the safest procedure is to store the values in an array, sort the elements by absolute value in ascending order, and then accumulate the values from lowest to highest array index. This procedure may require considerable memory resources and a lot of computation. Is it really worth the trouble? Only sometimes.

For the loss of significance to really matter,

- the floating-point values being summed must cover a very wide range of orders of magnitude (rivalling the number of digits in the significand), and
- the smallest values must appear in great enough numbers.

A slightly more formal statement: if we sum a sequence of values \((\mathsf{x_n})\) within the floating-point scheme, the accumulated error will depend on the behaviour at the low and high end of the distribution \(\mathsf{p(x)} = \sum_\mathsf{n} \delta(\mathsf{x-x_n})\).

## Overflow

Consider how we might calculate the norm of the 2-vector
\((\mathsf{x,y})\) or the complex number \(\mathsf{x+iy}\). The
most natural thing to write is `sqrt(x*x+y*y)`, but why not use an
expression based on any of these mathematically equivalent forms?

In fact, the natural expression is not the best choice. If
\(\mathsf{x}\) or \(\mathsf{y}\) is large
enough, then squaring it will yield `inf`. A better approach is to
factor out the component of largest magnitude.

```
#include <cmath>
using std::fabs;
using std::sqrt;

double norm(double x, double y)
{
    const double xx = fabs(x);
    const double yy = fabs(y);
    if (xx > yy)
    {
        const double r = yy/xx;
        return xx*sqrt(1.0 + r*r);
    }
    else if (yy > 0.0)
    {
        const double r = xx/yy;
        return yy*sqrt(1.0 + r*r);
    }
    else
        return 0.0;  // both components are zero; avoid computing 0/0
}
```

It’s always worth giving some thought to how your algorithm will behave in the extreme cases.

## Testing for equality

Remember that the operator `==` tests whether two floating point
values are exactly equal: i.e., whether they have identical bit
patterns. Sometimes, this is the test you want to perform. More often,
you are interested in whether the two numbers are equal within a range
of uncertainty. In other words, you want a weaker test that tolerates
discrepancy in some number of the least significant bits. For example,
these two numbers might be judged to be equal:

A common idiom for performing such a test is as follows. Two floating-point numbers
`x` and `y` are sufficiently close if

```
fabs(x-y) < fabs(x)*eps and fabs(x-y) < fabs(y)*eps
```

evaluates to `true`. Depending on the circumstances, `eps` may be
the machine epsilon or something more forgiving.

## Floating-point operations

The floating point value \(\mathsf{\hat{x}} = \mathsf{(-1)^s}\times
\mathsf{(1.f)_2} \times \mathsf{2^{e-b}}\) is represented in terms of bit
strings of finite length that encode the sign (\(\mathsf{s}\)),
fraction (\(\mathsf{f}\)), and exponent (\(\mathsf{e}\)). Since
the binary value of \(\mathsf{e}\) is nonnegative, a bias
\(\mathsf{b} = \mathsf{011}\ldots \mathsf{1_2}\) is used to make the
exponent range symmetric about zero. The maximum value of
\(\mathsf{e}\) (all bits on) is reserved for the `inf`, `-inf`,
and `NaN` values. The minimum value of \(\mathsf{e}\) (all bits
off) signals that the number is denormalized. In that case, the hidden
bit in the significand is zero rather than one: \(\hat{\mathsf{x}}
= \mathsf{(-1)^s(0.f)_2} \times \mathsf{2^{1-b}}\).

The significand takes values in the range \(\mathsf{1} \le \mathsf{(1.f)_2} < \mathsf{2}\), unless the floating-point value is denormalized, in which case \(\mathsf{0} \le \mathsf{(0.f)_2} < \mathsf{1}\). These ranges are imposed by our definition of the floating-point format. Where they are not respected, the significand should be rescaled down or up by factors of 2 and the value of the exponent correspondingly increased or decreased. (This is completely analogous to the base-10 case, in which we rewrite \(\mathsf{0.0567} \times \mathsf{10^{-7}}\) as \(\mathsf{5.67} \times \mathsf{10^{-9}}\) in order to conform to the standards of scientific notation.)

We would like to understand how arithmetic operations (\(+,-,\times,\div\)) on floating-point numbers are carried out. Consider two floating-point numbers \(\hat{\mathsf{x}}\) and \(\hat{\mathsf{y}}\).

Their product is rather simple, since the exponents just add.

If we want to interpret the result itself as a floating-point value \(\hat{\mathsf{z}} = \hat{\mathsf{x}}\hat{\mathsf{y}}\), then we have to check whether the significand sits in the required range. The sign bit is always the sum \(\mathsf{s_z} = \mathsf{s_x+s_y}\) (modulo 2); the exponent field is either \(\mathsf{e_z} = \mathsf{e_x+e_y - b}\) or \(\mathsf{e_z = e_x+e_y - b+1}\) depending on whether \(\mathsf{1} \le \mathsf{(1.f_x)_2(1.f_y)_2} < \mathsf{2}\) or \(\mathsf{2} \le \mathsf{(1.f_x)_2(1.f_y)_2} < \mathsf{4}\); the significand of the result is interpreted as follows:

The multiplication of the two significands should ideally take place in a register wider than the \(\mathsf{f_x}\) and \(\mathsf{f_y}\) bit fields and the result should only be reduced to the correct width at the end of the calculation (by rounding rather than truncating). On a typical computer, such calculations are performed in a dedicated floating-point unit (FPU). Let’s work out two detailed examples: the following multiplications, using four bits for \(\mathsf{f_x}\) and \(\mathsf{f_y}\) and eight bits for the intermediate calculations, yield \(\mathsf{f_z} = \mathsf{1101}\) and \(\mathsf{f_z} = \mathsf{1000}\).

```
1.1010 (1.fx)
× 1.0010 (1.fy)
--------
0.00110100
+ 1.10100000
------------
1.11010100
------------
1.1101
1.1010 (1.fx)
× 1.1101 (1.fy)
--------
0.00011010
0.01101000
0.11010000
+ 1.10100000
------------
10.11111010
-------------
1.011111010 × 10
-------------
1.1000 × 10
```

Of course, we have ignored many possibilities that should be accounted for in a real implementation: one or both of \(\hat{\mathsf{x}}\) and \(\hat{\mathsf{y}}\) may be denormalized; the exponent term \(\mathsf{e_z}\) may overflow or underflow.

Let’s now think about the addition operation. The sum of \(\hat{\mathsf{x}}\) and \(\hat{\mathsf{y}}\) is given by

In the second line above, we’ve reorganized the result under the assumption that the magnitude of \(\hat{\mathsf{x}}\) is equal to or larger than that of \(\hat{\mathsf{y}}\) (no loss in generality). What we’ve done is factor out the sign and magnitude of \(\hat{\mathsf{x}}\). The part in square brackets determines the significand and exponent shift of the floating-point value \(\hat{\mathsf{z}} = \hat{\mathsf{x}}+\hat{\mathsf{y}}\). Its evaluation is quite complicated, since there are five cases to consider (more, if you take account of denormalized values, overflow, and underflow).

The corresponding exponents are \(\mathsf{e_z} = \mathsf{e_x}\), \(\mathsf{e_x+1}\), \(\mathsf{e_x-1}\), \(\mathsf{e_x+1}\), \(\mathsf{e_x-p}\). The last case is the most dangerous one: if \(\mathsf{x}\) and \(\mathsf{y}\) are the same order of magnitude but opposite sign and if the \(\mathsf{p}\) highest bits of \(\mathsf{1.f_x}\) and \(\mathsf{1.f_y}\) agree (\(\mathsf{1} \le \mathsf{p} \le \textsf{width}(\mathsf{f})\)), then \(\mathsf{p}\) digits of significance are lost. The first case is also problematic: \(\mathsf{e_x-e_y}\) digits of \(\hat{\mathsf{y}}\) are truncated, which can be serious if \(\mathsf{e_x-e_y}\) is comparable to the number of digits in the fraction field; if \(\mathsf{e_x-e_y}\) exceeds the width of \(\mathsf{f}\), then \(\mathsf{\hat{x}+\hat{y}}\) and \(\hat{\mathsf{x}}\) are identical.

case 1: \(\mathsf{e_y - e_x} = \mathsf{-3}\), \(\mathsf{f_z} = \mathsf{1000}\), \(\mathsf{e_z} = \mathsf{e_x}\)

```
1.0101
+ (1.1001 / 1000)
-----------------
1.01010000
+ 0.00110010
-------------
1.10000010
-------------
1.1000
ffff
```

case 2: \(\mathsf{e_y-e_x} = \mathsf{-2}\), \(\mathsf{f_z} = \mathsf{1000}\), \(\mathsf{e_z} = \mathsf{e_x+1}\)

```
1.1101
+ (1.011 / 100)
---------------
1.11010000
+ 0.01011000
-------------
10.00101000
-------------
1.000101000 × 10
-------------------
1.0001
ffff
```

case 5: \(\mathsf{e_y-e_x} = \mathsf{0}\), \(\mathsf{f_z} = \mathsf{0000}\), \(\mathsf{e_z} = \mathsf{e_x-4}\)

```
1.0111
− 1.0110
--------
1.01110000
− 1.01100000
------------
0.00010000
------------
1.00000000 / 10000
--------------------
1.0000
ffff
```

In the first two examples above, some accuracy is lost when the final result is reduced to four fraction bits, but in both cases the 5 remaining digits are all significant. In the last example, only the first digit of the result is significant.

## Propagation of errors

Floating-point numbers are an approximation to the reals. We can express the discrepancy in either absolute or relative terms: \(\mathsf{x} = \hat{\mathsf{x}} + \delta_\mathsf{x} = \hat{\mathsf{x}}(\mathsf{1+\epsilon_x})\). The relative error is related to the concept of “number of digits of accuracy.” Roughly speaking, the number of binary digits that are meaningful is \(-\textsf{log}_2(|\epsilon_\mathsf{x}|)\).

Working to first order (i.e., neglecting terms of \(\mathsf{O}(\epsilon^\mathsf{2})\)), we find that under multiplication, the relative errors add (\(\epsilon_{\mathsf{xy}} = \epsilon_\mathsf{x}+\epsilon_\mathsf{y}\)):

Under addition, it is the absolute errors that add (\(\mathsf{\delta_{x+y}} = \mathsf{\delta_x} + \mathsf{\delta_y}\)):

The relative errors combine in a more complicated way. The relative error of the result is a weighted average of the individual relative errors that depends explicitly on \(\mathsf{x}\) and \(\mathsf{y}\). The two important limiting cases are \(\mathsf{x} \to -\mathsf{y}\) and \(\mathsf{x} \to \mathsf{y}\). Since \(\mathsf{x+y}\) appears in the denominator, it’s clear that the error swamps the result when \(\mathsf{x}\) becomes sufficiently close to \(-\mathsf{y}\). On the other hand, if \(\mathsf{x}\) and \(\mathsf{y}\) are very close in value, the relative error goes as \((\epsilon_\mathsf{x}+\epsilon_\mathsf{y})/\mathsf{2}\), which is safe.
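
The weighted average can be written out explicitly (a short derivation in the document’s notation, using \(\mathsf{x} = \hat{\mathsf{x}} + \delta_\mathsf{x}\) and \(\epsilon_\mathsf{x} = \delta_\mathsf{x}/\mathsf{x}\)):

\[
\epsilon_{\mathsf{x+y}} \;=\; \frac{\delta_\mathsf{x} + \delta_\mathsf{y}}{\mathsf{x+y}} \;=\; \frac{\mathsf{x}\,\epsilon_\mathsf{x} + \mathsf{y}\,\epsilon_\mathsf{y}}{\mathsf{x+y}},
\]

so that \(\epsilon_{\mathsf{x+y}} \to (\epsilon_\mathsf{x}+\epsilon_\mathsf{y})/\mathsf{2}\) as \(\mathsf{x} \to \mathsf{y}\), while the denominator vanishes as \(\mathsf{x} \to -\mathsf{y}\).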