Assignment 2
In the usual nomenclature, 32-bit and 64-bit (IEEE Standard 754) floating point numbers are referred to as single precision and double precision. Suppose that we invent an 8-bit quarter-precision standard consisting of one sign bit, three exponent bits, and four mantissa bits:
s | eee | mmmm
The sign bit signals whether the number is positive (\(\mathsf{0}\)) or negative (\(\mathsf{1}\)), and the exponent uses an offset \(\mathsf{(011)_2 = 0+2+1 = 3}\). This means that the numbers \(\mathsf{1}\) and \(\mathsf{-2}\) are represented as
\(\mathsf{+1 = +1 \times 2^0 = +(1.0000)_2 \times 2^{3-3} =}\) 0 | 011 | 0000
\(\mathsf{-2 = -1 \times 2^1 = -(1.0000)_2 \times 2^{4-3} =}\) 1 | 100 | 0000
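The layout above can be sketched as a small decoder that turns a bit pattern into its decimal value. This is only an illustrative sketch: the function name `decode` is hypothetical, and it assumes (by analogy with IEEE 754, not stated explicitly in the text) that the exponent field 111 is reserved for infinity and NaN, and that the exponent field 000 holds zero and the denormalized numbers with an effective exponent of 1 - 3.

```python
BIAS = 3  # exponent offset (011)_2 given in the text


def decode(bits: str) -> float:
    """Decode a pattern such as '0 | 011 | 0000' into a Python float."""
    # Allow the spaced/piped layout used in the text.
    bits = bits.replace(" ", "").replace("|", "")
    if len(bits) != 8 or set(bits) - {"0", "1"}:
        raise ValueError("expected exactly 8 binary digits")

    sign = -1.0 if bits[0] == "1" else 1.0
    exponent = int(bits[1:4], 2)   # three exponent bits
    mantissa = int(bits[4:], 2)    # four mantissa bits, as an integer 0..15

    if exponent == 0b111:
        # Assumption: reserved for infinity and NaN, as in IEEE 754.
        return sign * float("inf") if mantissa == 0 else float("nan")
    if exponent == 0:
        # Assumption: zero and denormalized numbers have no implicit leading 1.
        return sign * (mantissa / 16) * 2.0 ** (1 - BIAS)
    # Normalized numbers: implicit leading 1, exponent stored with the offset.
    return sign * (1 + mantissa / 16) * 2.0 ** (exponent - BIAS)


if __name__ == "__main__":
    print(decode("0 | 011 | 0000"))  # the pattern given for +1
    print(decode("1 | 100 | 0000"))  # the pattern given for -2
```

Feeding it the two patterns from the examples is a quick way to confirm that the sign, exponent, and mantissa fields are being read in the intended order.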
What is the largest positive, finite number that can be represented in this scheme? Report the decimal value and the underlying bit pattern.
What is the largest (in magnitude) negative, finite number that can be represented in this scheme? Report the decimal value and the underlying bit pattern.
What is the smallest positive, nonzero, normalized (that is, not denormalized) number that can be represented? Report the decimal value and the underlying bit pattern.
Here is the bit pattern for zero:
0 | 000 | 0000
Report the decimal value and bit pattern of the positive, denormalized number that is closest to zero; also report the largest, positive denormalized number.
Consider the following numbers:
\[\begin{aligned} \mathsf{u = 2 = \bigl(1+0\bigr)\times 2^1} &= \mathsf{(1.0000)_2 \times 2^{4-3}}\\ \mathsf{w = 2.25 = \bigl(1+\tfrac{1}{8}\bigr)\times 2^1} &= \mathsf{(1.0010)_2 \times 2^{4-3}}\\ \mathsf{x = 4.25 = \bigl(1+\tfrac{1}{16}\bigr)\times 2^2} &= \mathsf{(1.0001)_2 \times 2^{5-3}}\\ \mathsf{y = 4.5 = \bigl(1+\tfrac{1}{8}\bigr) \times 2^2} &= \mathsf{(1.0010)_2 \times 2^{5-3}} \end{aligned}\]
Express each of \(\mathsf{u}\), \(\mathsf{w}\), \(\mathsf{x}\), \(\mathsf{y}\) in the 8-bit floating-point scheme. Then determine each difference of squares in two ways, with every intermediate result stored in the 8-bit scheme: directly as \(\mathsf{x^2 - y^2}\) and in factored form as \(\mathsf{(x+y)(x-y)}\); likewise \(\mathsf{u^2 - w^2}\) and \(\mathsf{(u+w)(u-w)}\).
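One way to check the hand calculation is to simulate the 8-bit scheme by rounding the result of every single operation back onto the quarter-precision grid. The sketch below does that under assumptions the text does not spell out: rounding is to nearest with ties going to the even mantissa, the exponent field 111 is reserved so that any result beyond the largest finite value becomes infinity, and denormalized numbers are supported. All function names (`quantize`, `q_add`, `q_sub`, `q_mul`, `diff_of_squares_direct`, `diff_of_squares_factored`) are hypothetical.

```python
import math

BIAS = 3           # exponent offset from the text
MANT_BITS = 4      # mantissa bits from the text
E_MIN = 1 - BIAS   # exponent of the smallest normalized numbers (field 001)
E_MAX = 6 - BIAS   # largest finite exponent, assuming field 111 is reserved


def quantize(value: float) -> float:
    """Round a double to the nearest quarter-precision value (ties to even)."""
    if value == 0 or math.isnan(value) or math.isinf(value):
        return value
    # abs(value) = m * 2**ex with 0.5 <= m < 1, so the binary exponent is ex - 1.
    _, ex = math.frexp(abs(value))
    e = max(ex - 1, E_MIN)            # clamp: tiny values use the denormal grid
    ulp = 2.0 ** (e - MANT_BITS)      # spacing of representable values near value
    rounded = round(value / ulp) * ulp  # Python's round() breaks ties to even
    if abs(rounded) >= 2.0 ** (E_MAX + 1):
        # Assumption: overflow past the largest finite value gives infinity.
        return math.copysign(math.inf, value)
    return rounded


def q_add(a: float, b: float) -> float:
    return quantize(a + b)


def q_sub(a: float, b: float) -> float:
    return quantize(a - b)


def q_mul(a: float, b: float) -> float:
    return quantize(a * b)


def diff_of_squares_direct(a: float, b: float) -> float:
    """a*a - b*b, rounding after each of the three operations."""
    return q_sub(q_mul(a, a), q_mul(b, b))


def diff_of_squares_factored(a: float, b: float) -> float:
    """(a + b)*(a - b), rounding after each of the three operations."""
    return q_mul(q_add(a, b), q_sub(a, b))


if __name__ == "__main__":
    for name, a, b in [("x, y", 4.25, 4.5), ("u, w", 2.0, 2.25)]:
        print(name,
              "direct:", diff_of_squares_direct(a, b),
              "factored:", diff_of_squares_factored(a, b),
              "exact:", a * a - b * b)
```

The inputs are exactly representable as doubles and each exact sum or product fits in a double without error, so the only rounding that occurs is the simulated 8-bit rounding. Comparing the two forms against the exact value shows where squaring first loses information that the factored form keeps.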