Lab 3

Get the code

Find the Lab 3 code in the class repository.

$ ls
README.md  classes/   labs/      scripts/
$ ls labs/lab3
README    cpp/      julia/    python/   rust/
$ ls labs/lab3/cpp
add_methods.cpp  bitwise.cpp      fp.cpp           sumdata.cpp
add_methods.h    curves.cpp       makefile         view-sum.gp
bitconvert.h     denorm.cpp       max.cpp

The instructions here focus on C++, but you may carry out the comparable steps in Julia, Python, or Rust. You only need to submit your solutions for a single programming language.

Curves

The lab3 directory contains a short C++ program named curves.cpp that instructs the computer to output \(\mathsf{x}\) and \(\mathsf{x^2 + 2x + 3}\) in two columns (of width 20 characters) with \(\mathsf{x}\) ranging from −10 to 10 in increments of 0.2. The make command invokes the compiler, which creates an executable named curves.

$ make curves
g++ -o curves curves.cpp -O -ansi -pedantic -Wall
$ ls
add_methods.cpp  bitwise.cpp      denorm.cpp       max.cpp
add_methods.h    curves*          fp.cpp           sumdata.cpp
bitconvert.h     curves.cpp       makefile         view-sum.gp

The program outputs directly to the terminal. There are likely more lines of output from curves than there are lines in your terminal window. You can scroll through the output by piping it to the more command.

$ ./curves
              -10                  83
             -9.8               79.44
             -9.6               75.96
             -9.4               72.56
                .                   .
                .                   .
                .                   .
             -0.4                2.36
             -0.2                2.64
     -2.05391e-15                   3
              0.2                3.44
              0.4                3.96
                .                   .
                .                   .
                .                   .
              9.4              110.16
              9.6              114.36
              9.8              118.64
               10                 123
$ ./curves | more

Notice that we don’t get exactly \(\mathsf{x=0}\). Why might that be?

Use the shell’s redirection operator (>) to save the curves output to a file named poly.dat. Use tail to inspect the last five lines of the file. The data can be viewed graphically in a program called gnuplot.

$ ./curves > poly.dat
$ tail -n5 poly.dat
                 9.2              106.04
                 9.4              110.16
                 9.6              114.36
                 9.8              118.64
                  10                 123
$ gnuplot
> plot "poly.dat"
> plot "poly.dat" with lines
> plot "poly.dat", x**2 + 2*x + 3
> quit

The plot command will display both data from a file and arithmetic expressions in x. The with directive sets the style for the plot. You can see all the possible styles by typing help with at the gnuplot prompt. Note that gnuplot has an exponentiation operator (**), whereas C++ does not.

Open the curves.cpp file in your favourite text editor and modify it so that the program now outputs three columns of data containing \(\mathsf{x}\), \(\mathsf{p(x)}\), and \(\mathsf{q(x)}\), where \(\mathsf{p(x)} = \mathsf{x^2 + 2x + 3}\) and \(\mathsf{q(x)} = \mathsf{-\tfrac{1}{8}x^4 - \tfrac{1}{2}x^3 + x + 2}\). (Warning: the expressions 1/8 and 1.0/8 do not evaluate to the same number!) You will have to write a new polynom function to handle the fourth-order polynomial. While you’re at it, change the range to \(\mathsf{-15} \le \mathsf{x} \le \mathsf{15}\) and the increment to 0.1. Check that you can exactly reproduce the session below.

$ make
g++ -o curves curves.cpp -O -ansi -pedantic -Wall
$ ./curves > poly.dat
$ head -n3 poly.dat
                 -15                 198            -4653.62
               -14.9              195.21            -4519.98
               -14.8              192.44            -4389.22
$ tail -n3 poly.dat
                14.8              251.64            -7601.41
                14.9              254.81            -7798.13
                  15                 258            -7998.62
$ gnuplot
> plot "poly.dat"
> plot "poly.dat", x**2 + 2*x + 3
> plot "poly.dat" using 1:2, x**2 + 2*x + 3
> plot "poly.dat" using 1:3, -0.125*x**4 - 0.5*x**3 + x + 2
> plot "poly.dat" using 1:2 with line, "poly.dat" using 1:3 with lines
> plot[-5:5] "poly.dat" using 1:2 with line, "poly.dat" using 1:3 with lines
> plot[-4:1][0:10] "poly.dat" using 1:2 with line, "poly.dat" using 1:3 with lines
> plot[-3:1] "poly.dat" using 1:(8*$3-2*$2), -x**4-4*x**3-2*x**2+4*x+10
> quit

The range is set explicitly by including [xmin:xmax][ymin:ymax]. In the final plot command shown above, the using 1:(8*$3-2*$2) directive says to plot column 1 on the x-axis and 8 times column 3 minus 2 times column 2 on the y-axis. This linear combination corresponds to the polynomial

\[\begin{aligned} \mathsf{8q(x) - 2p(x)} &= \mathsf{8\times \bigl( -\tfrac{1}{8}x^4 - \tfrac{1}{2}x^3 + x + 2 \bigr) - 2 \times \bigl( x^2 + 2x + 3 \bigr)}\\ &= \mathsf{-x^4 - 4x^3 + 8x + 16 - 2x^2 - 4x - 6}\\ &= \mathsf{-x^4 - 4x^3 - 2x^2 + 4x + 10}. \end{aligned}\]

By default, on my machine at least, gnuplot displays its output in a qt window. (On legacy unix systems, the default gnuplot terminal may be x11.) You can specify that the output be sent to a file instead (in any one of many possible graphics formats). For example, the following session creates a plot in Portable Document Format.

$ gnuplot
> plot[-3:1] "poly.dat" using 1:(8*$3-2*$2), -x**4-4*x**3-2*x**2+4*x+10
> set terminal pdfcairo
> set output 'poly.pdf'
> replot
> quit
$ ${YOUR_PDF_READER} poly.pdf

gnuplot can fit numerical data to a given analytical form. In the following session, a general fourth-order polynomial f(x) is defined with five free parameters. Those free parameters are passed to the fit command using via.

$ gnuplot
> plot[-3:1] "poly.dat" using 1:(8*$3-2*$2)
> f(x) = a*x**4 + b*x**3 + c*x**2 + d*x + e
> fit[-3:1] f(x) "poly.dat" using 1:(8*$3-2*$2) via a,b,c,d,e
> plot[-3:1] "poly.dat" using 1:(8*$3-2*$2), f(x)
> quit

Find the best fit to a general third-order polynomial over the range [-3:1]. The fit should be quite bad. Now redo the fit for a fifth-order polynomial. What is the value of the \(\mathsf{x^5}\) coefficient? Why isn’t it exactly zero?

Floating-point arithmetic

The program fp.cpp is designed to showcase the rules for floating point arithmetic—in particular, how the special values inf, -inf, and nan are generated and how they propagate through arithmetic calculations. Make the necessary changes to the program so that it produces the output shown below:

$ make fp
g++ -o fp fp.cpp -O2 -ansi -pedantic -Wall -lm
$ ./fp
Floating point division:
1 / 0 = inf
-1 / 0 = -inf
0 / 0 = nan
inf / 0 = inf
-inf / 0 = -inf
inf / inf = nan

Floating point multiplication:
1 * 0 = 0
-1 * 0 = -0
1 * inf = inf
-1 * inf = -inf
0 * inf = nan
-inf * 0 = nan
inf * inf = inf
inf * -inf = -inf
1 * nan = nan
-1 * nan = nan
0 * nan = nan
-0 * nan = nan

Floating point addition and subtraction:
1 + inf = inf
1 - inf = -inf
inf + inf = inf
inf - inf = nan
1 + nan = nan
1 - nan = nan

Can you think of a good reason why both 0 and inf should carry sign information?

Be aware that languages other than C++ may flag some of these operations as errors or exceptions. For instance, dividing by zero in Python triggers a ZeroDivisionError exception, which must be manually caught lest it crash the program:

a = 1
b = 0
try:
    value = a / b
except ZeroDivisionError:
    value = float('Inf')
print("{}/{} = {}".format(a, b, value))

Underlying bit patterns

The program bitwise.cpp displays the underlying 32-bit binary patterns that encode the floating point representations of \(\mathsf{1}\), \(\mathsf{3}\), and \(\mathsf{-18}\). The header bitconvert.h loads in a special type named convert32_t, which can be interpreted both as a float and as a 32-bit unsigned integer type (uint32_t). The values \(\mathsf{3F800000_{16}}\), \(\mathsf{40400000_{16}}\), and \(\mathsf{C1900000_{16}}\) are directly assigned to its .I32 data member; the corresponding values 1.0F, 3.0F, and -18.0F are accessed from .F32. C++ uses the convention that numbers beginning with the 0x prefix are interpreted in hexadecimal (base 16).

Extend the program so that it also outputs the bit encoding for 214.25F and -inf. Those values should be assigned to .I32 using the appropriate hexadecimal value. You should make use of the identity

\[\begin{aligned} \mathsf{214.25} & = \mathsf{128\biggl(1+ \frac{1}{2} + \frac{1}{8} + \frac{1}{32} + \frac{1}{64} + \frac{1}{512} \biggr)}\\ &= \mathsf{+2^{134-127}\bigl(1+2^{-1}+2^{-3}+2^{-5}+2^{-6}+2^{-9} \bigr)}. \end{aligned}\]

An infinite value is designated by all 1s in the exponent field and all 0s in the fraction field.

$ make bitwise
g++ -o bitwise bitwise.cpp -O2 -ansi -pedantic -Wall -lm
$ ./bitwise
Single-precision floating point:
s/eeeeeeee/fffffffffffffffffffffff
0 01111111 00000000000000000000000
1

s/eeeeeeee/fffffffffffffffffffffff
0 10000000 10000000000000000000000
3

s/eeeeeeee/fffffffffffffffffffffff
1 10000011 00100000000000000000000
-18

s/eeeeeeee/fffffffffffffffffffffff
0 10000110 10101100100000000000000
214.25

s/eeeeeeee/fffffffffffffffffffffff
1 11111111 00000000000000000000000
-inf

Overflow

Compile the program max.cpp and pipe the output of the resulting executable to the more command. This will allow you to scroll through the results (page by page using the space bar, and line by line using the up and down arrow keys).

This program computes the sequence \(\mathsf{1,10,100,1000,\ldots}\) using floating point numbers of varying width. Convince yourself that the largest finite numerical values of type float, double, and long double are on the order of \(\mathsf{10^{38}}\), \(\mathsf{10^{308}}\), and \(\mathsf{10^{4932}}\), respectively. (The size of the long double is architecture dependent, and it may differ on your machine.) Notice that there is no loss of precision in the sequence until an element is no longer representable.

In Julia, you can directly specify variables to have single or double precision with a type specifier: e.g., a::Float32 and b::Float64.

Underflow

Create a program file min.cpp using max.cpp as a template. It may be convenient to start with a copy:

$ cp max.cpp min.cpp
$ vim min.cpp

The new program should compute the decreasing sequence \(\mathsf{1,\tfrac{1}{10},\tfrac{1}{100},\tfrac{1}{1000},\ldots}\) using the float, double, and long double types. Have the program indicate exact zeros in text form.

$ make min
g++ -o min min.cpp -O2 -ansi -pedantic -Wall -lm
$ ./min | more
     1.00000e+00     1.00000000000000e+00     1.00000000000000000e+00
     1.00000e-01     1.00000000000000e-01     1.00000000000000000e-01
     1.00000e-02     1.00000000000000e-02     1.00000000000000000e-02
               .                        .                           .
               .                        .                           .
     1.00000e-38     1.00000000000000e-38     1.00000000000000000e-38
     1.00000e-39     1.00000000000000e-39     1.00000000000000000e-39
     1.00001e-40     1.00000000000000e-40     1.00000000000000000e-40
     9.99967e-42     1.00000000000000e-41     1.00000000000000000e-41
     1.00053e-42     1.00000000000000e-42     1.00000000000000000e-42
     9.94922e-44     1.00000000000000e-43     1.00000000000000000e-43
     9.80909e-45     1.00000000000000e-44     1.00000000000000000e-44
     1.40130e-45     1.00000000000000e-45     1.00000000000000000e-45
      exact zero     1.00000000000000e-46     1.00000000000000000e-46
      exact zero     1.00000000000000e-47     1.00000000000000000e-47
               .                        .                           .
               .                        .                           .
      exact zero    1.00000000000002e-308    1.00000000000000000e-308
      exact zero    1.00000000000002e-309    1.00000000000000001e-309
      exact zero    9.99999999999997e-311    1.00000000000000001e-310
      exact zero    9.99999999999948e-312    1.00000000000000001e-311
      exact zero    9.99999999998465e-313    1.00000000000000001e-312
      exact zero    1.00000000001329e-313    1.00000000000000001e-313
      exact zero    9.99999999963881e-315    1.00000000000000001e-314
      exact zero    9.99999998481684e-316    1.00000000000000001e-315
      exact zero    9.99999983659714e-317    1.00000000000000001e-316
      exact zero    1.00000023069254e-317    1.00000000000000001e-317
      exact zero    9.99998748495600e-319    1.00000000000000001e-318
      exact zero    9.99988867182683e-320    1.00000000000000001e-319
      exact zero    9.99988867182683e-321    1.00000000000000001e-320
      exact zero    9.98012604599318e-322    1.00000000000000001e-321
      exact zero    9.88131291682493e-323    1.00000000000000001e-322
      exact zero    9.88131291682493e-324    1.00000000000000001e-323
      exact zero               exact zero    1.00000000000000001e-324
      exact zero               exact zero    1.00000000000000001e-325
               .                        .                           .
               .                        .                           .
      exact zero               exact zero   1.00000000000000027e-4935
      exact zero               exact zero   1.00000000000000100e-4936
      exact zero               exact zero   9.99999999999990067e-4938
      exact zero               exact zero   1.00000000000006297e-4938
      exact zero               exact zero   9.99999999999698452e-4940
      exact zero               exact zero   9.99999999996053252e-4941
      exact zero               exact zero   9.99999999850245271e-4942
      exact zero               exact zero   9.99999999485725318e-4943
      exact zero               exact zero   1.00000001771172298e-4943
      exact zero               exact zero   9.99999981259727658e-4945
      exact zero               exact zero   9.99998523179914905e-4946
      exact zero               exact zero   9.99987587581319258e-4947
      exact zero               exact zero   9.99878231595362783e-4948
      exact zero               exact zero   9.98784671735798041e-4949
      exact zero               exact zero   9.84203873608268143e-4950
      exact zero               exact zero   1.09355985956474238e-4950
      exact zero               exact zero                  exact zero

The approach to zero is quite different from the approach to infinity that we encountered previously. Here, there is a steady loss of precision in the steps immediately preceding exact zero. Think about why that is.

Denormalization

Modify the program denorm.cpp so that the convert32_t variable S.F32 = 5.9E-37F is repeatedly halved until all of its bits are equal to zero. This process illustrates how loss of precision comes about in so-called denormalized numbers.

$ make denorm
g++ -o denorm denorm.cpp -O2 -ansi -pedantic -Wall -lm
$ ./denorm | more
Single-precision floating point:
s/eeeeeeee/fffffffffffffffffffffff
0 00000110 10010001100010001000000
5.9e-37

s/eeeeeeee/fffffffffffffffffffffff
0 00000101 10010001100010001000000
2.95e-37

s/eeeeeeee/fffffffffffffffffffffff
0 00000100 10010001100010001000000
1.475e-37

s/eeeeeeee/fffffffffffffffffffffff
0 00000011 10010001100010001000000
7.375e-38

s/eeeeeeee/fffffffffffffffffffffff
0 00000010 10010001100010001000000
3.6875e-38

s/eeeeeeee/fffffffffffffffffffffff
0 00000001 10010001100010001000000
1.84375e-38

s/eeeeeeee/fffffffffffffffffffffff
0 00000000 11001000110001000100000
9.21875e-39

s/eeeeeeee/fffffffffffffffffffffff
0 00000000 01100100011000100010000
4.60938e-39

s/eeeeeeee/fffffffffffffffffffffff
0 00000000 00110010001100010001000
2.30469e-39

s/eeeeeeee/fffffffffffffffffffffff
0 00000000 00011001000110001000100
1.15234e-39

s/eeeeeeee/fffffffffffffffffffffff
0 00000000 00001100100011000100010
5.76172e-40

s/eeeeeeee/fffffffffffffffffffffff
0 00000000 00000110010001100010001
2.88086e-40

s/eeeeeeee/fffffffffffffffffffffff
0 00000000 00000011001000110001000
1.44042e-40
.
.
s/eeeeeeee/fffffffffffffffffffffff
0 00000000 00000000000000000000010
2.8026e-45

s/eeeeeeee/fffffffffffffffffffffff
0 00000000 00000000000000000000001
1.4013e-45

s/eeeeeeee/fffffffffffffffffffffff
0 00000000 00000000000000000000000
0

Tally up, tally down

Write a program sum.cpp that computes the partial sum \(\mathsf{\sum_{n=1}^N \tfrac{1}{n}}\) for each of \(\mathsf{N}=\mathsf{10}, \mathsf{100}, \mathsf{1000}, \ldots, \mathsf{100\,000\,000}\). Accumulate the terms in a float (i.e. single-precision, 32-bit) variable in both ascending and descending order (with respect to \(\mathsf{n}\)), and be sure to store the index \(\mathsf{n}\) in an integer type wide enough to accommodate the largest upper limit of the sum. The two results should be written in columns alongside the corresponding value of \(\mathsf{N}\). The view-sum.gp script performs a visual comparison with the correct asymptotic behaviour,

\[\mathsf{\log N + \gamma + \frac{1}{2N} + \frac{1}{12N^2} + \cdots}\]

Which summation direction is better and why?

$ make sum
g++ -o sum sum.cpp -O2 -ansi -pedantic -Wall -lm
$ ./sum
             10             2.92897             2.92897
            100             5.18738             5.18738
           1000             7.48548             7.48547
          10000             9.78761              9.7876
         100000             12.0909             12.0902
        1000000             14.3574             14.3927
       10000000             15.4037              16.686
      100000000             15.4037             18.8079
$ gnuplot -persist view-sum.gp

Accumulation strategies

In various scientific settings, we often need to accumulate numerical values and their powers (e.g., to compute the standard deviation of a large data set). When the values span many orders of magnitude, the order in which the numbers are summed can make a big difference in how much accuracy is retained. A particularly difficult case occurs when we accumulate the partial sums of a divergent series. Let’s compare to some exact results:

\[\mathsf{\sum_{n=1}^N n} = \mathsf{\frac{N(N+1)}{2}},~~~ \mathsf{\sum_{n=1}^N n^2} = \mathsf{\frac{N}{6}\bigl(2N^2+3N+1\bigr)},~~~ \mathsf{\sum_{n=1}^N n^3} = \mathsf{\frac{N^2}{4}\bigl(N^2+2N+1\bigr)}.\]

The values for \(\mathsf{N} = \mathsf{100\,000}\) are given in the table below. We discover that the second and third powers can’t even be represented exactly within the single-precision floating-point scheme!

power   exact value of the sum        floating-point
    1   5,000,050,000                 5.0000500e+09
    2   333,338,333,350,000           3.3333833e+14
    3   25,000,500,002,500,000,000    2.5000500e+19

We will use the seq command (which you may have to replace with jot in BSD Unix environments) to generate incrementing sequences and awk to raise those numbers to the requisite powers. For example,

$ seq 5 | awk '{ print $1, $1*$1, $1*$1*$1 }'
1 1 1
2 4 8
3 9 27
4 16 64
5 25 125

The program sumdata.cpp reads values from cin as type float and reports their sum based on the functions provided in add_methods.cpp. For everything to work correctly, you’ll have to fill in the missing function bodies in that file. The methods provided include summing the terms in increasing and decreasing order, summing the sorted list pairwise (as we discussed in class), summing in conventional order with a high-precision (long double) accumulator variable, and a compensated summation scheme due to Kahan. For those methods that require sorting, make use of the STL sort function provided by the algorithm header. You can read about compensated summation in the references below.

See also

Online references:


When you’re done, you should be able to reproduce the following session exactly.

$ make sumdata
g++    -c -o sumdata.o sumdata.cpp
g++    -c -o add_methods.o add_methods.cpp
g++ -o sumdata sumdata.o add_methods.o -O2 -ansi -pedantic -Wall -lm
$ seq 100000 | ./sumdata
I've read in 100000 elements
                                  s/eeeeeeee/fffffffffffffffffffffff
conventional sum: 4.9999903e+09 : 0 10011111 00101010000001011100110
  increasing sum: 4.9999903e+09 : 0 10011111 00101010000001011100110
  decreasing sum: 5.0000865e+09 : 0 10011111 00101010000001110100010
    pairwise sum: 5.0000497e+09 : 0 10011111 00101010000001101011010
 compensated sum: 5.0000502e+09 : 0 10011111 00101010000001101011011
  high-prec. sum: 5.0000502e+09 : 0 10011111 00101010000001101011011

Note that as many as nine binary digits in the fraction field may disagree depending on the method. The last two methods agree and produce values quite close to the exact result.

$ seq 100000 | awk '{ print $1*$1 }' | ./sumdata
I've read in 100000 elements
                                   s/eeeeeeee/fffffffffffffffffffffff
conventional sum:  3.3333845e+14 : 0 10101111 00101111001010110110100
  increasing sum:  3.3333845e+14 : 0 10101111 00101111001010110110100
  decreasing sum:  3.3332671e+14 : 0 10101111 00101111001010001010110
    pairwise sum:  3.3333832e+14 : 0 10101111 00101111001010110110000
 compensated sum:  3.3333832e+14 : 0 10101111 00101111001010110110000
  high-prec. sum:  3.3333832e+14 : 0 10101111 00101111001010110110000
$ seq 100000 | awk '{ print $1*$1*$1 }' | ./sumdata
I've read in 100000 elements
                                   s/eeeeeeee/fffffffffffffffffffffff
conventional sum:  2.5000484e+19 : 0 10111111 01011010111100111001000
  increasing sum:  2.5000484e+19 : 0 10111111 01011010111100111001000
  decreasing sum:  2.4998857e+19 : 0 10111111 01011010111011011100100
    pairwise sum:  2.5000500e+19 : 0 10111111 01011010111100111001111
 compensated sum:  2.5000500e+19 : 0 10111111 01011010111100111001111
  high-prec. sum:  2.5000500e+19 : 0 10111111 01011010111100111001111

The following series has alternating, rapidly decaying terms. It should converge to 1/4.

$ seq 100 | awk '{ print (-1)**(i++)/3.0**i }' | ./sumdata
I've read in 79 elements
                                    s/eeeeeeee/fffffffffffffffffffffff
conventional sum:   2.4999973e-01 : 0 01111100 11111111111111111101110
  increasing sum:   2.4999970e-01 : 0 01111100 11111111111111111101100
  decreasing sum:   2.4999969e-01 : 0 01111100 11111111111111111101011
    pairwise sum:   2.4999970e-01 : 0 01111100 11111111111111111101100
 compensated sum:   2.4999972e-01 : 0 01111100 11111111111111111101101
  high-prec. sum:   2.4999972e-01 : 0 01111100 11111111111111111101101