Lab 3
Get the code
Find the Lab 3 code in the class repository.
$ ls
README.md classes/ labs/ scripts/
$ ls labs/lab3
README cpp/ julia/ python/ rust/
$ ls labs/lab3/cpp
add_methods.cpp bitwise.cpp fp.cpp sumdata.cpp
add_methods.h curves.cpp makefile view-sum.gp
bitconvert.h denorm.cpp max.cpp
The instructions here focus on C++, but you may carry out the comparable steps in Julia, Python, or Rust. You only need to submit your solutions for a single programming language.
Curves
The lab3
directory contains a short C++ program named curves.cpp
that instructs the computer to output \(\mathsf{x}\) and
\(\mathsf{x^2 + 2x + 3}\) in two columns (of width 20 characters)
with \(\mathsf{x}\) ranging from −10 to 10 in increments of 0.2. The
make
command invokes the compiler, which creates an executable named
curves
.
make curves
g++ -o curves curves.cpp -O -ansi -pedantic -Wall
$ ls
add_methods.cpp bitwise.cpp denorm.cpp max.cpp
add_methods.h curves* fp.cpp sumdata.cpp
bitconvert.h curves.cpp makefile view-sum.gp
The program outputs directly to the terminal. There are likely more
lines of output from curves
than there are lines in your terminal
window. You can scroll through the output by piping it to the
more
command.
$ ./curves
-10 83
-9.8 79.44
-9.6 75.96
-9.4 72.56
. .
. .
. .
-0.4 2.36
-0.2 2.64
-2.05391e-15 3
0.2 3.44
0.4 3.96
. .
. .
. .
9.4 110.16
9.6 114.36
9.8 118.64
10 123
$ ./curves | more
Notice that we don’t get exactly \(\mathsf{x=0}\). Why might that be?
Use the shell’s redirection operator (>
) to save the curves
output
to a file named poly.dat
. Use tail
to inspect the last five
lines of the file. The data can be viewed graphically in a program
called gnuplot
.
$ ./curves > poly.dat
$ tail -n5 poly.dat
9.2 106.04
9.4 110.16
9.6 114.36
9.8 118.64
10 123
$ gnuplot
> plot "poly.dat"
> plot "poly.dat" with lines
> plot "poly.dat", x**2 + 2*x + 3
> quit
The plot
command will display both data from a file and arithmetic
expressions in x
. The with
directive sets the style for the plot.
You can see all the possible styles by typing help with
at the
gnuplot
prompt. Note that gnuplot
has an exponentiation operator
(**
), whereas C++ does not.
Open the curves.cpp
file in your favourite text editor and modify it
so that the program now outputs three columns of data containing
\(\mathsf{x}\), \(\mathsf{p(x)}\), and \(\mathsf{q(x)}\),
where \(\mathsf{p(x)} = \mathsf{x^2 + 2x + 3}\) and
\(\mathsf{q(x)} = \mathsf{-\tfrac{1}{8}x^4 - \tfrac{1}{2}x^3
+ x + 2}\). (Warning: the expressions 1/8
and 1.0/8
do not
evaluate to the same number!) You will have to write a new polynom
function to handle the fourth-order polynomial. While you’re at it,
change the range to \(\mathsf{-15} \le \mathsf{x} \le
\mathsf{15}\) and the increment to 0.1. Check that you can exactly
reproduce the session below.
$ make
g++ -o curves curves.cpp -O -ansi -pedantic -Wall
$ ./curves > poly.dat
$ head -n3 poly.dat
-15 198 -4653.62
-14.9 195.21 -4519.98
-14.8 192.44 -4389.22
$ tail -n3 poly.dat
14.8 251.64 -7601.41
14.9 254.81 -7798.13
15 258 -7998.62
$ gnuplot
> plot "poly.dat"
> plot "poly.dat", x**2 + 2*x + 3
> plot "poly.dat" using 1:2, x**2 + 2*x + 3
> plot "poly.dat" using 1:3, -0.125*x**4 - 0.5*x**3 + x + 2
> plot "poly.dat" using 1:2 with line, "poly.dat" using 1:3 with lines
> plot[-5:5] "poly.dat" using 1:2 with line, "poly.dat" using 1:3 with lines
> plot[-4:1][0:10] "poly.dat" using 1:2 with line, "poly.dat" using 1:3 with lines
> plot[-3:1] "poly.dat" using 1:(8*$3-2*$2), -x**4-4*x**3-2*x**2+4*x+10
> quit
The range is set explicitly by including [xmin:xmay][ymin:ymax]
. In
the final plot
command shown above, the using 1:(8*$3-2*$2)
directive says to plot column 1 on the x-axis and 8 times column 3
minus 2 times column 2 on the y-axis. This linear combination
corresponds to the polynomial
By default, on my machine at least, gnuplot displays its output in a qt window. (On legacy unix systems, the default gnuplot terminal may be x11.) You can specify that the output be sent to a file instead (in any one of many possible graphics formats). For example, the following session creates a plot in Portable Document Format.
$ gnuplot
> plot[-3:1] "poly.dat" using 1:(8*$3-2*$2), -x**4-4*x**3-2*x**2+4*x+10
> set terminal pdfcairo
> set output 'poly.pdf'
> replot
> quit
$ ${YOUR_PDF_READER} poly.pdf
gnuplot
can fit numerical data to a given analytical form. In the
following session, a general fourth-order polynomial f(x)
is defined
with five free parameters. Those free parameters are passed to the fit
command using via
.
$ gnuplot
> plot[-3:1] "poly.dat" using 1:(8*$3-2*$2)
> f(x) = a*x**4 + b*x**3 + c*x**2 + d*x + e
> fit[-3:1] f(x) "poly.dat" using 1:(8*$3-2*$2) via a,b,c,d,e
> plot[-3:1] "poly.dat" using 1:(8*$3-2*$2), f(x)
> quit
Find the best fit to a general third-order polynomial over the range
[-3:1]
. The fit should be quite bad. Now redo the fit for
a fifth-order polynomial. What is the value of the \(\mathsf{x^5}\)
coefficient? Why isn’t it exactly zero?
Floating-point arithmetic
The program fp.cpp
is designed to showcase the rules for floating
point arithmetic—in particular, how the special values inf
,
-inf
, and nan
are generated and how they propagate through
arithmetic calculations. Make the necessary changes to the program so
that it produces the output shown below:
$ make fp
g++ -o fp fp.cpp -O2 -ansi -pedantic -Wall -lm
$ ./fp
Floating point division:
1 / 0 = inf
-1 / 0 = -inf
0 / 0 = nan
inf / 0 = inf
-inf / 0 = -inf
inf / inf = nan
Floating point multiplication:
1 * 0 = 0
-1 * 0 = -0
1 * inf = inf
-1 * inf = -inf
0 * inf = nan
-inf * 0 = nan
inf * inf = inf
inf * -inf = -inf
1 * nan = nan
-1 * nan = nan
0 * nan = nan
-0 * nan = nan
Floating point addition and subtraction:
1 + inf = inf
1 - inf = -inf
inf + inf = inf
inf - inf = nan
1 + nan = nan
1 - nan = nan
Can you think of a good reason why both 0
and inf
should carry
sign information?
Be aware that languages other than C++ may flag some of these operations as
errors or exceptions. For instance, dividing by zero in Python triggers a
ZeroDivisionError
exception, which must be manually caught lest it
crash the program:
a=1
b=0
try:
value = a/b
except ZeroDivisionError:
value = float('Inf')
print("{}/{} = {}".format(a,b,value))
Underlying bit patterns
The program bitwise.cpp
displays the underlying 32-bit binary
patterns that encode the floating point representations of
\(\mathsf{1}\), \(\mathsf{3}\), and \(\mathsf{-18}\). The
header bitconvert.h
loads in a special type named convert32_t
,
which can be interpreted both as a float
and as a 32-bit unsigned
integer type (uint32_t
). The values \(\mathsf{3F800000_{16}}\),
\(\mathsf{40400000_{16}}\), and \(\mathsf{C1900000_{16}}\) are
directly assigned to its .I32
data member; the corresponding values
1.0F
, 3.0F
, and -18.0F
are accessed from .F32
. C++ uses
the convention that numbers beginning with the 0x
prefix are
interpreted in hexadecimal (base~16).
Extend the program so that it also outputs the bit encoding for
214.25F
and -inf
. Those values should be assigned to .I32
using the appropriate hexadecimal value. You should make use of the
identity
An infinite value is designated by all 1s in the exponent field and all 0s in the fraction field.
$ make bitwise
g++ -o bitwise bitwise.cpp -O2 -ansi -pedantic -Wall -lm
$ ./bitwise
Single-precision floating point:
s/eeeeeeee/fffffffffffffffffffffff
0 01111111 00000000000000000000000
1
s/eeeeeeee/fffffffffffffffffffffff
0 10000000 10000000000000000000000
3
s/eeeeeeee/fffffffffffffffffffffff
1 10000011 00100000000000000000000
-18
s/eeeeeeee/fffffffffffffffffffffff
0 10000110 10101100100000000000000
214.25
s/eeeeeeee/fffffffffffffffffffffff
1 11111111 00000000000000000000000
-inf
Overflow
Compile the program max.cpp
and pipe the resulting executable to
tIIe more
command. This will allow you to scroll through the results
(page-by-page using the space bar, and line-by-line using the up and
down arrow keys).
This program computes the sequence \(\mathsf{1,10,100,1000,\ldots}\)
using floating point numbers of varying width. Convince yourself that
the largest finite numerical values of type float
, double
, and
long double
are on the order of \(\mathsf{10^{38}}\),
\(\mathsf{10^{308}}\), and \(\mathsf{10^{4932}}\), respectively.
(The size of the long double
is architecture dependent, and it may
differ on your machine.) Notice that there is no loss of precision in the
sequence until an element is no longer representable.
In Julia, you can directly specify variables to have single or double precision with a type specifyier: e.g., a::Float32
and b::Float64
.
Underflow
Create a program file min.cpp
using max.cpp
as a template. It
may be convenient to start with a copy:
$ cp max.cpp min.cpp
$ vim min.cpp
The new program should compute the decreasing sequence
\(\mathsf{1,\tfrac{1}{10},\tfrac{1}{100},\tfrac{1}{1000},\ldots}\)
using the float
, double
, and long double
types. Have the
program indicate exact zeros in text form.
$ make min
g++ -o min min.cpp -O2 -ansi -pedantic -Wall -lm
$ ./min | more
1.00000e+00 1.00000000000000e+00 1.00000000000000000e+00
1.00000e-01 1.00000000000000e-01 1.00000000000000000e-01
1.00000e-02 1.00000000000000e-02 1.00000000000000000e-02
. . .
. . .
1.00000e-38 1.00000000000000e-38 1.00000000000000000e-38
1.00000e-39 1.00000000000000e-39 1.00000000000000000e-39
1.00001e-40 1.00000000000000e-40 1.00000000000000000e-40
9.99967e-42 1.00000000000000e-41 1.00000000000000000e-41
1.00053e-42 1.00000000000000e-42 1.00000000000000000e-42
9.94922e-44 1.00000000000000e-43 1.00000000000000000e-43
9.80909e-45 1.00000000000000e-44 1.00000000000000000e-44
1.40130e-45 1.00000000000000e-45 1.00000000000000000e-45
exact zero 1.00000000000000e-46 1.00000000000000000e-46
exact zero 1.00000000000000e-47 1.00000000000000000e-47
. . .
. . .
exact zero 1.00000000000002e-308 1.00000000000000000e-308
exact zero 1.00000000000002e-309 1.00000000000000001e-309
exact zero 9.99999999999997e-311 1.00000000000000001e-310
exact zero 9.99999999999948e-312 1.00000000000000001e-311
exact zero 9.99999999998465e-313 1.00000000000000001e-312
exact zero 1.00000000001329e-313 1.00000000000000001e-313
exact zero 9.99999999963881e-315 1.00000000000000001e-314
exact zero 9.99999998481684e-316 1.00000000000000001e-315
exact zero 9.99999983659714e-317 1.00000000000000001e-316
exact zero 1.00000023069254e-317 1.00000000000000001e-317
exact zero 9.99998748495600e-319 1.00000000000000001e-318
exact zero 9.99988867182683e-320 1.00000000000000001e-319
exact zero 9.99988867182683e-321 1.00000000000000001e-320
exact zero 9.98012604599318e-322 1.00000000000000001e-321
exact zero 9.88131291682493e-323 1.00000000000000001e-322
exact zero 9.88131291682493e-324 1.00000000000000001e-323
exact zero exact zero 1.00000000000000001e-324
exact zero exact zero 1.00000000000000001e-325
. . .
. . .
exact zero exact zero 1.00000000000000027e-4935
exact zero exact zero 1.00000000000000100e-4936
exact zero exact zero 9.99999999999990067e-4938
exact zero exact zero 1.00000000000006297e-4938
exact zero exact zero 9.99999999999698452e-4940
exact zero exact zero 9.99999999996053252e-4941
exact zero exact zero 9.99999999850245271e-4942
exact zero exact zero 9.99999999485725318e-4943
exact zero exact zero 1.00000001771172298e-4943
exact zero exact zero 9.99999981259727658e-4945
exact zero exact zero 9.99998523179914905e-4946
exact zero exact zero 9.99987587581319258e-4947
exact zero exact zero 9.99878231595362783e-4948
exact zero exact zero 9.98784671735798041e-4949
exact zero exact zero 9.84203873608268143e-4950
exact zero exact zero 1.09355985956474238e-4950
exact zero exact zero exact zero
The approach to zero is quite different from the approach to infinity that we encountered previously. Here, there is a steady loss of precision in the steps immediately preceding exact zero. Think about why that it is.
Denormalization
Modify the program denorm.cpp
so that the convert32_t
variable
S.F32 = 5.9E-37F
is repeatedly halved until all of its bits are
equal to zero. This process illustrates how loss of precision comes
about in so-called denormalized numbers.
$ make denorm
g++ -o denorm denorm.cpp -O2 -ansi -pedantic -Wall -lm
$ ./denorm | more
Single-precision floating point:
s/eeeeeeee/fffffffffffffffffffffff
0 00000110 10010001100010001000000
5.9e-37
s/eeeeeeee/fffffffffffffffffffffff
0 00000101 10010001100010001000000
2.95e-37
s/eeeeeeee/fffffffffffffffffffffff
0 00000100 10010001100010001000000
1.475e-37
s/eeeeeeee/fffffffffffffffffffffff
0 00000011 10010001100010001000000
7.375e-38
s/eeeeeeee/fffffffffffffffffffffff
0 00000010 10010001100010001000000
3.6875e-38
s/eeeeeeee/fffffffffffffffffffffff
0 00000001 10010001100010001000000
1.84375e-38
s/eeeeeeee/fffffffffffffffffffffff
0 00000000 11001000110001000100000
9.21875e-39
s/eeeeeeee/fffffffffffffffffffffff
0 00000000 01100100011000100010000
4.60938e-39
s/eeeeeeee/fffffffffffffffffffffff
0 00000000 00110010001100010001000
2.30469e-39
s/eeeeeeee/fffffffffffffffffffffff
0 00000000 00011001000110001000100
1.15234e-39
s/eeeeeeee/fffffffffffffffffffffff
0 00000000 00001100100011000100010
5.76172e-40
s/eeeeeeee/fffffffffffffffffffffff
0 00000000 00000110010001100010001
2.88086e-40
s/eeeeeeee/fffffffffffffffffffffff
0 00000000 00000011001000110001000
1.44042e-40
.
.
s/eeeeeeee/fffffffffffffffffffffff
0 00000000 00000000000000000000010
2.8026e-45
s/eeeeeeee/fffffffffffffffffffffff
0 00000000 00000000000000000000001
1.4013e-45
s/eeeeeeee/fffffffffffffffffffffff
0 00000000 00000000000000000000000
0
Tally up, tally down
Write a program sum.cpp
that computes the partial sum
\(\mathsf{\sum_{n=1}^N \tfrac{1}{n}}\) for each of
\(\mathsf{N}=\mathsf{10}, \mathsf{100}, \mathsf{1000}, \ldots,
\mathsf{100\,000\,000}\). Accumulate the terms in a
float
(i.e. single-precision, 32-bit) variable
in both ascending and descending order (with respect to
\(\mathsf{n}\)), and be sure to store the index
\(\mathsf{n}\) in an integer type wide enough to
accommodate the largest upper limit of the sum. The two
results should be written in columns alongside the
corresponding value of \(\mathsf{N}\). The
view-sum.gp
script performs a visual comparison with
the correct asymptotic behaviour,
Which summation direction is better and why?
$ make sum
g++ -o sum sum.cpp -O2 -ansi -pedantic -Wall -lm
$ ./sum
10 2.92897 2.92897
100 5.18738 5.18738
1000 7.48548 7.48547
10000 9.78761 9.7876
100000 12.0909 12.0902
1000000 14.3574 14.3927
10000000 15.4037 16.686
100000000 15.4037 18.8079
$ gnuplot -persist view-sum.gp
Accumulation strategies
In various scientific settings, we often need to accumulate numerical values and their powers (e.g., to compute the standard deviation of a large data set). When the values span a wide range of orders of magnitude, the way in which the numbers are summed can make a big difference in how much accuracy is retained. A particularly difficult case occurs when we try to sum non-converging partial series. Let’s compare to some exact results:
The values for \(\mathsf{N} = \mathsf{100\,000}\) are given in the table below. We discover that the second and third powers can’t even be represented exactly within the single-precision floating-point scheme!
power |
exact value of the sum |
floating-point |
1 |
5,000,050,000 |
5.0000500e+09 |
2 |
333,338,333,350,000 |
3.3333833e+14 |
3 |
25,000,500,002,500,000,000 |
2.5000500e+19 |
We will use the seq
command (which you may have to replace with
jot
in BSD Unix environments) to generate incrementing sequences and
awk
to raise those numbers to the requisite powers. For example,
$ seq 5 | awk '{ print $1, $1*$1, $1*$1*$1 }'
1 1 1
2 4 8
3 9 27
4 16 64
5 25 125
The program sumdata.cpp
reads values from cin
as type float
and reports their sum based on the functions provided in
add_methods.cpp
. For everything to work correctly, you’ll have to
fill in the missing function bodies in that file. The methods provided
include summing the terms in increasing and decreasing order, summing
the sorted list pairwise (as we discussed in class), summing in
conventional order with a high-precision (long double
) accumulator
variable, and a compensated summation scheme due to Kahan. For those
methods that require sorting, make use of the STL sort function provided by the
algorithm
header. You can read about compensated summation in the
refences below.
See also
- Online references:
(For the second link, you’ll have to click “Next” at the bottom to flip through the program listings referred to in the text.)
When you’re done, you should be able to reproduce the following session exactly.
$ make sumdata
g++ -c -o sumdata.o sumdata.cpp
g++ -c -o add_methods.o add_methods.cpp
g++ -o sumdata sumdata.o add_methods.o -O2 -ansi -pedantic -Wall -lm
$ seq 100000 | ./sumdata
I've read in 100000 elements
s/eeeeeeee/fffffffffffffffffffffff
conventional sum: 4.9999903e+09 : 0 10011111 00101010000001011100110
increasing sum: 4.9999903e+09 : 0 10011111 00101010000001011100110
decreasing sum: 5.0000865e+09 : 0 10011111 00101010000001110100010
pairwise sum: 5.0000497e+09 : 0 10011111 00101010000001101011010
compensated sum: 5.0000502e+09 : 0 10011111 00101010000001101011011
high-prec. sum: 5.0000502e+09 : 0 10011111 00101010000001101011011
Note that as many as nine binary digits in the fraction field may disagree depending on the method. The last two methods agree and produce values quite close to the exact result.
$ seq 100000 | awk '{ print $1*$1 }' | ./sumdata
I've read in 100000 elements
s/eeeeeeee/fffffffffffffffffffffff
conventional sum: 3.3333845e+14 : 0 10101111 00101111001010110110100
increasing sum: 3.3333845e+14 : 0 10101111 00101111001010110110100
decreasing sum: 3.3332671e+14 : 0 10101111 00101111001010001010110
pairwise sum: 3.3333832e+14 : 0 10101111 00101111001010110110000
compensated sum: 3.3333832e+14 : 0 10101111 00101111001010110110000
high-prec. sum: 3.3333832e+14 : 0 10101111 00101111001010110110000
$ seq 100000 | awk '{ print $1*$1*$1 }' | ./sumdata
I've read in 100000 elements
s/eeeeeeee/fffffffffffffffffffffff
conventional sum: 2.5000484e+19 : 0 10111111 01011010111100111001000
increasing sum: 2.5000484e+19 : 0 10111111 01011010111100111001000
decreasing sum: 2.4998857e+19 : 0 10111111 01011010111011011100100
pairwise sum: 2.5000500e+19 : 0 10111111 01011010111100111001111
compensated sum: 2.5000500e+19 : 0 10111111 01011010111100111001111
high-prec. sum: 2.5000500e+19 : 0 10111111 01011010111100111001111
The following series has alternating, rapidly decaying terms. It should converge to 2.5.
$ seq 100 | awk '{ print (-1)**(i++)/3.0**i }' | ./sumdata
I've read in 79 elements
s/eeeeeeee/fffffffffffffffffffffff
conventional sum: 2.4999973e-01 : 0 01111100 11111111111111111101110
increasing sum: 2.4999970e-01 : 0 01111100 11111111111111111101100
decreasing sum: 2.4999969e-01 : 0 01111100 11111111111111111101011
pairwise sum: 2.4999970e-01 : 0 01111100 11111111111111111101100
compensated sum: 2.4999972e-01 : 0 01111100 11111111111111111101101
high-prec. sum: 2.4999972e-01 : 0 01111100 11111111111111111101101