Integer representations

Integers

The fact that computers are finite has important design implications. It means that computers can never faithfully represent the sets of integers or real numbers, both of which are infinite. Since these are generally what we work with in physics and mathematics, it’s important to understand how we approximately represent integers and reals on a computer. (What we’ll discover is that finite-fields are substituted for the integers and rationals for the reals.)

But first, let’s consider how we might represent numbers at all. What strategies could we employ? Historically, the most common are these two:

simple enumeration (/ // /// //// …)
grouping and labelling (e.g., Roman numerals I II III IV V X L C D M; 1998 = MCMXCVIII)

Neither, however, is suitable for computation. The first is grossly inefficient, requiring \(\mathsf{N}\) bits to store numbers as large as \(\mathsf{N}\). The second requires an ever-growing family of new symbols to represent large values. Moreover, we need a systematic and extensible representation in which basic arithmetic operations are mechanistic. The usual solution to this dilema is the following.

positional number systems: (\(\mathsf{a_3a_2a_1a_0.a_{-1}a_{-2}}\)) base \(\mathsf{b}\) with \(\mathsf{0 \le a_k < b}\)

By convention, the leading (high) digit is most significant and the trailing (low) digit the least. The part to the right of the radix point is understood to be fractional. Our conventional decimal number system corresponds to \(\mathsf{b = 10}\) with digits \(\mathsf{a_k \in \{ 0,1,2,\ldots,9 \}}\). Bases \(\mathsf{b = 2, 8, 16}\) (all powers of two) are the most commonly used in computer science, since computers use the binary representation in hardware.

base	common name	math name	example	decimal conversion
2	binary		10010111₂	167
8	octal	octonal	1735₈	989
10	decimal		234	234
16	hexadecimal	sexadecimal	3F7A₁₆	16250
60	sexagesimal		23 44’ 12”	23.736666

The conversion to decimal can be carried out by summing powers.

\[\begin{aligned} \mathsf{10010111_2} &= \mathsf{2^7+ 2^4 + 2^2 + 2^1 + 2^0 = 167} \\ \mathsf{1735_8} &= \mathsf{1\times 8^3 + 7\times 8^2+ 3\times 8^1 + 5\times 8^ 0 = 989} \end{aligned}\]

Note that for bases greater than 10, we run out of arabic numerals. The convention is to fill out the missing digits using roman letters.

\[\mathsf{a_k \in \{ 0,1,2,{\cdots},9,\text{A},\text{B},\text{C},\text{D},\text{E},\text{F} \}}\]

Hence,

\[\mathsf{\textsf{3F7A}_{16} = 3\times 16^3 + 15\times 16^2 + 7\times 16^1 + 10\times 16^0 = 16250}\]

Of course, for base \(\mathsf{b > 36}\) we run out of symbols, and the notation becomes idiosyncratic. The convention for fractions of degrees is to use an increasing number of primes to mark places:

\[\mathsf{23\ 44'\ 12'' = 23 + 44\times 60^{-1} + 12\times 60^{-2} = 23.736666}\]

Binary is not as foreign as it first seems. Humans have invented many binary number systems; good examples are the western system of musical notation (1 whole note = 2 half notes = 4 quarter notes = 8 eights notes, etc.) and the British system of weights and measures (1 gallon = 2 pottles = 4 quarts = 8 pints, etc.).

Signed and unsigned binary numbers

A fixed-width binary number is a sequence of \(\mathsf{N}\) bits. The smallest possible number is \(\mathsf{0}\) and the largest is \(\mathsf{2^N-1}\). For example, with eight bits the numbers \(\mathsf{0}\) to \(\mathsf{255}\) are represented by the \(\mathsf{2^8}\) unique patterns of \(\mathsf{0}\) and \(\mathsf{1}\).

0	0 0 0 0 0 0 0 0
1	0 0 0 0 0 0 0 1
2	0 0 0 0 0 0 1 0
3	0 0 0 0 0 0 1 1
255	1 1 1 1 1 1 1 1

Negative numbers can be represented using what’s called the two’s complement scheme. Here, the numbers \(\mathsf{\{0,1,\ldots,255\}}\) are reinterpreted as \(\mathsf{\{-128,-127,\ldots,127\}}\) with the leading (most significant) bit signalling that \(\mathsf{x \to x-256}\).

two’s complement	bit pattern	unsigned binary
−128	1 0 0 0 0 0 0 0	128
−3	1 1 1 1 1 1 0 1	253
−2	1 1 1 1 1 1 1 0	254
−1	1 1 1 1 1 1 1 1	255
0	0 0 0 0 0 0 0 0	0
1	0 0 0 0 0 0 0 1	1
2	0 0 0 0 0 0 1 0	2
3	0 0 0 0 0 0 1 1	3
127	0 1 1 1 1 1 1 1	127

Under two’s complement, the high bit effectively contains the sign information. Still, this is quite different than having the high bit represent the sign explicitly and the remaining low bits the magnitude. Under that scheme we would have the numbers

\[\mathsf{-127, -126, \ldots, -2, -1, -0, +0, 1, 2, \ldots, 127}\]

Note that there are two representations of zero. Two’s complement has the advantage that the zero is unique.

A more important advantage is that addition and subtraction of two’s complement numbers can be carried out in exactly the same way as for unsigned binary numbers. The result of the following sum is shown in grey; the carry bits are shown in red. The addition operation can equally be interpreted as acting on signed or unsigned integers.

\[\begin{aligned} &\ \ {\color{red} \mathsf{1}}\\ &\mathsf{0011}\\ \mathsf{+}&\mathsf{1010}\\ \hline &\mathsf{1101} \end{aligned}\]

This 4-digit binary computation can be interpreted in two ways: unsigned \(\mathsf{3 + 10 = 13}\) or two’s complement \(\mathsf{3 + (-6) = -3}\).

Limitations

Fixed width binary numbers can only represent a limited range of integers. The result of an operation (such as addition or multiplication) performed on a pair of representable integers may not be representable itself. This condition is called overflow.

The next example provides a simple demonstration. The overflow result can be understood in terms of arithmetic modulo \(\mathsf{2^4 = 16}\).

\[\begin{aligned} {\color{red} \mathsf{1}}& \ \ \ \ {\color{red} \mathsf{1}}\\ &\mathsf{1001}\\ \mathsf{+}&\mathsf{1101}\\ \hline &\mathsf{0110} \end{aligned} \]

For unsigned binary, \(\mathsf{9 + 13 = (overflow)\ 6}\), which is \(\mathsf{22\ \text{mod}\ 16}\). For two’s complement, \(\mathsf{(-7) + (-3) = (overflow)\ 6}\), which is \(\mathsf{-10\ \text{mod}\ 16}\).

C++ integer types

The integer types provided by C++ do not have a guaranteed width. Rather, only their relative sizes are enforced.

assert( sizeof(char) <= sizeof(short) and
        sizeof(short) <= sizeof(int) and
        sizeof(int) <= sizeof(long) );

The sizes, however, are standardized across 32-bit Intel machines.

C++ integer types on the (IA32) Intel architecture
type	width
`char`	8 bits, 1 byte
`short int`	16 bits, 2 bytes, 1 word
`int`, `long int`	32 bits, 4 bytes, 1 double word

By default, the integer types are signed. On almost all architectures, they are represented internally using the two’s complement scheme. Unsigned binary versions can be specified by prepending the unsigned keyword: e.g., unsigned char, unsigned short int, unsigned int, unsigned long int.

From 2011 on, thanks to the C++11 language update, we can now specify signed and unsigned integer types of fixed bit width.

#include <cstdint>
assert( sizeof(int8_t) == 1 );
assert( sizeof(int16_t) == 2 );
assert( sizeof(int32_t) == 3 );
assert( sizeof(int64_t) == 4 );
assert( sizeof(uint8_t) == 1 );
assert( sizeof(uint16_t) == 2 );
assert( sizeof(uint32_t) == 3 );
assert( sizeof(uint64_t) == 4 );

Python uses a different approach. It supports unlimited-precision integers. This is done in software rather than hardware, so it’s slower; but not having to worry about overflow can be very convenient.