1.1. Floating-point numbers#

The real number set \(\real\) is infinite in two ways: it is unbounded and continuous. In most practical computing, the latter kind of infiniteness is much more consequential than the former, so we turn our attention there first.

Definition 1.1.1 :  Floating-point numbers

The set \(\float\) of floating-point numbers consists of zero and all numbers of the form

(1.1.1)#\[ \pm (1 + f) \times 2^n,\]

where \(n\) is an integer called the exponent, and \(1+f\) is the mantissa or significand, in which

(1.1.2)#\[ f = \sum_{i=1}^d b_i \, 2^{-i}, \qquad b_i\in\{0,1\},\]

for a fixed integer \(d\) called the binary precision.

Equation (1.1.2) represents the significand as a number in \([1,2)\) in base-2 form. Equivalently,

(1.1.3)#\[f = 2^{-d}\, \sum_{i=1}^{d} b_{i} \, 2^{d-i} = 2^{-d} z\]

for an integer \(z\) in the set \(\{0,1,\ldots,2^d-1\}\). Consequently, starting at \(2^n\) and ending just before \(2^{n+1}\) there are exactly \(2^d\) evenly spaced numbers belonging to \(\float\).

Example 1.1.2

Suppose \(d=2\). Taking \(n=0\) in (1.1.1), we enumerate

\[ 1 + \frac{0}{4}, \: 1 + \frac{1}{4}, \: 1 + \frac{2}{4}, \: 1 + \frac{3}{4}. \]

These are the only members of \(\float\) in the semi-closed interval \([1,2)\), and they are separated by spacing \(\tfrac{1}{4}\).

Taking \(n=1\) doubles each of the values in the list above, and \(n=-1\) halves them. These give the floating-point numbers in \([2,4)\) and \([1/2,1)\), respectively. The spacing between them also is doubled and halved, respectively.
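
The same enumeration is easy to reproduce on a computer. Here is a minimal Julia sketch (the helper name toyfloats is invented just for this illustration) that lists the members of \(\float\) for \(d=2\) and a few exponents:

# Elements (1 + z/2^d) * 2^n for z = 0, 1, ..., 2^d - 1, as in (1.1.1)–(1.1.3).
toyfloats(d, n) = [(1 + z / 2^d) * 2.0^n for z in 0:2^d-1]

@show toyfloats(2, 0);    # the four members in [1, 2)
@show toyfloats(2, 1);    # members in [2, 4), with doubled spacing
@show toyfloats(2, -1);   # members in [1/2, 1), with halved spacing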

Observe that the smallest element of \(\float\) that is greater than 1 is \(1+2^{-d}\), and we call the difference machine epsilon.1

Definition 1.1.3 :  Machine epsilon

For a floating-point set with \(d\) binary digits of precision, machine epsilon (or machine precision) is \(\macheps = 2^{-d}\).

We define the rounding function \(\fl(x)\) as the map from real number \(x\) to the nearest member of \(\float\). The distance between the floating-point numbers in \([2^n,2^{n+1})\) is \(2^n\macheps=2^{n-d}\). As a result, every real \(x \in [2^n,2^{n+1})\) is no farther than \(2^{n-d-1}\) away from a member of \(\float\). Therefore we conclude that \(|\fl(x)-x| \le \tfrac{1}{2}(2^{n-d})\), which leads to the bound

(1.1.4)#\[\frac{|\fl(x)-x|}{|x|} \le \frac{2^{n-d-1}}{2^n} = \tfrac{1}{2}\macheps.\]

In words, every real number is represented with a uniformly bounded relative precision. Inequality (1.1.4) holds true for negative \(x\) as well. In Exercise 2 you are asked to show that an equivalent statement is that

(1.1.5)#\[ \fl(x)=x(1+\epsilon) \quad \text{for some $|\epsilon|\le \tfrac{1}{2}\macheps$.}\]

The value of \(\epsilon\) depends on \(x\), but this dependence is not usually shown explicitly.
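
To make the rounding function concrete, here is a minimal Julia sketch of \(\fl\) at a toy precision; the name toyfl is invented here, and the code handles only positive \(x\). It also checks the bound (1.1.4) for one input.

# Round positive x to the nearest member of the float set with precision d.
function toyfl(x, d)
    n = floor(Int, log2(x))            # exponent with 2^n ≤ x < 2^(n+1)
    z = round((x / 2.0^n - 1) * 2^d)   # nearest integer z so that f = z/2^d
    return (1 + z / 2^d) * 2.0^n
end

x, d = 1/10, 4
relerr = abs(toyfl(x, d) - x) / abs(x)
@show toyfl(x, d), relerr, relerr ≤ 2.0^(-d) / 2;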

Precision and accuracy#

It may help to recast (1.1.1) and (1.1.2) in terms of base 10:

\[ \pm \left(b_0 + \sum_{i=1}^d b_i \, 10^{-i} \right) \times 10^n = \pm (b_0.b_1b_2\cdots b_d) \times 10^n, \]

where each \(b_i\) is in \(\{0,1,\ldots,9\}\) and \(b_0\neq 0\). This is simply scientific notation with \(d+1\) significant digits. For example, Planck’s constant is \(6.626068\times 10^{-34}\) m\({}^2\cdot\)kg/sec to seven digits. If we alter just the last digit from \(8\) to \(9\), the relative change is

\[ \frac{0.000001\times 10^{-34}}{6.626068\times 10^{-34}} \approx 1.51\times 10^{-7}. \]

We therefore say that the constant is given with 7 decimal digits of precision. That’s in contrast to noting that the value is given to 40 decimal places. A major advantage of floating point is that the relative precision does not depend on the choice of physical units. For instance, when expressed in eV\(\cdot\)sec, Planck’s constant is \(4.135668\times 10^{-15}\), which still has 7 digits but only 21 decimal places.
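
The relative change quoted above is a one-line computation; the values below are just the two seven-digit numbers from the text.

h_old = 6.626068e-34                        # m^2⋅kg/sec, seven significant digits
h_new = 6.626069e-34                        # last digit altered from 8 to 9
@show abs(h_new - h_old) / h_old;           # about 1.51e-7
@show -log10(abs(h_new - h_old) / h_old);   # roughly 6.8, i.e., about 7 digits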

Floating-point precision functions the same way, except that computers prefer base 2 to base 10. The precision of a floating-point number is always \(d\) binary digits, implying a resolution of the real numbers according to (1.1.4).

It can be easy to confuse precision with accuracy, especially when looking at the result of a calculation on the computer. Every result is computed and represented using \(d\) binary digits, but not all of those digits may accurately represent an intended value. Suppose \(x\) is a number of interest and \(\tilde{x}\) is an approximation to it. The absolute accuracy of \(\tilde{x}\) is

\[|\tilde{x} - x|,\]

while the relative accuracy is

\[\frac{|\tilde{x} - x|}{|x|}.\]

Absolute accuracy has the same units as \(x\), while relative accuracy is dimensionless. We can also express the relative accuracy as the number of accurate digits, computed in base 10 as

(1.1.6)#\[-\log_{10} \left| \frac{\tilde{x}-x}{x} \right|.\]

We often round this value down to an integer, but it does make sense to speak of “almost seven digits” or “ten and a half digits.”

Demo 1.1.4

Recall the grade-school approximation to the number \(\pi\).

@show p = 22/7;
p = 22 / 7 = 3.142857142857143
@show float(π);
float(π) = 3.141592653589793
acc = abs(p-π)
println("absolute accuracy = $acc")
println("relative accuracy = $(acc/π)")
absolute accuracy = 0.0012644892673496777
relative accuracy = 0.0004024994347707008
println("Number of accurate digits = $(-log10(acc/π))")
Number of accurate digits = 3.3952347251747166

This last value could be rounded down by using floor.

Double precision#

Most numerical computing today is done according to the IEEE 754 standard. It defines single precision, with \(d=23\) binary digits for the fractional part \(f\), and the more commonly used double precision, with \(d=52\). In double precision,

(1.1.7)#\[\macheps = 2^{-52} \approx 2.2\times 10^{-16}.\]

We often speak of double-precision floating-point numbers as having about 16 decimal digits. The 52 bits storing \(f\) are joined by a sign bit and 11 binary bits representing the exponent \(n\) in (1.1.1), for a total of 64 binary bits per floating-point number.

Demo 1.1.5

In Julia, 1 and 1.0 are different values, because they have different types:

@show typeof(1);
@show typeof(1.0);
typeof(1) = Int64
typeof(1.0) = Float64

The standard choice for floating-point values is Float64, which is double precision using 64 binary bits. We can see all the bits by using bitstring.

bitstring(1.0)
"0011111111110000000000000000000000000000000000000000000000000000"

The first bit determines the sign of the number:

[bitstring(1.0), bitstring(-1.0)]
2-element Vector{String}:
 "0011111111110000000000000000000000000000000000000000000000000000"
 "1011111111110000000000000000000000000000000000000000000000000000"

The next 11 bits determine the exponent (scaling) of the number, and so on.

[bitstring(1.0), bitstring(2.0)]
2-element Vector{String}:
 "0011111111110000000000000000000000000000000000000000000000000000"
 "0100000000000000000000000000000000000000000000000000000000000000"

The sign bit, exponent, and significand in (1.1.1) are all directly accessible.

x = 1.0
@show sign(x),exponent(x),significand(x);
(sign(x), exponent(x), significand(x)) = (1.0, 0, 1.0)
x = 0.125
@show sign(x),exponent(x),significand(x);
(sign(x), exponent(x), significand(x)) = (1.0, -3, 1.0)
eps()
2.220446049250313e-16

Because double precision allocates 52 bits to the fractional part \(f\) of the significand, the default value of machine epsilon is \(2^{-52}\).

log2(eps())
-52.0

The spacing between adjacent floating-point values is proportional to the magnitude of the value itself. This is how relative precision is kept roughly constant throughout the range of values. You can get the adjusted spacing by calling eps with a value.

eps(1.618)
2.220446049250313e-16
eps(161.8)
2.842170943040401e-14
nextfloat(161.8)
161.80000000000004
ans - 161.8
2.842170943040401e-14

A common mistake is to think that \(\macheps\) is the smallest floating-point number. It measures the spacing of floating-point values near 1, not the smallest representable magnitude. The correct perspective is that the scaling of values is limited by the exponent, not the mantissa. The actual range of positive values in double precision is

@show floatmin(),floatmax();
(floatmin(), floatmax()) = (2.2250738585072014e-308, 1.7976931348623157e308)

For the most part you can mix integers and floating-point values and get what you expect.

1/7
0.14285714285714285
37.3 + 1
38.3
2^(-4)
0.0625

There are some exceptions. A floating-point value can’t be used as an index into an array, for example, even if it is numerically equal to an integer. In such cases you use Int to convert it.

@show 5.0,Int(5.0);
(5.0, Int(5.0)) = (5.0, 5)

If you try to convert a noninteger floating-point value into an integer, you get an InexactError. This occurs whenever you try to force a type conversion that cannot be carried out exactly.

Our theoretical description of \(\float\) did not place limits on the exponent, but in double precision its range is limited to \(-1022\le n \le 1023\). Thus, the largest number is just short of \(2^{1024}\approx 2\times 10^{308}\), which is enough in most applications. Results that should be larger are said to overflow and will actually result in the value Inf. Similarly, the smallest positive number is \(2^{-1022}\approx 2\times 10^{-308}\), and smaller values are said to underflow to zero.2
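
Both behaviors are easy to trigger; the operands below are chosen only to push past the exponent range.

@show floatmax() * 2;     # overflows to Inf
@show 2.0^1024;           # just beyond the largest double, also Inf
@show 1e-300 * 1e-30;     # product is below the underflow limit, so it becomes 0.0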

Note the crucial difference between \(\macheps=2^{-52}\), which is the distance between 1 and the next larger double-precision number, and \(2^{-1022}\), which is the smallest positive double-precision number. The former has to do with relative precision, while the latter is about absolute precision. Getting close to zero always requires a shift in thinking to absolute precision because any finite error is infinite relative to zero.

One more double-precision value is worthy of note: NaN, which stands for Not a Number. It is the result of an undefined arithmetic operation such as 0/0.
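
For example:

@show 0 / 0;          # integer operands, but / always returns a float: NaN
@show Inf - Inf;      # another indeterminate result, also NaN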

Floating-point arithmetic#

Computer arithmetic is performed on floating-point numbers and returns floating-point results. We assume the existence of machine-analog operations for real functions such as \(+\), \(-\), \(\times\), \(/\), \(\sqrt{\quad}\), and so on. Without getting into the details, we will suppose that each elementary machine operation creates a floating-point result whose relative error is bounded by \(\macheps\). For example, if \(x\) and \(y\) are in \(\float\), then for machine addition \(\oplus\) we have the bound

\[\frac{ |(x \oplus y)-(x+y)| }{ |x+y| } \le \macheps.\]

Hence the relative error in arithmetic is essentially the same as for the floating-point representation itself. However, playing by these rules can lead to disturbing results.
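
As a sanity check of this error model (not a proof), the following sketch compares double-precision sums against BigFloat sums, which are effectively exact for two double operands, and reports the worst relative error found; the sample size and use of rand are arbitrary choices.

# Worst observed relative error of Float64 addition over random operands.
add_relerr(x, y) = Float64(abs((x + y) - (big(x) + big(y))) / abs(big(x) + big(y)))
worst = maximum(add_relerr(rand(), rand()) for _ in 1:10_000)
@show worst, worst ≤ eps();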

Demo 1.1.6

There is no double-precision number between \(1\) and \(1+\macheps\). Thus the following difference is zero despite its appearance.

e = eps()/2
(1.0 + e) - 1.0
0.0

However, the spacing between floats in \([1/2,1)\) is \(\macheps/2\), so both \(1-\macheps/2\) and its negative are represented exactly:

1.0 + (e - 1.0)
1.1102230246251565e-16

This is now the expected result. But we have found a rather shocking breakdown of the associative law of addition!

There are two ways to look at Demo 1.1.6. On one hand, its two versions of the result differ by less than \(1.2\times 10^{-16}\), which is very small — not just in everyday terms, but with respect to the operands, which are all close to 1 in absolute value. On the other hand, the difference is as large as the exact result itself! We formalize and generalize this observation in the next section. In the meantime, keep in mind that exactness cannot be taken for granted in floating-point computation.

Observation 1.1.7

We should not expect that two mathematically equivalent results will be equal when computed in floating point, only that they be relatively close together.
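
In practice, this observation is why computed results are compared using a relative tolerance rather than exact equality. Julia's built-in isapprox, written infix as ≈, performs such a comparison; the operands below are arbitrary.

@show 0.1 + 0.2 == 0.3;    # false: the two sides round to different floats
@show 0.1 + 0.2 ≈ 0.3;     # true: they agree to within a relative tolerance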

Exercises#

Note

Exercises marked with ✍ are intended to be done by hand or with the aid of a simple calculator. Exercises marked with ⌨ are intended to be solved using a computer.

  1. ✍ Consider a floating-point set \(\float\) defined by (1.1.1) and (1.1.2) with \(d=4\).

    (a) How many elements of \(\float\) are there in the real interval \([1/2,4]\), including the endpoints?

    (b) What is the element of \(\float\) closest to the real number \(1/10\)? (Hint: Find the interval \([2^n,2^{n+1})\) that contains \(1/10\), then enumerate all the candidates in \(\float\).)

    (c) What is the smallest positive integer not in \(\float\)? (Hint: For what value of the exponent does the spacing between floating-point numbers become larger than 1?)

  2. ✍ Prove that (1.1.4) is equivalent to (1.1.5). This means showing first that (1.1.4) implies (1.1.5), and then separately that (1.1.5) implies (1.1.4).

  3. ⌨ There are much better rational approximations to \(\pi\) than 22/7 as used in Demo 1.1.4. For each one below, find its absolute and relative accuracy, and (rounding down to an integer) the number of accurate digits.

    (a) 355/113

    (b) 103638/32989

  4. ✍ IEEE 754 single precision specifies that 23 binary bits are used for the value \(f\) in the significand \(1+f\) in (1.1.2). Because they need less storage space and can be operated on more quickly than double-precision values, single-precision values can be useful in low-precision applications. (They are supported as type Float32 in Julia.)

    (a) In base-10 terms, what is the first single-precision number greater than \(1\) in this system?

    (b) What is the smallest positive integer that is not a single-precision number? (See the hint to Exercise 1.)

  5. ⌨ Julia defines a function nextfloat that gives the next-larger floating-point value of a given number. What is the next float past floatmax()? What is the next float past -Inf?


1

The terms machine epsilon, machine precision, and unit roundoff aren’t used consistently across references, but the differences are not consequential for our purposes.

2

Actually, there are some still-smaller denormalized numbers that have less precision, but we won’t use that level of detail.