---
jupytext:
  cell_metadata_filter: -all
  formats: md:myst
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.10.3
kernelspec:
  display_name: Julia 1.7.1
  language: julia
  name: julia-fast
---

```{code-cell}
:tags: [remove-cell]
using FundamentalsNumericalComputation
FNC.init_format()
```

(section-intro-floating-point)=
# Floating-point numbers

The real number set $\real$ is infinite in two ways: it is unbounded and continuous. In most practical computing, the latter kind of infiniteness is much more consequential than the former, so we turn our attention there first.

```{index} ! floating-point numbers
```

::::{proof:definition} Floating-point numbers
The set $\float$ of **floating-point numbers** consists of zero and all numbers of the form

```{math}
:label: floatpoint
\pm (1 + f) \times 2^n,
```

where $n$ is an integer called the **exponent**, and $1+f$ is the **mantissa** or **significand**, in which

```{math}
:label: mantissa
f = \sum_{i=1}^d b_i \, 2^{-i}, \qquad b_i\in\{0,1\},
```

for a fixed integer $d$ called the binary **precision**.
::::

```{index} see: mantissa; significand
```

```{index} ! significand
```

Equation {eq}`mantissa` represents the significand as a number in $[1,2)$ in base-2 form. Equivalently,

```{math}
:label: fp-integer
f = 2^{-d}\, \sum_{i=1}^{d} b_{i} \, 2^{d-i} = 2^{-d} z
```

for an integer $z$ in the set $\{0,1,\ldots,2^d-1\}$. Consequently, starting at $2^n$ and ending just before $2^{n+1}$ there are exactly $2^d$ evenly spaced numbers belonging to $\float$.

(example-float-2bits)=
::::{proof:example}
Suppose $d=2$. Taking $n=0$ in {eq}`floatpoint`, we enumerate

$$
1 + \frac{0}{4}, \: 1 + \frac{1}{4}, \: 1 + \frac{2}{4}, \: 1 + \frac{3}{4}.
$$

These are the only members of $\float$ in the semi-closed interval $[1,2)$, and they are separated by spacing $\tfrac{1}{4}$. Taking $n=1$ doubles each of the values in the list above, and $n=-1$ halves them. These give the floating-point numbers in $[2,4)$ and $[1/2,1)$, respectively. The spacing between them also is doubled and halved, respectively.
::::

```{index} unit roundoff; see: machine epsilon, ! machine epsilon
```

Observe that the smallest element of $\float$ that is greater than 1 is $1+2^{-d}$, and we call the difference *machine epsilon*.[^macheps]

::::{proof:definition} Machine epsilon
For a floating-point set with $d$ binary digits of precision, **machine epsilon** (or *machine precision*) is $\macheps = 2^{-d}$.
::::

[^macheps]: The terms machine epsilon, machine precision, and unit roundoff aren't used consistently across references, but the differences are not consequential for our purposes.

We define the rounding function $\fl(x)$ as the map from real number $x$ to the nearest member of $\float$. The distance between the floating-point numbers in $[2^n,2^{n+1})$ is $2^n\macheps=2^{n-d}$. As a result, every real $x \in [2^n,2^{n+1})$ is no farther than $2^{n-d-1}$ away from a member of $\float$. Therefore we conclude that $|\fl(x)-x| \le \tfrac{1}{2}(2^{n-d})$, which leads to the bound

```{math}
:label: fpbound
\frac{|\fl(x)-x|}{|x|} \le \frac{2^{n-d-1}}{2^n} \le \tfrac{1}{2}\macheps.
```

In words, every real number is represented with a uniformly bounded relative precision. Inequality {eq}`fpbound` holds true for negative $x$ as well. In [Exercise 2](problem-fp-fprelative) you are asked to show that an equivalent statement is that

```{math}
:label: fpboundalt
\fl(x)=x(1+\epsilon) \quad \text{for some $|\epsilon|\le \tfrac{1}{2}\macheps$.}
```

The value of $\epsilon$ depends on $x$, but this dependence is not usually shown explicitly.

## Precision and accuracy

```{index} significant digits
```

It may help to recast {eq}`floatpoint` and {eq}`mantissa` in terms of base 10:

$$
\pm \left(b_0 + \sum_{i=1}^d b_i \, 10^{-i} \right) \times 10^n = \pm (b_0.b_1b_2\cdots b_d) \times 10^n,
$$

where each $b_i$ is in $\{0,1,\ldots,9\}$ and $b_0\neq 0$. This is simply scientific notation with $d+1$ significant digits.
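The rounding bound {eq}`fpbound` can be spot-checked numerically. The sketch below is an illustration, not one of this section's demos; it assumes Julia, uses the built-in `BigFloat` type as a stand-in for exact real arithmetic, and picks the test value $1/10$ and the working precision arbitrarily.

```julia
# Round an exact value to double precision and measure the relative error,
# using BigFloat (with generous extra precision) as a stand-in for exact reals.
setprecision(256)
x = BigFloat(1) / BigFloat(10)        # 1/10 has no finite binary expansion
relerr = abs(BigFloat(Float64(x)) - x) / abs(x)
println(relerr ≤ big(eps(Float64)) / 2)   # the bound above: prints true
println(-log10(relerr))                   # roughly 16 decimal digits retained
```

The last line restates the rounding error as a count of accurate decimal digits, a viewpoint made precise later in this section.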
For example, Planck's constant is $6.626068\times 10^{-34}$ m${}^2\cdot$kg/sec to seven digits. If we alter just the last digit from $8$ to $9$, the relative change is

$$
\frac{0.000001\times 10^{-34}}{6.626068\times 10^{-34}} \approx 1.51\times 10^{-7}.
$$

We therefore say that the constant is given with 7 decimal digits of precision. That's in contrast to noting that the value is given to 40 decimal *places*. A major advantage of floating point is that the relative precision does not depend on the choice of physical units. For instance, when expressed in eV$\cdot$sec, Planck's constant is $4.135668\times 10^{-15}$, which still has 7 digits but only 21 decimal places.

Floating-point precision functions the same way, except that computers prefer base 2 to base 10. The **precision** of a floating-point number is always $d$ binary digits, implying a resolution of the real numbers according to {eq}`fpbound`.

```{index} ! accuracy (relative vs. absolute)
```

It can be easy to confuse precision with **accuracy**, especially when looking at the result of a calculation on the computer. Every result is computed and represented using $d$ binary digits, but not all of those digits may accurately represent an intended value. Suppose $x$ is a number of interest and $\tilde{x}$ is an approximation to it. The **absolute accuracy** of $\tilde{x}$ is

```{math}
|\tilde{x} - x|,
```

while the **relative accuracy** is

```{math}
\frac{|\tilde{x} - x|}{|x|}.
```

Absolute accuracy has the same units as $x$, while relative accuracy is dimensionless. We can also express the relative accuracy as the **number of accurate digits**, computed in base 10 as

```{math}
:label: sigdig
-\log_{10} \left| \frac{\tilde{x}-x}{x} \right|.
```

We often round this value down to an integer, but it does make sense to speak of "almost seven digits" or "ten and a half digits."

(demo-float-accuracy)=
```{proof:demo}
```

```{raw} html
```

```{raw} latex
%%start demo%%
```

Recall the grade-school approximation to the number $\pi$.

```{code-cell}
@show p = 22/7;
```

::::{panels}
:column: col-7 left-side
:card: border-0 shadow-none

```{raw} latex
\begin{minipage}[t]{0.5\textwidth}
```

Not all the digits displayed for `p` are the same as those of $\pi$.

```{raw} latex
\end{minipage}\hfill
```

---
:column: col-5 right-side
:card: shadow-none comment

```{raw} latex
\begin{minipage}[t]{0.4\textwidth}\begin{mdframed}[default]\small
```

The value of `pi` is predefined and equivalent to `π`, which is entered by typing `\pi` followed immediately by the Tab key.

```{raw} latex
\end{mdframed}\end{minipage}
```

::::

```{code-cell}
@show float(π);
```

```{index} ! Julia; string interpolation
```

::::{panels}
:column: col-7 left-side
:card: border-0 shadow-none

```{raw} latex
\begin{minipage}[t]{0.5\textwidth}
```

The absolute and relative accuracies of the approximation are as follows.

```{raw} latex
\end{minipage}\hfill
```

---
:column: col-5 right-side
:card: shadow-none comment

```{raw} latex
\begin{minipage}[t]{0.4\textwidth}\begin{mdframed}[default]\small
```

A dollar sign `$` in a string substitutes (or *interpolates*) the named variable or expression into the string.

```{raw} latex
\end{mdframed}\end{minipage}
```

::::

```{code-cell}
acc = abs(p-π)
println("absolute accuracy = $acc")
println("relative accuracy = $(acc/π)")
```

::::{panels}
:column: col-7 left-side
:card: border-0 shadow-none

```{raw} latex
\begin{minipage}[t]{0.5\textwidth}
```

Here we calculate the number of accurate digits in `p`.

```{raw} latex
\end{minipage}\hfill
```

---
:column: col-5 right-side
:card: shadow-none comment

```{raw} latex
\begin{minipage}[t]{0.4\textwidth}\begin{mdframed}[default]\small
```

The `log` function is for the natural log. For other common bases, use `log10` or `log2`.

```{raw} latex
\end{mdframed}\end{minipage}
```

::::

```{code-cell}
println("Number of accurate digits = $(-log10(acc/π))")
```

This last value could be rounded down by using `floor`.

```{raw} html
```

```{raw} latex
%%end demo%%
```

## Double precision

```{index} machine epsilon; in double precision, IEEE 754
```

Most numerical computing today is done in the **IEEE 754** standard. This defines **single precision** with $d=23$ binary digits for the fractional part $f$, and the more commonly used **double precision** with $d=52$. In double precision,

```{math}
:label: doubleprec
\macheps = 2^{-52} \approx 2.2\times 10^{-16}.
```

We often speak of double-precision floating-point numbers as having about 16 decimal digits. The 52-bit significand is paired with a sign bit and 11 binary bits to represent the exponent $n$ in {eq}`floatpoint`, for a total of 64 binary bits per floating-point number.

(demo-float-julia)=
```{proof:demo}
```

```{raw} html
```

```{raw} latex
%%start demo%%
```

In Julia, `1` and `1.0` are different values, because they have different types:

```{code-cell}
@show typeof(1);
@show typeof(1.0);
```

The standard choice for floating-point values is `Float64`, which is double precision using 64 binary bits. We can see all the bits by using `bitstring`.

```{code-cell}
bitstring(1.0)
```

::::{panels}
:column: col-7 left-side
:card: border-0 shadow-none

```{raw} latex
\begin{minipage}[t]{0.5\textwidth}
```

The first bit determines the sign of the number:

```{raw} latex
\end{minipage}\hfill
```

---
:column: col-5 right-side
:card: shadow-none comment

```{raw} latex
\begin{minipage}[t]{0.4\textwidth}\begin{mdframed}[default]\small
```

Square brackets concatenate the contained values into vectors.

```{raw} latex
\end{mdframed}\end{minipage}
```

::::

```{code-cell}
[bitstring(1.0), bitstring(-1.0)]
```

The next 11 bits determine the exponent (scaling) of the number, and so on.

```{code-cell}
[bitstring(1.0), bitstring(2.0)]
```

The sign bit, exponent, and significand in {eq}`floatpoint` are all directly accessible.

```{code-cell}
x = 1.0
@show sign(x),exponent(x),significand(x);
```

```{code-cell}
x = 0.125
@show sign(x),exponent(x),significand(x);
```

::::{panels}
:column: col-7 left-side
:card: border-0 shadow-none

```{raw} latex
\begin{minipage}[t]{0.5\textwidth}
```

The spacing between floating-point values in $[2^n,2^{n+1})$ is $2^n \epsilon_\text{mach}$, where $\epsilon_\text{mach}$ is machine epsilon. You can get its value from the `eps` function in Julia. By default, it returns the value for double precision.

```{raw} latex
\end{minipage}\hfill
```

---
:column: col-5 right-side
:card: shadow-none comment

```{raw} latex
\begin{minipage}[t]{0.4\textwidth}\begin{mdframed}[default]\small
```

In order to call a function such as `eps`, you must use parentheses notation, even when there are no input arguments.

```{raw} latex
\end{mdframed}\end{minipage}
```

::::

```{code-cell}
eps()
```

Because double precision allocates 52 bits to the significand, the default value of machine epsilon is $2^{-52}$.

```{code-cell}
log2(eps())
```

The spacing between adjacent floating-point values is proportional to the magnitude of the value itself. This is how relative precision is kept roughly constant throughout the range of values. You can get the adjusted spacing by calling `eps` with a value.

```{code-cell}
eps(1.618)
```

```{code-cell}
eps(161.8)
```

```{code-cell}
nextfloat(161.8)
```

```{code-cell}
ans - 161.8
```

A common mistake is to think that $\epsilon_\text{mach}$ is the smallest floating-point number. It's only the smallest *relative to 1*. The correct perspective is that the scaling of values is limited by the exponent, not the mantissa. The actual range of positive values in double precision is

```{code-cell}
@show floatmin(),floatmax();
```

For the most part you can mix integers and floating-point values and get what you expect.

```{code-cell}
1/7
```

```{code-cell}
37.3 + 1
```

```{code-cell}
2^(-4)
```

There are some exceptions. A floating-point value can't be used as an index into an array, for example, even if it is numerically equal to an integer. In such cases you use `Int` to convert it.

```{code-cell}
@show 5.0,Int(5.0);
```

If you try to convert a noninteger floating-point value into an integer, you get an `InexactError`. This occurs whenever you try to force a type conversion that doesn't make clear sense.

```{raw} html
```

```{raw} latex
%%end demo%%
```

Our theoretical description of $\float$ did not place limits on the exponent, but in double precision its range is limited to $-1022\le n \le 1023$. Thus, the largest number is just short of $2^{1024}\approx 2\times 10^{308}$, which is enough in most applications. Results that should be larger are said to *overflow* and will actually result in the value `Inf`. Similarly, the smallest positive number is $2^{-1022}\approx 2\times 10^{-308}$, and smaller values are said to *underflow* to zero.[^denormalized]

[^denormalized]: Actually, there are some still-smaller *denormalized* numbers that have less precision, but we won't use that level of detail.

Note the crucial difference between $\macheps=2^{-52}$, which is the distance between 1 and the next larger double-precision number, and $2^{-1022}$, which is the smallest positive double-precision number. The former has to do with relative precision, while the latter is about absolute precision. Getting close to zero always requires a shift in thinking to absolute precision, because any finite error is infinite relative to zero.

```{index} NaN
```

One more double-precision value is worthy of note: **NaN**, which stands for **Not a Number**. It is the result of an undefined arithmetic operation such as `0/0`.

## Floating-point arithmetic

```{index} floating-point numbers
```

Computer arithmetic is performed on floating-point numbers and returns floating-point results. We assume the existence of machine-analog operations for real functions such as $+$, $-$, $\times$, $/$, $\sqrt{\quad}$, and so on. Without getting into the details, we will suppose that each elementary machine operation creates a floating-point result whose relative error is bounded by $\macheps$. For example, if $x$ and $y$ are in $\float$, then for machine addition $\oplus$ we have the bound

```{math}
\frac{ |(x \oplus y)-(x+y)| }{ |x+y| } \le \macheps.
```
Hence the relative error in arithmetic is essentially the same as for the floating-point representation itself. However, playing by these rules can lead to disturbing results.

(demo-float-arithmetic)=
```{proof:demo}
```

```{raw} html
```

```{raw} latex
%%start demo%%
```

There is no double-precision number between $1$ and $1+\epsilon_\text{mach}$. Thus the following difference is zero despite its appearance.

```{code-cell}
e = eps()/2
(1.0 + e) - 1.0
```

However, the spacing between floats in $[1/2,1)$ is $\macheps/2$, so both $1-\macheps/2$ and its negative are represented exactly:

```{code-cell}
1.0 + (e - 1.0)
```

This is now the expected result. But we have found a rather shocking breakdown of the associative law of addition!

```{raw} html
```

```{raw} latex
%%end demo%%
```

There are two ways to look at {numref}`Demo %s <demo-float-arithmetic>`. On one hand, its two versions of the result differ by less than $1.2\times 10^{-16}$, which is very small — not just in everyday terms, but with respect to the operands, which are all close to 1 in absolute value. On the other hand, the difference is as large as the exact result itself! We formalize and generalize this observation in the next section. In the meantime, keep in mind that exactness cannot be taken for granted in floating-point computation.

::::{proof:observation}
We should not expect that two mathematically equivalent results will be equal when computed in floating point, only that they be relatively close together.
::::

## Exercises

```{note}
Exercises marked with ✍ are intended to be done by hand or with the aid of a simple calculator. Exercises marked with ⌨ are intended to be solved using a computer.
```

(problem-float-d4)=
1. ✍ Consider a floating-point set $\float$ defined by {eq}`floatpoint` and {eq}`mantissa` with $d=4$.

    **(a)** How many elements of $\float$ are there in the real interval $[1/2,4]$, including the endpoints?

    **(b)** What is the element of $\float$ closest to the real number $1/10$? (Hint: Find the interval $[2^n,2^{n+1})$ that contains $1/10$, then enumerate all the candidates in $\float$.)

    **(c)** What is the smallest positive integer not in $\float$? (Hint: For what value of the exponent does the spacing between floating-point numbers become larger than 1?)

(problem-fp-fprelative)=
2. ✍ Prove that {eq}`fpbound` is equivalent to {eq}`fpboundalt`. This means showing first that {eq}`fpbound` implies {eq}`fpboundalt`, and then separately that {eq}`fpboundalt` implies {eq}`fpbound`.

3. ⌨ There are much better rational approximations to $\pi$ than 22/7 as used in {numref}`Demo {number} <demo-float-accuracy>`. For each one below, find its absolute and relative accuracy, and (rounding down to an integer) the number of accurate digits.

    **(a)** 355/113 **(b)** 103638/32989

```{index} IEEE 754
```

4. ✍ IEEE 754 **single precision** specifies that 23 binary bits are used for the value $f$ in the significand $1+f$ in {eq}`mantissa`. Because they need less storage space and can be operated on more quickly than double-precision values, single-precision values can be useful in low-precision applications. (They are supported as type `Float32` in Julia.)

    **(a)** In base-10 terms, what is the first single-precision number greater than $1$ in this system?

    **(b)** What is the smallest positive integer that is not a single-precision number? (See the hint to Exercise 1.)

5. ⌨ Julia defines a function `nextfloat` that gives the next-larger floating-point value of a given number. What is the next float past `floatmax()`? What is the next float past `-Inf`?