The normal equations

3.2. The normal equations#

We now solve the general linear least-squares problem in Definition 3.1.3. That is, given \(\mathbf{A}\in\mathbb{R}^{m \times n}\) and \(\mathbf{b}\in\mathbb{R}^m\), with \(m>n\), find the \(\mathbf{x}\in\mathbb{R}^n\) that minimizes \(\| \mathbf{b} - \mathbf{A}\mathbf{x} \|_2\).

There is a concise explicit solution. In the following proof we make use of the elementary algebraic fact that for two vectors \(\mathbf{u}\) and \(\mathbf{v}\),

\[ (\mathbf{u}+\mathbf{v})^T(\mathbf{u}+\mathbf{v}) = \mathbf{u}^T\mathbf{u} + \mathbf{u}^T\mathbf{v} + \mathbf{v}^T\mathbf{u} + \mathbf{v}^T\mathbf{v} = \mathbf{u}^T\mathbf{u} + 2\mathbf{v}^T\mathbf{u} + \mathbf{v}^T\mathbf{v}.\]

Theorem 3.2.1

If \(\mathbf{x}\) satisfies \(\mathbf{A}^T(\mathbf{A}\mathbf{x}-\mathbf{b})=\boldsymbol{0}\), then \(\mathbf{x}\) solves the linear least-squares problem, i.e., \(\mathbf{x}\) minimizes \(\| \mathbf{b}-\mathbf{A}\mathbf{x} \|_2\).

Proof

Let \(\mathbf{y}\in \mathbb{R}^n\) be any vector. Then \(\mathbf{A}(\mathbf{x}+\mathbf{y})-\mathbf{b}=\mathbf{A}\mathbf{x}-\mathbf{b}+\mathbf{A}\mathbf{y}\), and

\[\begin{split}\begin{split} \| \mathbf{A}(\mathbf{x}+\mathbf{y})-\mathbf{b} \|_2^2 &= [(\mathbf{A}\mathbf{x}-\mathbf{b})+(\mathbf{A}\mathbf{y})]^T[(\mathbf{A}\mathbf{x}-\mathbf{b})+(\mathbf{A}\mathbf{y})]\\ &= (\mathbf{A}\mathbf{x}-\mathbf{b})^T(\mathbf{A}\mathbf{x}-\mathbf{b}) + 2(\mathbf{A}\mathbf{y})^T(\mathbf{A}\mathbf{x}-\mathbf{b}) + (\mathbf{A}\mathbf{y})^T(\mathbf{A}\mathbf{y})\\ &= \| \mathbf{A}\mathbf{x}-\mathbf{b} \|_2^2 + 2\mathbf{y}^T \mathbf{A}^T(\mathbf{A}\mathbf{x}-\mathbf{b}) + \| \mathbf{A}\mathbf{y} \|_2^2 \\ &= \| \mathbf{A}\mathbf{x}-\mathbf{b} \|_2^2 + \| \mathbf{A}\mathbf{y} \|_2^2 \\ & \ge \| \mathbf{A}\mathbf{x}-\mathbf{b} \|_2^2. \end{split}\end{split}\]

Definition 3.2.2 : Normal equations

Given \(\mathbf{A}\in \real^{m\times n}\) and \(\mathbf{b}\in \real^{m}\), the normal equations for the linear least-squares problem \(\operatorname{argmin}\| \mathbf{b}- \mathbf{A} \mathbf{x}\|\) are \(\mathbf{A}^T(\mathbf{A}\mathbf{x}-\mathbf{b})=\boldsymbol{0}\), or equivalently,

(3.2.1)#\[\mathbf{A}^T\mathbf{A}\mathbf{x}=\mathbf{A}^T\mathbf{b}.\]

The normal equations have a geometric interpretation, as shown in Fig. 3.2.1. The vector in the range (column space) of \(\mathbf{A}\) that lies closest to \(\mathbf{b}\) makes the vector difference \(\mathbf{A}\mathbf{x}-\mathbf{b}\) perpendicular to the range. Thus for any \(\mathbf{z}\), we must have \((\mathbf{A} \mathbf{z})^T(\mathbf{A}\mathbf{x}-\mathbf{b})=0\), which is satisfied if \(\mathbf{A}^T(\mathbf{A}\mathbf{x}-\mathbf{b})=\boldsymbol{0}\).

../_images/normaleqns2d.svg — Fig. 3.2.1 Geometry of the normal equations. The smallest residual is orthogonal to the range of the matrix \(\mathbf{A}\).#

Pseudoinverse and definiteness#

If we associate the left-hand side of the normal equations as \((\mathbf{A}^T\mathbf{A})\,\mathbf{x}\), we recognize (3.2.1) as a square \(n\times n\) linear system to solve for \(\mathbf{x}\).

Definition 3.2.3 : Pseudoinverse

If \(\mathbf{A}\in\real^{m\times n}\) with \(m>n\), its pseudoinverse is the \(n\times m\) matrix

(3.2.2)#\[\mathbf{A}^+ = (\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T.\]

Mathematically, the overdetermined least-squares problem \(\mathbf{A}\mathbf{x}\approx \mathbf{b}\) has the solution \(\mathbf{x}=\mathbf{A}^+\mathbf{b}\).

Computationally we can generalize the observation about Julia from Chapter 2: backslash is equivalent mathematically to left-multiplication by the inverse (in the square case) or pseudoinverse (in the rectangular case) of a matrix. One can also compute the pseudoinverse directly using pinv, but as with matrix inverses, this is rarely necessary in practice.

The matrix \(\mathbf{A}^T\mathbf{A}\) appearing in the pseudoinverse has some important properties.

Theorem 3.2.4

For any real \(m\times n\) matrix \(\mathbf{A}\) with \(m\ge n\), the following are true:

\(\mathbf{A}^T\mathbf{A}\) is symmetric.
\(\mathbf{A}^T \mathbf{A}\) is singular if and only if the columns of \(\mathbf{A}\) are linearly dependent. (Equivalently, if and only if the rank of \(\mathbf{A}\) is less than \(n\).)
If \(\mathbf{A}^T\mathbf{A}\) is nonsingular, then it is positive definite.

Proof

The first part is left as Exercise 3. For the second part, suppose that \(\mathbf{A}^T\mathbf{A}\mathbf{z}=\boldsymbol{0}\). Note that \(\mathbf{A}^T\mathbf{A}\) is singular if and only if \(\mathbf{z}\) may be nonzero. Left-multiplying by \(\mathbf{z}^T\), we find that

\[0 = \mathbf{z}^T\mathbf{A}^T\mathbf{A}\mathbf{z}=(\mathbf{A}\mathbf{z})^T(\mathbf{A}\mathbf{z}) = \| \mathbf{A}\mathbf{z} \|_2^2,\]

which is equivalent to \(\mathbf{A}\mathbf{z}=\boldsymbol{0}\). Then \(\mathbf{z}\) may be nonzero if and only if the columns of \(\mathbf{A}\) are linearly dependent.

Finally, we can repeat the manipulations above to show that for any nonzero \(n\)-vector \(\mathbf{v}\), \(\mathbf{v}^T(\mathbf{A}^T\mathbf{A})\mathbf{v}=\| \mathbf{A}\mathbf{v} \|_2^2\ge 0\), and equality is not possible thanks to the second part of the theorem.

Implementation#

The definition of the pseudoinverse involves taking the inverse of a matrix, so it is not advisable to use the pseudoinverse computationally. Instead, we use the definition of the normal equations to set up a linear system, which we already know how to solve. In summary, the steps for solving the linear least squares problem \(\mathbf{A}\mathbf{x}\approx\mathbf{b}\) are:

Algorithm 3.2.5 : Solution of linear least squares by the normal equations

Compute \(\mathbf{N}=\mathbf{A}^T\mathbf{A}\).
Compute \(\mathbf{z} = \mathbf{A}^T\mathbf{b}\).
Solve the \(n\times n\) linear system \(\mathbf{N}\mathbf{x} = \mathbf{z}\) for \(\mathbf{x}\).

In the last step we can exploit the fact, proved in Theorem 3.2.4, that \(\mathbf{N}\) is symmetric and positive definite, and use Cholesky factorization as in Section 2.9. This detail is included in Function 3.2.6.

Function 3.2.6 : lsnormal

Linear least-squares solution by the normal equations

"""
    lsnormal(A,b)

Solve a linear least-squares problem by the normal equations.
Returns the minimizer of ||b-Ax||.
"""
function lsnormal(A,b)
    N = A'*A;  z = A'*b;
    R = cholesky(N).U
    w = forwardsub(R',z)                   # solve R'z=c
    x = backsub(R,w)                       # solve Rx=z
    return x
end

About the code

The syntax on line 9 is a field reference to extract the matrix we want from the structure returned by cholesky.

Steps 1 and 3 of Algorithm 3.2.5 dominate the flop count.

Theorem 3.2.7

Solution of linear least squares by the normal equations takes \(\sim (mn^2 + \frac{1}{3}n^3)\) flops.

Conditioning and stability#

We have already used A\b as the native way to solve the linear least-squares problem \(\mathbf{A}\mathbf{x}\approx\mathbf{b}\) in Julia. The algorithm employed by the backslash does not proceed through the normal equations, because of instability.

The conditioning of the linear least-squares problem relates changes in the solution \(\mathbf{x}\) to those in the data, \(\mathbf{A}\) and \(\mathbf{b}\). A full accounting of the condition number is too messy to present here, but we can get the main idea. We start by generalizing our previous definition of the matrix condition number.

Definition 3.2.8 : Matrix condition number (rectangular case)

If \(\mathbf{A}\) is \(m\times n\) with \(m > n\), then its condition number is defined to be

(3.2.3)#\[\kappa(\mathbf{A}) = \|\mathbf{A}\|_2 \cdot \|\mathbf{A}^{+}\|_2.\]

If the rank of \( \mathbf{A}\) is less than \(n\) (i.e., if it has linearly dependent columns), then \(\kappa(\mathbf{A})=\infty\).

Provided that the minimum residual norm \(\|\mathbf{b}-\mathbf{A}\mathbf{x}\|\) is relatively small, the conditioning of the linear least-squares problem is close to \(\kappa(\mathbf{A})\).

As an algorithm, the normal equations begin by computing the data for the \(n\times n\) system \((\mathbf{A}^T\mathbf{A})\mathbf{x} = \mathbf{A}^T \mathbf{b}\). When these equations are solved, perturbations to the data can be amplified by a factor \(\kappa(\mathbf{A}^T\mathbf{A})\).

The following can be proved using results in Chapter 7.

Theorem 3.2.9 : Condition number in the normal equations

If \(\mathbf{A}\) is \(m\times n\) with \(m > n\), then

(3.2.4)#\[\kappa(\mathbf{A}^T\mathbf{A}) = \kappa(\mathbf{A})^2.\]

This squaring of the condition number in the normal equations is the cause of instability. If \(\kappa(\mathbf{A})\) is large, the squaring of it can destabilize the normal equations: while the solution of the least-squares problem is sensitive, finding it via the normal equations makes it doubly so.

Demo 3.2.10

Because the functions \(\sin^2(t)\), \(\cos^2(t)\), and \(1\) are linearly dependent, we should find that the following matrix is somewhat ill-conditioned.

The local variable scoping rule for loops applies to comprehensions as well.

t = range(0,3,length=400)
f = [ x->sin(x)^2, x->cos((1+1e-7)*x)^2, x->1. ]
A = [ f(t) for t in t, f in f ]
κ = cond(A)

1.8253225426741675e7

Now we set up an artificial linear least-squares problem with a known exact solution that actually makes the residual zero.

x = [1.,2,1]
b = A*x;

Using backslash to find the least-squares solution, we get a relative error that is well below \(\kappa\) times machine epsilon.

x_BS = A\b
@show observed_error = norm(x_BS-x)/norm(x);
@show error_bound = κ*eps();

observed_error = norm(x_BS - x) / norm(x) = 1.0163949045357309e-10
error_bound = κ * eps() = 4.053030228488391e-9

If we formulate and solve via the normal equations, we get a much larger relative error. With \(\kappa^2\approx 10^{14}\), we may not be left with more than about 2 accurate digits.

N = A'*A
x_NE = N\(A'*b)
@show observed_err = norm(x_NE-x)/norm(x);
@show digits = -log10(observed_err);

observed_err = norm(x_NE - x) / norm(x) = 0.021745909192780664
digits = -(log10(observed_err)) = 1.6626224298403076

Exercises#

✍ Work out the least-squares solution when

\[\begin{split}\mathbf{A} = \begin{bmatrix} 2 & -1 \\ 0 & 1 \\ -2 & 2 \end{bmatrix}, \qquad \mathbf{b} = \begin{bmatrix} 1\\-5\\6 \end{bmatrix}.\end{split}\]
✍ Use (3.2.2) to find the pseudoinverse \(\mathbf{A}^+\) of the matrix \(\mathbf{A}=\begin{bmatrix}1&-2&3\end{bmatrix}^T\).
✍ Prove the first statement of Theorem 3.2.4: \(\mathbf{A}^T\mathbf{A}\) is symmetric for any \(m\times n\) matrix \(\mathbf{A}\) with \(m \ge n\).
✍ Prove that if \(\mathbf{A}\) is an invertible square matrix, then \(\mathbf{A}^+=\mathbf{A}^{-1}\).
(a) ✍ Show that for any \(m\times n\) \(\mathbf{A}\) with \(m>n\) for which \(\mathbf{A}^T\mathbf{A}\) is nonsingular, \(\mathbf{A}^+\mathbf{A}\) is the \(n\times n\) identity.

(b) ⌨ Show using an example in Julia that \(\mathbf{A}\mathbf{A}^+\) is not an identity matrix. (This matrix has rank no greater than \(n\), so it can’t be an \(m\times m\) identity.)
✍ Prove that the vector \(\mathbf{A}\mathbf{A}^+\mathbf{b}\) is the vector in the column space (i.e., range) of \(\mathbf{A}\) that is closest to \(\mathbf{b}\) in the sense of the 2-norm.
✍ Show that the flop count for Function 3.2.6 is asymptotically \(\sim 2m n^2 + \tfrac{1}{3}n^3\). (In finding the asymptotic count you can ignore terms like \(m n\) whose total degree is less than 3.)
⌨ Let \(t_1,\ldots,t_m\) be \(m\) equally spaced points in \([0,2\pi]\). In this exercise, use \(m=500\).

(a) Let \(\mathbf{A}_\beta\) be the matrix in (3.1.2) that corresponds to fitting data with the function \(c_1 + c_2 \sin(t) + c_3 \cos(\beta t)\). Using the identity (3.2.4), make a table of the condition numbers of \(\mathbf{A}_\beta\) for \(\beta = 2,1.1,1.01,\ldots,1+10^{-8}\).

(b) Repeat part (a) using the fitting function \(c_1 + c_2 \sin^2(t) + c_3 \cos^2(\beta t).\)

(c) Why does it make sense that \(\kappa\bigl(\mathbf{A}_\beta\bigr)\to \infty\) as \(\beta\to 1\) in part (b) but not in part (a)?
✍ ⌨ When \(\mathbf{A}\) is \(m\times n\) with rank less than \(n\), the pseudoinverse is still defined and can be computed using pinv from LinearAlgebra. However, the behavior in this case is not always intuitive. Let

\[\begin{split}\mathbf{A}_s = \begin{bmatrix} 1 & 1 \\ 0 & 0 \\ 0 & s \end{bmatrix}.\end{split}\]

Then \(\mathbf{A}_0\) has rank equal to 1. Demonstrate experimentally that \(\mathbf{A}_0^+\neq \lim_{s\to 0} \mathbf{A}_s^+\).